Categorical Sorting

We wish to sort an ordinal categorical variable in its domain-specific order rather than in alphanumeric order. An ordinal categorical variable is one whose values have a natural order, e.g. responses in a survey.

In this example, we have a data frame df with a column col_1, a categorical variable holding the t-shirt sizes XS, S, M, L, and XL. We wish to sort the rows of df in the natural order of the t-shirt sizes, i.e. XS < S < M < L < XL.

import pandas as pd

# The categories in their natural (domain) order
sizes = ['XS', 'S', 'M', 'L', 'XL']

df_2 = df\
    .assign(col_1=pd.Categorical(df['col_1'],
                                 categories=sizes,
                                 ordered=True))\
    .sort_values('col_1')

Here is how this works:

  • If the column we wish to sort by is already type cast to an ordered Categorical with the levels correctly defined, we can simply apply sort_values() and the data frame will be sorted according to the defined order of the ordered categorical variable.
  • To convert a string column to a Pandas Categorical data type we use pandas.Categorical() (here referred to as pd.Categorical()). It takes the original column (here col_1), the levels or categories (here the list sizes), and a parameter ordered that determines whether the factor is ordered (ordinal, ordered=True) or unordered (nominal, ordered=False). Here we set ordered=True. See Factor Operations for more details.
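
To make the difference concrete, here is a minimal, self-contained sketch that compares a plain string sort with the ordered categorical sort. The contents of df (including the extra column col_2) are made up for illustration, since the original data frame is not shown.

```python
import pandas as pd

# Hypothetical sample data; the contents of df are an assumption.
df = pd.DataFrame({
    'col_1': ['M', 'XL', 'XS', 'L', 'S'],
    'col_2': [10, 20, 30, 40, 50],
})

sizes = ['XS', 'S', 'M', 'L', 'XL']

# Sorting the raw strings gives alphanumeric order.
alphabetical = df.sort_values('col_1')['col_1'].tolist()

# Sorting an ordered Categorical follows the order given in `sizes`.
df_2 = df.assign(
    col_1=pd.Categorical(df['col_1'], categories=sizes, ordered=True)
).sort_values('col_1')
natural = df_2['col_1'].tolist()

print(alphabetical)  # ['L', 'M', 'S', 'XL', 'XS']
print(natural)       # ['XS', 'S', 'M', 'L', 'XL']
```

Note that the whole row moves with its size, so col_2 is reordered along with col_1.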

Alternative: Value Mapping

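import numpy as np

def sort_func(x):
    # x is the column being sorted, passed in as a Series
    conditions = [
        x.eq('XS'),
        x.eq('S'),
        x.eq('M'),
        x.eq('L'),
        x.eq('XL')]
    # Integer rank for each condition, in the same order
    choices = list(range(0, 5))
    # Values matching no condition get rank 5 and sort last
    y = np.select(conditions, choices, default=5)
    return y

df_2 = df.sort_values(
    by='col_1',
    key=sort_func
)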

Here is how this works:

  • In some situations, converting a variable to a categorical data type is not appropriate.
  • In those situations we can pass a function to the key parameter of sort_values() to define how each possible value of the sorting column is to be ordered relative to the other values.
  • We used the select() function from NumPy, which takes:
      • a set of conditions applied to an input Series. The input Series here is the argument x of the function sort_func(), to which we pass col_1. col_1 holds the t-shirt sizes.
      • a corresponding set of choices, which here is the integer order we wish to assign to each possible value of the input Series (0 for 'XS' through to 4 for 'XL').
      • a default choice, which we set to the integer 5 so that any unknown "size" (or typo) is sorted at the end.
  • select() returns a NumPy array (which here is y) of the same length as the input Series, where each value is the choice corresponding to the first matching condition. sort_values() then orders the rows by these values. See General Operations for coverage of conditional statements.
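
The steps above can be tried end to end with the following minimal sketch. The sample data is hypothetical and includes one value ('XXL') that is not among the known sizes, to show the effect of the default. Building the conditions from a list is a compact variation on the explicit version shown earlier, not a different technique.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'XXL' is deliberately not a known size.
df = pd.DataFrame({'col_1': ['L', 'XXL', 'XS', 'M', 'XL', 'S']})

def sort_func(x):
    sizes = ['XS', 'S', 'M', 'L', 'XL']
    # One boolean condition per known size, in the desired order
    conditions = [x.eq(size) for size in sizes]
    # Ranks 0..4 for the known sizes
    choices = list(range(len(sizes)))
    # Anything that matches no condition gets rank 5 and sorts last
    return np.select(conditions, choices, default=len(sizes))

# The integer sort keys computed for each row
keys = sort_func(df['col_1'])

df_2 = df.sort_values(by='col_1', key=sort_func)
result = df_2['col_1'].tolist()

print(keys.tolist())  # [3, 5, 0, 2, 4, 1]
print(result)         # ['XS', 'S', 'M', 'L', 'XL', 'XXL']
```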