We wish to sort an ordinal categorical variable in its natural, domain-specific order rather than in alphanumeric order. An ordinal categorical variable is one whose values have a natural order, e.g. responses in a survey.
In this example, we have a data frame df that has a column col_1, which is a categorical variable holding t-shirt sizes XS, S, M, L, XL. We wish to sort the rows of the data frame df in the natural order of the t-shirt sizes, i.e. XS < S < M < L < XL.
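To make the problem concrete, here is a minimal sketch of what a plain sort does to such a column. The data frame below is hypothetical (the original df is not shown), and col_2 is an invented extra column added only for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real df is not shown in the text.
df = pd.DataFrame({
    'col_1': ['M', 'XL', 'XS', 'L', 'S'],
    'col_2': [10, 20, 30, 40, 50],  # illustrative extra column
})

# Sorting a plain string column compares values character by character,
# so the result is alphabetical rather than ordered by size.
alphabetical = df.sort_values('col_1')['col_1'].tolist()
print(alphabetical)  # ['L', 'M', 'S', 'XL', 'XS']
```

The alphabetical result puts L first and XS last, which is not the order XS < S < M < L < XL that we want.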
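import pandas as pd

# The categories listed in their natural order, smallest first.
sizes = ['XS', 'S', 'M', 'L', 'XL']
# Convert col_1 to an ordered categorical, then sort by it.
df_2 = df\
    .assign(col_1=pd.Categorical(df['col_1'],
                                 categories=sizes,
                                 ordered=True))\
    .sort_values('col_1')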
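To try the snippet above end to end, the following sketch runs it on a small hypothetical data frame (the sample values are an assumption, not taken from the original data) and checks that the column really is an ordered categorical afterwards.

```python
import pandas as pd

# Hypothetical sample data standing in for the real df.
df = pd.DataFrame({'col_1': ['M', 'XL', 'XS', 'L', 'S']})
sizes = ['XS', 'S', 'M', 'L', 'XL']

df_2 = df.assign(
    col_1=pd.Categorical(df['col_1'], categories=sizes, ordered=True)
).sort_values('col_1')

result = df_2['col_1'].tolist()
is_ordered = df_2['col_1'].cat.ordered
print(result)      # ['XS', 'S', 'M', 'L', 'XL']
print(is_ordered)  # True
```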
Here is how this works:
Once col_1 is an ordered Categorical with the levels correctly defined, we can simply apply sort_values() and the data frame will be sorted according to the defined order of the ordered categorical variable.
To create the ordered categorical, we use the pandas Categorical() function (here referred to as pd.Categorical()), which takes the original column (here col_1), the levels or categories (here the list sizes), and a parameter ordered that determines whether the factor is ordered (ordinal), if we set ordered=True, or unordered (nominal), if we set ordered=False (here we set ordered=True). See Factor Operations for more details.
Alternative: Value Mapping
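import numpy as np

def sort_func(x):
    # One boolean condition per known size, listed in the desired order.
    conditions = [
        x.eq('XS'),
        x.eq('S'),
        x.eq('M'),
        x.eq('L'),
        x.eq('XL')]
    # Integer rank for each condition: 0 for 'XS' through 4 for 'XL'.
    choices = list(range(0, 5))
    # Any value matching no condition gets rank 5 and so sorts last.
    y = np.select(conditions, choices, default=5)
    return y

df_2 = df.sort_values(
    by='col_1',
    key=sort_func
)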
Here is how this works:
We use the key parameter of sort_values() to define how each possible value of the sorting column is to be sorted relative to the other values.
To build that ordering, we use the select() function from NumPy, which takes three arguments: conditions applied to an input Series; choices, which here is the integer order we wish to assign to each possible value of the input Series (0 for 'XS' through to 4 for 'XL'); and a default choice, which we set to the integer 5 so that any unknown "size" (or typo) is sorted at the end. The input Series here is the argument x of the function sort_func(), to which sort_values() passes col_1. col_1 holds the t-shirt sizes.
select() outputs a NumPy array (which here is y) of the same length as the input Series, where the values are the choices that correspond to the matching conditions applied to the input. See General Operations for a coverage of conditional statements.
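The effect of the default choice is easiest to see with a value that is not a known size. The sketch below reuses sort_func() from above on a hypothetical data frame that contains a deliberate typo ('oops'); the sample values are assumptions made for illustration. It assumes a pandas version that supports the key parameter of sort_values() (1.1 or later).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including one value that is not a known size.
df = pd.DataFrame({'col_1': ['M', 'XL', 'oops', 'XS', 'L', 'S']})

def sort_func(x):
    conditions = [x.eq('XS'), x.eq('S'), x.eq('M'), x.eq('L'), x.eq('XL')]
    choices = list(range(0, 5))
    # Values matching no condition receive the default rank 5.
    return np.select(conditions, choices, default=5)

# Integer ranks, one per row, in the original row order.
ranks = list(sort_func(df['col_1']))
print(ranks)  # [2, 4, 5, 0, 3, 1]

df_2 = df.sort_values(by='col_1', key=sort_func)
ordered = df_2['col_1'].tolist()
print(ordered)  # ['XS', 'S', 'M', 'L', 'XL', 'oops']
```

The unknown value keeps its original text and simply lands at the end, because only the ranks returned by the key function are used for ordering; the column itself is left unchanged.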