We wish to sort the rows of a data frame by a transformation of the values of one or more columns instead of by the values themselves. A common example is we want to sort by the numeric part of a string column.
While we can create a new column by transforming existing columns, there are situations were we don’t wish to keep an intermediate value column that we only need for sorting.
We will cover the following custom sorting scenarios:
We wish to sort the rows of a data frame by a transformation of one column. We will cover two common situations:
Case Insensitive
We wish to sort by a string column in a case-insensitive manner i.e. for the purpose of sorting, we wish to treat upper and lower case letters as the same. With string columns, sorting is case-sensitive by default, meaning upper case text will appear first when sorting in ascending order and last when sorting in descending order.
In this example, we wish to sort the data frame df
in ascending order of a string column with the name ‘col_1'
in a case-insensitive manner.
df.sort_values(
by='col_1',
key=lambda x: x.str.lower(),
)
Here is how this works:
key
argument of sort_values()
.key
argument of sort_values()
should be vectorized i.e. it accepts a vector (a column) and returns a vector of the same length.Series
method str.lower()
(inside a lambda
function) to covert all characters to lower case. For a coverage of working with strings in Pandas see String Operations.col_1
is created, used for sorting, then discarded. The original values of col_1
stay as is.Numeric String
We wish to sort by the numerical component of a string column. In addition, we wish to order the numeric characters in a numeric order not a character order. For example, the ascending order should be ‘9’
then ‘10’
and not ‘10’
then ‘9’
.
In this example, we wish to sort the data frame df
in ascending order of the numerical substring of a string column with the name ‘col_1'
in numeric order (not character order). The values of ‘col_1'
are of the format order_<val>
where val
can be any integer e.g. ‘order_25'
.
df.sort_values(
by='col_1',
key=lambda x: x.str.replace('order_', '').astype(int),
)
Here is how this works:
col_1
consists of values of the format order_<val>
where val
can be any integer e.g. ‘order_25'
. We use str.replace()
to remove the character prefix ‘order_’
from every value in the column col_1
. This leaves the numeric part e.g. ‘order_25’
becomes ‘25’
. For a coverage of working with strings in Pandas see String Operations.astype(int)
to convert the data type of the remaining numeric string to a numeric data type.key
argument of sort_values()
.col_1
is created, used for sorting, then discarded. The original values of col_1
stay as is.We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.
In this example, we wish to sort the rows of a data frame df
by the lower case transformation of columns col_1
and col_2
.
df.sort_values(
by=['col_1', 'col_2'],
key=lambda x: x.str.lower(),
)
Here is how this works:
key
argument of sort_values()
is applied to each column in by
independently.col_1
and col_2
is transformed individually and the data frame df
is arranged in ascending order of the lower case transformation of col_1
then rows which have the same lower case value for col_1
are sorted in ascending order of the lower case transformation of col_2
.We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.
In this example, we wish to sort the rows of a data frame df
in ascending order of the lower case transformation of the column col_1
and the suffix after the underscore in col_2
.
def sort_func(x):
if x.name == 'col_1':
return x.str.lower()
elif x.name == 'col_2':
return x.str.replace('.*_', '', regex=True)
else:
return x
df.sort_values(
by=['col_1', 'col_2'],
key=sort_func
)
Here is how this works:
key
argument of sort_values()
accepts only one function. Therefore, to apply different transformation functions to different sorting columns, we need to do a bit of extra work.sort_func()
function checks for column name and applies the corresponding transformation function like so:‘col_1’
it converts the column to lower case‘col_2’
it removes the prefix that is before the underscore.We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.
In this example, we wish to sort the rows of a data frame df
by the ratio between two columns col_1
and col_2
.
(df
.assign(sort_col=lambda x: x['col_1'] / x['col_7'])
.sort_values('sort_col')
.drop('sort_col', axis=1)
)
Here is how this works:
key
argument of sort_values()
is applied to each column individually. Therefore, if we wish to apply a transformation that involves multiple columns, we can’t use sort_values()
to carry out that multivariate transformation.sort_col
) and then sort by it via sort_values()
.drop()
to remove it from the data frame df
. See Exclude Columns for a coverage of excluding data frame columns.We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.
In this example, we wish to sort the rows of the data frame df
by the index of where a given regular expression appears first in the corresponding value of the column col_1
.
import re
df_2 = df \
.sort_values(by='col_1',
key=lambda col: col.apply(lambda x: re.search(r"\bis\b", x).start()))
Here is how this works:
map_dbl()
because the output we expect has a double data type. See Working with Lists.col_1
, map_dbl()
will execute the anonymous function ~ abs(mean(rnorm(., 0, 10)))
, which breaks down as follows:abs()
returns the absolute value. See Numerical Operations.mean()
returns the mean of a list of values. See Summary Statistics.rnorm(n, mean, sd)
produces n
values from a normal distribution with a mean of mean
and a standard deviation of sd
. See DistributionsAlternative: Via a Function
import re
def locate_regex(p_col, p_regex):
regex_loc = p_col.apply(lambda x: re.search(p_regex, x).start())
return regex_loc
df_2 = df \
.sort_values(by='col_1',
key=lambda col: locate_regex(col, r"\bis\b"))
Here is how this works:
This works similarly to the primary solution above except that we isolate the inner non-vectorized transformation into a separate function. This may offer better readability in some situations.
When we sort on a column with missing data, the rows with the missing values for the sort column will appear at the end. This happens regardless of whether we are sorting in ascending or descending order. In this section we cover how to change this behavior and show rows with missing values at the top (instead of at the bottom).
df.sort_values(by='col_1', na_position='first')
Here is how this works:
sort_values()
function has an na_position
argument that takes the value na_position=’last’
by default.na_position=’first’
.