Custom Sorting

We wish to sort the rows of a data frame by a transformation of the values of one or more columns instead of by the values themselves. A common example is we want to sort by the numeric part of a string column.

While we can create a new column by transforming existing columns, there are situations were we don’t wish to keep an intermediate value column that we only need for sorting.

We will cover the following custom sorting scenarios:

  • Single Column: We wish to sort the rows of a data frame by a transformation of one column. We will cover two common situations: (a) We wish to sort a string column in a case-insensitive manner and (b) We wish to sort by the numeric part of a string column.
  • Multiple Columns: We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.
  • Multiple Transformations: We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.
  • Multivariate Transformation: We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.
  • Non-Vectorized Transformation:We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.
  • Missing Values: Rows where the sorting columns have missing values are sorted at the end by default. In some situations, we may want to sort them differently e.g. at the beginning.

Single Column

We wish to sort the rows of a data frame by a transformation of one column. We will cover two common situations:

  • Case Insensitive: We wish to sort a string column in a case-insensitive manner.
  • Numeric String: We wish to sort by the numeric part of a string column.

Case Insensitive

We wish to sort by a string column in a case-insensitive manner i.e. for the purpose of sorting, we wish to treat upper and lower case letters as the same. With string columns, sorting is case-sensitive by default, meaning upper case text will appear first when sorting in ascending order and last when sorting in descending order.

In this example, we wish to sort the data frame df in ascending order of a string column with the name ‘col_1' in a case-insensitive manner.

df.sort_values(
    by='col_1',
    key=lambda x: x.str.lower(),
)

Here is how this works:

  • To sort by the transformed values of a column, we pass the transformation function that we wish to carry out to the key argument of sort_values().
  • The transformation function we pass to the key argument of sort_values() should be vectorized i.e. it accepts a vector (a column) and returns a vector of the same length.
  • We use the Series method str.lower() (inside a lambda function) to covert all characters to lower case. For a coverage of working with strings in Pandas see String Operations.
  • An all lower case version of col_1 is created, used for sorting, then discarded. The original values of col_1 stay as is.

Numeric String

We wish to sort by the numerical component of a string column. In addition, we wish to order the numeric characters in a numeric order not a character order. For example, the ascending order should be ‘9’ then ‘10’ and not ‘10’ then ‘9’.

In this example, we wish to sort the data frame df in ascending order of the numerical substring of a string column with the name ‘col_1' in numeric order (not character order). The values of ‘col_1' are of the format order_<val> where val can be any integer e.g. ‘order_25'.

df.sort_values(
    by='col_1',
    key=lambda x: x.str.replace('order_', '').astype(int),
)

Here is how this works:

  • In order to sort by the numeric part of a string, we extract the numeric part then convert its data type (type cast it) to a numeric type then use that resulting numeric variable to sort the rows of the data frame.
  • As noted above, column col_1 consists of values of the format order_<val> where val can be any integer e.g. ‘order_25'. We use str.replace() to remove the character prefix ‘order_’ from every value in the column col_1. This leaves the numeric part e.g. ‘order_25’ becomes ‘25’. For a coverage of working with strings in Pandas see String Operations.
  • We then use astype(int) to convert the data type of the remaining numeric string to a numeric data type.
  • As described in the Case Insensitive sorting example above, we pass the transformation function that we wish to apply to our sorting column to the key argument of sort_values().
  • A transformed version of col_1 is created, used for sorting, then discarded. The original values of col_1 stay as is.

Multiple Columns

We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.

In this example, we wish to sort the rows of a data frame df by the lower case transformation of columns col_1 and col_2.

df.sort_values(
    by=['col_1', 'col_2'],
    key=lambda x: x.str.lower(),
)

Here is how this works:

  • This works similarly to the Case Insensitive sorting example described above.
  • The vectorized transformation function passed to the key argument of sort_values() is applied to each column in by independently.
  • Each of the columns col_1 and col_2 is transformed individually and the data frame df is arranged in ascending order of the lower case transformation of col_1 then rows which have the same lower case value for col_1 are sorted in ascending order of the lower case transformation of col_2.

Multiple Transformations

We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.

In this example, we wish to sort the rows of a data frame df in ascending order of the lower case transformation of the column col_1 and the suffix after the underscore in col_2.

def sort_func(x):
    if x.name == 'col_1':
        return x.str.lower()
    elif x.name == 'col_2':
        return x.str.replace('.*_', '', regex=True)
    else:
        return x

df.sort_values(
   by=['col_1', 'col_2'],
   key=sort_func
)

Here is how this works:

  • The key argument of sort_values() accepts only one function. Therefore, to apply different transformation functions to different sorting columns, we need to do a bit of extra work.
  • The columns passed to the key function are passed with their name. Therefore, we can check for column name inside the function then apply a different transformation for different columns.
  • In this example the sort_func() function checks for column name and applies the corresponding transformation function like so:
    • If the column name is ‘col_1’ it converts the column to lower case
    • If the column name is ‘col_2’ it removes the prefix that is before the underscore.
    • If the column name is anything else, it is returned as is.

Multivariate Transformation

We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.

In this example, we wish to sort the rows of a data frame df by the ratio between two columns col_1 and col_2.

(df
 .assign(sort_col=lambda x: x['col_1'] / x['col_7'])
 .sort_values('sort_col')
 .drop('sort_col', axis=1)
 )

Here is how this works:

  • A function passed to the key argument of sort_values() is applied to each column individually. Therefore, if we wish to apply a transformation that involves multiple columns, we can’t use sort_values() to carry out that multivariate transformation.
  • The way around is to use assign to create the sort column (which here is called sort_col) and then sort by it via sort_values().
  • If we do not need to keep the column we just created for sorting, we use drop() to remove it from the data frame df. See Exclude Columns for a coverage of excluding data frame columns.
  • On a related note, it is good practice, to ensure that the code is producing its intended outcome, to compute and look at the sorting column during development even if we do not wish to keep it in the final state.

Non-Vectorized Transformation

We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.

In this example, we wish to sort the rows of the data frame df by the index of where a given regular expression appears first in the corresponding value of the column col_1.

import re

df_2 = df \
    .sort_values(by='col_1',
                 key=lambda col: col.apply(lambda x: re.search(r"\bis\b", x).start()))

Here is how this works:

  • In order to perform a non-vectorized transformation and use the output for sorting, we need to use nested lambda functions. The outer function receives the column to be sorted by as a Series while the inner function iterates over the values of a column and execute the non-vectorized transformation once for each row. The output is then used for sorting.
  • In this example, we use map_dbl() because the output we expect has a double data type. See Working with Lists.
  • For each value of the column col_1, map_dbl() will execute the anonymous function ~ abs(mean(rnorm(., 0, 10))), which breaks down as follows:
  • abs() returns the absolute value. See Numerical Operations.
  • mean() returns the mean of a list of values. See Summary Statistics.
  • rnorm(n, mean, sd) produces n values from a normal distribution with a mean of mean and a standard deviation of sd. See Distributions

Alternative: Via a Function

import re

def locate_regex(p_col, p_regex):
    regex_loc = p_col.apply(lambda x: re.search(p_regex, x).start())
    return regex_loc

df_2 = df \
    .sort_values(by='col_1',
                 key=lambda col: locate_regex(col, r"\bis\b"))

Here is how this works:

This works similarly to the primary solution above except that we isolate the inner non-vectorized transformation into a separate function. This may offer better readability in some situations.

Missing Values

When we sort on a column with missing data, the rows with the missing values for the sort column will appear at the end. This happens regardless of whether we are sorting in ascending or descending order. In this section we cover how to change this behavior and show rows with missing values at the top (instead of at the bottom).

df.sort_values(by='col_1', na_position='first')

Here is how this works:

  • The sort_values() function has an na_position argument that takes the value na_position=’last’ by default.
  • To have missing values sorted at the beginning (the top of the resulting data frame), we set na_position=’first’.
PYTHON
I/O