Sorting by Implicitly Selected Columns

We wish to identify the columns to use for sorting, not by explicitly spelling out their names but by specifying criteria satisfied by the desired columns.

In this example we wish to sort the rows of the data frame df by all the columns whose names start with the string prefix cvr_. Note: We are assuming that the order of the columns is appropriate for the task at hand.

def m_sort_values(df, select_fn):
    selected_cols = select_fn(df).to_list()
    return df.sort_values(by=selected_cols)

df_2 = df \
    .pipe(m_sort_values,
          lambda x: x.columns[x.columns.str.startswith('cvr_')])

Here is how this works:

  • sort_values() can’t take a callable for the by argument. We work around that with a custom function (which we call m_sort_values() here) that takes a data frame and a column selection function, applies the selection, and then sorts the data frame by the selected columns.
  • We use pipe() to pass the data frame (which here is df) and the column selection lambda function to our custom sorting function m_sort_values().
  • We use str.startswith() to select all columns whose names start with the prefix ‘cvr_’. See Implicit Selection for coverage of the most common scenarios of implicit column selection, including by name pattern, by data type, and by criteria satisfied by the column’s data.
  • We use to_list() to convert the Index returned by the columns attribute of a data frame to a list that can be passed to sort_values().
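
To make the steps above concrete, here is a minimal, self-contained sketch. The data frame and its column names (id, cvr_a, cvr_b, other) are made up for illustration and are not part of the original example; only m_sort_values() and the selection lambda come from the text.

```python
import pandas as pd

def m_sort_values(df, select_fn):
    # Apply the selection function, then sort by the selected column names.
    selected_cols = select_fn(df).to_list()
    return df.sort_values(by=selected_cols)

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'cvr_a': [2, 1, 2, 1],
    'cvr_b': [0.5, 0.9, 0.1, 0.3],
    'other': ['w', 'x', 'y', 'z'],
})

df_2 = df.pipe(
    m_sort_values,
    lambda x: x.columns[x.columns.str.startswith('cvr_')])

# Rows are ordered by cvr_a first, then by cvr_b within ties.
print(df_2['id'].to_list())  # [4, 2, 3, 1]
```

The column other is ignored by the sort because its name does not start with ‘cvr_’.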

Alternatively,

selected_cols = df.columns[df.columns.str.startswith('cvr_')].to_list()
df_2 = df.sort_values(by=selected_cols)

Here is how this works:

  • The column selection logic is the same as above.
  • If the sort does not need to be part of a method chain, we can use this simpler solution.
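
Both solutions sort by the selected columns in the order in which they appear in the data frame, which is why the note earlier assumes that order is appropriate. The sketch below, on made-up data (the column names id, cvr_a, and cvr_b are assumptions), shows how the order of the names passed to by changes the result; reversing the list is just one possible way to change the sort priority.

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'cvr_a': [2, 1, 2, 1],
    'cvr_b': [0.5, 0.9, 0.1, 0.3],
})

# Selection follows the column order of the data frame: ['cvr_a', 'cvr_b'].
selected_cols = df.columns[df.columns.str.startswith('cvr_')].to_list()

# Sort by cvr_a first, then cvr_b.
by_frame_order = df.sort_values(by=selected_cols)

# Sort by cvr_b first, then cvr_a, by reversing the list of names.
by_reversed_order = df.sort_values(by=selected_cols[::-1])

print(by_frame_order['id'].to_list())     # [4, 2, 3, 1]
print(by_reversed_order['id'].to_list())  # [3, 4, 1, 2]
```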