We wish to sort the rows of a data frame by a transformation of the values of one or more columns instead of by the values themselves. A common example is we want to sort by the numeric part of a string column.

While we can create a new column by transforming existing columns, there are situations were we don’t wish to keep an intermediate value column that we only need for sorting.

We will cover the following custom sorting scenarios:

**Single Column**: We wish to sort the rows of a data frame by a transformation of one column. We will cover two common situations: (a) We wish to sort a string column in a case-insensitive manner and (b) We wish to sort by the numeric part of a string column.**Multiple Columns**: We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.**Multiple Transformations:**We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.**Multivariate Transformation**: We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.**Non-Vectorized Transformation**:We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.**Missing Values**: Rows where the sorting columns have missing values are sorted at the end by default. In some situations, we may want to sort them differently e.g. at the beginning.

We wish to sort the rows of a data frame by a transformation of one column. We will cover two common situations:

- Case Insensitive: We wish to sort a string column in a case-insensitive manner.
- Numeric String: We wish to sort by the numeric part of a string column.

**Case Insensitive**

We wish to sort by a string column in a **case-insensitive** manner i.e. for the purpose of sorting, we wish to treat upper and lower case letters as the same. With string columns, *sorting is case-sensitive by default*, meaning upper case text will appear first when sorting in ascending order and last when sorting in descending order.

In this example, we wish to sort the data frame `df`

in ascending order of a string column with the name `‘col_1'`

in a *case-insensitive* manner.

```
df.sort_values(
by='col_1',
key=lambda x: x.str.lower(),
)
```

Here is how this works:

- To sort by the transformed values of a column, we pass the transformation function that we wish to carry out to the
`key`

argument of`sort_values()`

. - The transformation function we pass to the
`key`

argument of`sort_values()`

should be*vectorized*i.e. it accepts a vector (a column) and returns a vector of the same length. - We use the
`Series`

method`str.lower()`

(inside a`lambda`

function) to covert all characters to lower case. For a coverage of working with strings in Pandas see String Operations. - An all lower case version of
`col_1`

is created, used for sorting, then discarded. The original values of`col_1`

stay as is.

**Numeric String**

We wish to sort by the numerical component of a string column. In addition, we wish to order the numeric characters in a numeric order not a character order. For example, the ascending order should be `‘9’`

then `‘10’`

and not `‘10’`

then `‘9’`

.

In this example, we wish to sort the data frame `df`

in ascending order of the numerical substring of a string column with the name `‘col_1'`

in numeric order (not character order). The values of `‘col_1'`

are of the format `order_<val>`

where `val`

can be any integer e.g. `‘order_25'`

.

```
df.sort_values(
by='col_1',
key=lambda x: x.str.replace('order_', '').astype(int),
)
```

Here is how this works:

- In order to sort by the numeric part of a string, we extract the numeric part then convert its data type (type cast it) to a numeric type then use that resulting numeric variable to sort the rows of the data frame.
- As noted above, column
`col_1`

consists of values of the format`order_<val>`

where`val`

can be any integer e.g.`‘order_25'`

. We use`str.replace()`

to remove the character prefix`‘order_’`

from every value in the column`col_1`

. This leaves the numeric part e.g.`‘order_25’`

becomes`‘25’`

. For a coverage of working with strings in Pandas see String Operations. - We then use
`astype(int)`

to convert the data type of the remaining numeric string to a numeric data type. - As described in the
*Case Insensitive*sorting example above, we pass the transformation function that we wish to apply to our sorting column to the`key`

argument of`sort_values()`

. - A transformed version of
`col_1`

is created, used for sorting, then discarded. The original values of`col_1`

stay as is.

We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.

In this example, we wish to sort the rows of a data frame `df`

by the lower case transformation of columns `col_1`

and `col_2`

.

```
df.sort_values(
by=['col_1', 'col_2'],
key=lambda x: x.str.lower(),
)
```

Here is how this works:

- This works similarly to the
*Case Insensitive*sorting example described above. - The vectorized transformation function passed to the
`key`

argument of`sort_values()`

is applied to each column in`by`

independently. - Each of the columns
`col_1`

and`col_2`

is transformed individually and the data frame`df`

is arranged in ascending order of the lower case transformation of`col_1`

then rows which have the same lower case value for`col_1`

are sorted in ascending order of the lower case transformation of`col_2`

.

We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.

In this example, we wish to sort the rows of a data frame `df`

in ascending order of the lower case transformation of the column `col_1`

and the suffix after the underscore in `col_2`

.

```
def sort_func(x):
if x.name == 'col_1':
return x.str.lower()
elif x.name == 'col_2':
return x.str.replace('.*_', '', regex=True)
else:
return x
df.sort_values(
by=['col_1', 'col_2'],
key=sort_func
)
```

Here is how this works:

- The
`key`

argument of`sort_values()`

accepts only one function. Therefore, to apply different transformation functions to different sorting columns, we need to do a bit of extra work. - The columns passed to the key function are passed with their name. Therefore, we can check for column name inside the function then apply a different transformation for different columns.
- In this example the
`sort_func()`

function checks for column name and applies the corresponding transformation function like so:- If the column name is
`‘col_1’`

it converts the column to lower case - If the column name is
`‘col_2’`

it removes the prefix that is before the underscore. - If the column name is anything else, it is returned as is.

- If the column name is

We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.

In this example, we wish to sort the rows of a data frame `df`

by the ratio between two columns `col_1`

and `col_2`

.

```
(df
.assign(sort_col=lambda x: x['col_1'] / x['col_7'])
.sort_values('sort_col')
.drop('sort_col', axis=1)
)
```

Here is how this works:

- A function passed to the
`key`

argument of`sort_values()`

is applied to each column individually. Therefore, if we wish to apply a transformation that involves multiple columns, we can’t use`sort_values()`

to carry out that*multivariate*transformation. - The way around is to use assign to create the sort column (which here is called
`sort_col`

) and then sort by it via`sort_values()`

. - If we do not need to keep the column we just created for sorting, we use
`drop()`

to remove it from the data frame`df`

. See Exclude Columns for a coverage of excluding data frame columns. - On a related note, it is good practice, to ensure that the code is producing its intended outcome, to compute and look at the sorting column during development even if we do not wish to keep it in the final state.

We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.

In this example, we wish to sort the rows of the data frame `df`

by the index of where a given regular expression appears first in the corresponding value of the column `col_1`

.

```
import re
df_2 = df \
.sort_values(by='col_1',
key=lambda col: col.apply(lambda x: re.search(r"\bis\b", x).start()))
```

Here is how this works:

- In order to perform a non-vectorized transformation and use the output for sorting, we need to use nested lambda functions. The outer function receives the column to be sorted by as a Series while the inner function iterates over the values of a column and execute the non-vectorized transformation once for each row. The output is then used for sorting.
- In this example, we use
`map_dbl()`

because the output we expect has a double data type. See Working with Lists. - For each value of the column
`col_1`

,`map_dbl()`

will execute the anonymous function`~ abs(mean(rnorm(., 0, 10)))`

, which breaks down as follows: `abs()`

returns the absolute value. See Numerical Operations.`mean()`

returns the mean of a list of values. See Summary Statistics.`rnorm(n, mean, sd)`

produces`n`

values from a normal distribution with a mean of`mean`

and a standard deviation of`sd`

. See Distributions

*Alternative: Via a Function*

```
import re
def locate_regex(p_col, p_regex):
regex_loc = p_col.apply(lambda x: re.search(p_regex, x).start())
return regex_loc
df_2 = df \
.sort_values(by='col_1',
key=lambda col: locate_regex(col, r"\bis\b"))
```

Here is how this works:

This works similarly to the primary solution above except that we isolate the inner non-vectorized transformation into a separate function. This may offer better readability in some situations.

When we sort on a column with missing data, the rows with the missing values for the sort column will appear at the end. This happens regardless of whether we are sorting in ascending or descending order. In this section we cover how to change this behavior and show rows with missing values at the top (instead of at the bottom).

```
df.sort_values(by='col_1', na_position='first')
```

Here is how this works:

- The
`sort_values()`

function has an`na_position`

argument that takes the value`na_position=’last’`

by default. - To have missing values sorted at the beginning (the top of the resulting data frame), we set
`na_position=’first’`

.

PYTHON