Custom Sorting

We wish to sort the rows of a data frame by a transformation of the values of one or more columns instead of by the values themselves. A common example is we want to sort by the numeric part of a string column.

While we can create a new column by transforming existing columns, there are situations were we don’t wish to keep an intermediate value column that we only need for sorting.

We will cover the following custom sorting scenarios:

  • Single Column: We wish to sort the rows of a data frame by a transformation of one column.
  • Multiple Columns: We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.
  • Multiple Transformations: We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.
  • Multivariate Transformation: We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.
  • Non-Vectorized Transformation:We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.
  • Missing Values: Rows where the sorting columns have missing values are sorted at the end by default. In some situations, we may want to sort them differently e.g. at the beginning.

Single Column

We wish to sort the rows of a data frame by a transformation of one column.

In this example, we wish to sort the data frame df in ascending order of the first numerical substring (sequence of digits) of the string column col_1 in numeric order (not character order).

df_2 = df %>% 
  arrange(
    col_1 %>%
      str_extract('\\d+') %>% 
      parse_number()
  )

Here is how this works:

  • In order to sort by the numeric part of a string, we extract the numeric part then convert its data type (type cast it) to a numeric type then use that resulting numeric variable to sort the rows of the data frame.
  • As noted above, column col_1 is comprised of values of the format order_<val> where val can be any integer e.g. ‘order_25'. We use str_replace() to remove the character prefix ‘order_’ from every value in the column col_1. This leaves the numeric part e.g. ‘order_25’ becomes ‘25’. For a coverage of working with strings in R see String Operations.
  • We then use parse_number() to convert the data type of the remaining numeric string to a numeric data type.
  • The output of the above two operations is a numeric variable that is then used by arrange() to sort the rows of the data frame df.
  • A transformed version of col_1 is created, used for sorting, then discarded. The original values of col_1 stay as is.
  • For readability and fault tolerance, we applied the transformations as a chain that takes col_1 passes it via the pipe %>% to str_replace() and passes the output of that via the pipe %>% to parse_number(). This is equivalent to parse_number(str_replace(col_1, 'order_', '')).

Multiple Columns

We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.

In this example, we wish to sort the rows of a data frame df by the lower case transformation of columns col_1 and col_2.

df_2 = df %>% 
  arrange(str_to_lower(col_1), str_to_lower(col_2))

Here is how this works:

  • This works similarly to the Case Insensitive sorting example described above.
  • Each of the columns col_1 and col_2 is transformed individually and the data frame df is arranged in ascending order of the lower case transformation of col_1 then rows which have the same lower case value for col_1 are sorted in ascending order of the lower case transformation of col_2.

Multiple Transformations

We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.

In this example, we wish to sort the rows of a data frame df in ascending order of the lower case transformation of the column col_1 and the suffix after the underscore in col_2.

df_2 = df %>% 
  arrange(str_to_lower(col_1), 
          str_replace_all(col_2, '.*_', ''))

Here is how this works:

  • This works similarly to the Case Insensitive sorting example described above.
  • Each sorting variable passed to arrange() can be a distinct transformation of any one or more column.
  • In this example the first sorting variable is a lower case transformation of col_1 and the second sorting variable is a substring of col_2 (whatever is after the underscore _).

Multivariate Transformation

We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.

In this example, we wish to sort the rows of a data frame df by the ratio between two columns col_1 and col_2.

df_2 = df %>% 
  arrange(col_1 / col_2)

Here is how this works:

  • We can transformation logic involving multiple columns inside of arrange() just like we would inside of mutate().
  • The values of col_1 are divided by the corresponding values of col_2. The resulting ratio is then used to sort the rows of the data frame df in ascending order.

Non-Vectorized Transformation

We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.

In this example, we wish to sort the rows of the data frame df by the estimated mean after performing a given number of draws from a normal distribution. The number of draws is provided by the value of the column col_1. Essentially, we are showing that the larger the sample size the closer the estimate of the mean to the actual mean (which is zero in this case).

df_2 = df %>% 
  arrange(map_dbl(col_1, ~ abs(mean(rnorm(., 0, 10)))))

Here is how this works:

  • In order to perform a non-vectorized transformation and use the output for sorting, we use a mapping function to iterate over the relevant column (or columns) of interest and execute the non-vectorized transformation once for each row and use the output for sorting.
  • In this example, we use map_dbl() because the output we expect has a double data type. See Working with Lists.
  • For each value of the column col_1, map_dbl() will execute the anonymous function ~ abs(mean(rnorm(., 0, 10))), which breaks down as follows:
  • abs() returns the absolute value. See Numerical Operations.
  • mean() returns the mean of a list of values. See Summary Statistics.
  • rnorm(n, mean, sd) produces n values from a normal distribution with a mean of mean and a standard deviation of sd. See Distributions

Missing Values

When we sort on a column with missing data, the rows with the missing values for the sort column will appear at the end. This happens regardless of whether we are sorting in ascending or descending order. In this section we cover how to change this behavior and show rows with missing values at the top (instead of at the bottom).

df_2 = df %>% 
  arrange(!is.na(col_1), col_1)

Here is how this works:

  • arrange() has no parameter to control whether NAs are sorted at the top or the bottom. Therefore, we need to do a bit of extra work.
  • We pass !is.na(col_1) as the first input to arrange(). !is.na(col_1) is FALSE when the value of col_1 is NA and TRUE otherwise. Since FALSE sorts higher than TRUE, all NA values will be sorted first.
  • The second argument to arrange() is col_1. Therefore, rows where !is.na(col_1) is TRUE i.e. not NA will then be sorted by the value of col_1 which is the desired outcome.
  • If we simply wish to show missing values on top without sorting the rest of the data frame, we could use df %>% arrange(!is.na(col_6)).

Alternatively,

df_2 = df %>% 
  arrange(order(str_order(col_6, na_last=FALSE)))

Here is how this works:

  • We use the str_order() function from the stringr package which when its argument na_last=FALSE sorts a string vector while placing missing values NA at the top.
  • This is applicable when the sorting column is of string (character) data type.
R
I/O