We wish to sort the rows of a data frame by a transformation of the values of one or more columns instead of by the values themselves. A common example is we want to sort by the numeric part of a string column.
While we can create a new column by transforming existing columns, there are situations were we don’t wish to keep an intermediate value column that we only need for sorting.
We will cover the following custom sorting scenarios:
We wish to sort the rows of a data frame by a transformation of one column.
In this example, we wish to sort the data frame df
in ascending order of the first numerical substring (sequence of digits) of the string column col_1
in numeric order (not character order).
df_2 = df %>%
arrange(
col_1 %>%
str_extract('\\d+') %>%
parse_number()
)
Here is how this works:
col_1
is comprised of values of the format order_<val>
where val
can be any integer e.g. ‘order_25'
. We use str_replace()
to remove the character prefix ‘order_’
from every value in the column col_1
. This leaves the numeric part e.g. ‘order_25’
becomes ‘25’
. For a coverage of working with strings in R see String Operations.parse_number()
to convert the data type of the remaining numeric string to a numeric data type.arrange()
to sort the rows of the data frame df
.col_1
is created, used for sorting, then discarded. The original values of col_1
stay as is.col_1
passes it via the pipe %>%
to str_replace()
and passes the output of that via the pipe %>%
to parse_number()
. This is equivalent to parse_number(str_replace(col_1, 'order_', ''))
.We wish to sort the rows of a data frame by a transformation of multiple columns. In particular, we wish to apply the same transformation to each of a set of columns individually then sort by the transformed columns.
In this example, we wish to sort the rows of a data frame df
by the lower case transformation of columns col_1
and col_2
.
df_2 = df %>%
arrange(str_to_lower(col_1), str_to_lower(col_2))
Here is how this works:
col_1
and col_2
is transformed individually and the data frame df
is arranged in ascending order of the lower case transformation of col_1
then rows which have the same lower case value for col_1
are sorted in ascending order of the lower case transformation of col_2
.We wish to sort the rows of a data frame by transformations of multiple columns. In particular, we wish to apply a different transformation to each of a set of columns then sort by the transformed columns.
In this example, we wish to sort the rows of a data frame df
in ascending order of the lower case transformation of the column col_1
and the suffix after the underscore in col_2
.
df_2 = df %>%
arrange(str_to_lower(col_1),
str_replace_all(col_2, '.*_', ''))
Here is how this works:
arrange()
can be a distinct transformation of any one or more column.col_1
and the second sorting variable is a substring of col_2
(whatever is after the underscore _
).We wish to sort the rows of a data frame by a function of multiple columns. In particular, we wish to sort by one variable that is the output of a function of multiple columns. For example, we wish to sort by the ratio of two columns.
In this example, we wish to sort the rows of a data frame df
by the ratio between two columns col_1
and col_2
.
df_2 = df %>%
arrange(col_1 / col_2)
Here is how this works:
arrange()
just like we would inside of mutate()
.col_1
are divided by the corresponding values of col_2
. The resulting ratio is then used to sort the rows of the data frame df
in ascending order.We wish to sort the rows of a data frame by an operation that needs to be performed in a non-vectorized manner.
In this example, we wish to sort the rows of the data frame df
by the estimated mean after performing a given number of draws from a normal distribution. The number of draws is provided by the value of the column col_1
. Essentially, we are showing that the larger the sample size the closer the estimate of the mean to the actual mean (which is zero in this case).
df_2 = df %>%
arrange(map_dbl(col_1, ~ abs(mean(rnorm(., 0, 10)))))
Here is how this works:
map_dbl()
because the output we expect has a double data type. See Working with Lists.col_1
, map_dbl()
will execute the anonymous function ~ abs(mean(rnorm(., 0, 10)))
, which breaks down as follows:abs()
returns the absolute value. See Numerical Operations.mean()
returns the mean of a list of values. See Summary Statistics.rnorm(n, mean, sd)
produces n
values from a normal distribution with a mean of mean
and a standard deviation of sd
. See DistributionsWhen we sort on a column with missing data, the rows with the missing values for the sort column will appear at the end. This happens regardless of whether we are sorting in ascending or descending order. In this section we cover how to change this behavior and show rows with missing values at the top (instead of at the bottom).
df_2 = df %>%
arrange(!is.na(col_1), col_1)
Here is how this works:
arrange()
has no parameter to control whether NA
s are sorted at the top or the bottom. Therefore, we need to do a bit of extra work.!is.na(col_1)
as the first input to arrange()
. !is.na(col_1)
is FALSE
when the value of col_1
is NA
and TRUE
otherwise. Since FALSE
sorts higher than TRUE
, all NA
values will be sorted first.arrange()
is col_1
. Therefore, rows where !is.na(col_1)
is TRUE
i.e. not NA
will then be sorted by the value of col_1
which is the desired outcome.df %>% arrange(!is.na(col_6))
.Alternatively,
df_2 = df %>%
arrange(order(str_order(col_6, na_last=FALSE)))
Here is how this works:
str_order()
function from the stringr
package which when its argument na_last=FALSE
sorts a string vector while placing missing values NA
at the top.