Sorting

We wish to control special aspects of sorting strings.

This section is organized as follows:

  • Case Insensitive: We wish to sort by a string column in a case-insensitive manner i.e. for the purpose of sorting, we wish to treat upper and lower case letters as the same.
  • Numeric Order: We wish to specify whether digits in a string are sorted numerically, e.g. 10 comes after 9, or as strings, e.g. 10 comes before 9.
  • Locale Specific: We wish to specify the locale used for sorting, since strings may sort differently in different locales.

Case Insensitive

We wish to sort by a string column in a case-insensitive manner, i.e. for the purpose of sorting, we wish to treat upper and lower case letters as the same. With string columns, sorting is case-sensitive by default, meaning values that start with an upper case letter appear first when sorting in ascending order and last when sorting in descending order, e.g. 'B' sorts before 'a' in ascending order.

In this example, we wish to sort the data frame df in ascending order of a string column with the name col_1 in a case-insensitive manner.

df %>% arrange(str_to_lower(col_1)) 

Here is how this works:

  • To sort by the transformed values of a column, we simply apply the transformation we want inside of arrange().
  • The transformation function we use inside arrange() should be vectorized i.e. it accepts a vector (a column) and returns a vector of the same length.
  • We use str_to_lower() from stringr to convert all characters to lower case. See Formatting.
  • An all lower case version of col_1 is created, used for sorting, then discarded. The original values of col_1 stay as is.
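
As a minimal sketch of the points above, we can run the case-insensitive sort on a small data frame and inspect the result. The data frame and its values are made up for illustration and are not part of the original example.

```r
library(dplyr)
library(stringr)

# Hypothetical example data; the values are made up for illustration.
df <- tibble(col_1 = c("banana", "Apple", "cherry", "Date"))

# Sort by a lower-cased copy of col_1; the stored values keep their case.
df_2 <- df %>% arrange(str_to_lower(col_1))

df_2$col_1
# "Apple" "banana" "cherry" "Date"
```

Note that the output keeps the original capitalization; only the ordering reflects the lower-cased values.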

Numeric Order

Sort Elements

Given a vector of strings, we wish to return the sorted vector with digits in a string sorted numerically.

In this example, we wish to summarize the data frame df over groups defined by the column col_1. The summary we are after is to concatenate the sorted values of the column col_2 for each group. For sorting, we wish to have digits in the string values sorted numerically.

df_2 = df %>% 
  group_by(col_1) %>% 
  summarise(summary = col_2 %>% 
              str_sort(numeric = TRUE) %>% 
              str_flatten(collapse = "-"))

Here is how this works:

  • The sort() function from base R sorts digits in strings as strings e.g. ‘a10’ would be sorted before ‘a9’.
  • We use the function str_sort() from the stringr package with the argument numeric=TRUE to sort the string values of the column col_2 (for each group) such that digits are sorted numerically.
  • We use the function str_flatten() to concatenate together the sorted values of the column col_2 for each group. See Collapsing.
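
To make the difference concrete, here is a small sketch that runs the same summary on made-up data. The data frame below is hypothetical and only serves to show what the numeric sort changes.

```r
library(dplyr)
library(stringr)

# Hypothetical example data; the values are made up for illustration.
df <- tibble(
  col_1 = c("x", "x", "x", "y", "y"),
  col_2 = c("a10", "a9", "a1", "b2", "b11"))

# Without numeric = TRUE, digits are compared character by character.
str_sort(c("a10", "a9", "a1"))
# "a1" "a10" "a9"

# With numeric = TRUE, runs of digits are compared as numbers.
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(summary = col_2 %>%
              str_sort(numeric = TRUE) %>%
              str_flatten(collapse = "-"))

df_2$summary
# "a1-a9-a10" "b2-b11"
```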

Sort Rows

Given a data frame, we wish to sort the rows by the values of a given string column with digits in the string values sorted numerically.

In this example, we wish to sort the rows of the data frame df by the values of the column col_1, with digits in the string values sorted numerically.

df <- tibble(
  col_1 = c('a0', 'a10', 'a11', 'a8', 'a9'),
  col_2 = c(1, 2, 3, 4, 5))

df_2 = df %>% 
  arrange(str_rank(col_1, numeric=TRUE))

Here is how this works:

  • The go-to function to sort the rows of a data frame is arrange(). See Sorting.
  • arrange() sorts digits in strings as strings e.g. ‘a10’ would be sorted before ‘a9’.
  • To get digits to sort numerically, we augment arrange() with the function str_rank() from stringr and pass the argument numeric=TRUE.
  • arrange() uses the integer rank values returned by str_rank() to sort the rows of the data frame by the values of the column col_1 so that digits in a string are sorted numerically.
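
To see what arrange() receives in this case, the sketch below looks at the ranks directly, using the same string values as col_1 in the example. It assumes a version of stringr that provides str_rank() (1.5.0 or later, as far as we are aware).

```r
library(stringr)

# Same values as col_1 in the example above.
x <- c("a0", "a10", "a11", "a8", "a9")

# Rank of each element when runs of digits are compared as numbers.
ranks <- str_rank(x, numeric = TRUE)
ranks
# 1 4 5 2 3

# Ordering by these ranks yields the numerically sorted values.
x[order(ranks)]
# "a0" "a8" "a9" "a10" "a11"
```

Sorting rows by these ranks is therefore equivalent to sorting them by col_1 with digits compared numerically.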

Locale Specific

We wish to define the locale to use while sorting a list of strings. Strings may sort differently in different locales.

In this example, we wish to sort the rows of the data frame df by the values of the string column col_1 while taking into account that the characters are in Lithuanian, where 'y' comes between 'i' and 'k'.

df_2 = df %>% 
  arrange(str_rank(col_1, locale = "lt"))

Here is how this works:

  • The convention in stringr is that any locale-sensitive operation, i.e. one that may return different outputs for the same input depending on the locale (e.g. sorting and case conversion), accepts a locale argument to which we can pass the desired locale.
  • If we do not pass a locale argument, stringr uses its default of "en" (English) so that results are consistent across platforms.
  • In this example, we use the function str_rank() from the stringr package to obtain the ranks as integers of the values of the column col_1 in the locale "lt" (Lithuanian).
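
As a small sketch of the effect of the locale argument, we can sort the same three letters under two explicitly stated locales. The input vector is made up for illustration, and the results assume the ICU collation rules that ship with stringi.

```r
library(stringr)

# Hypothetical input; three single letters chosen to expose the difference.
x <- c("k", "y", "i")

# English collation: 'y' comes after 'k'.
sorted_en <- str_sort(x, locale = "en")
sorted_en
# "i" "k" "y"

# Lithuanian collation: 'y' comes right after 'i', before 'k'.
sorted_lt <- str_sort(x, locale = "lt")
sorted_lt
# "i" "y" "k"
```

The same locale value can be passed to str_rank() or str_order() when the goal is to sort the rows of a data frame rather than a vector.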