We wish to control special aspects of sorting strings.
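This section covers three topics: case-insensitive sorting, sorting with digits in strings ordered numerically (for the elements of a vector and for the rows of a data frame), and locale-aware sorting.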
We wish to sort by a string column in a case-insensitive manner, i.e. for the purpose of sorting, we wish to treat upper and lower case letters as the same. With string columns, sorting is case-sensitive by default, meaning upper case text appears first when sorting in ascending order and last when sorting in descending order.
In this example, we wish to sort the data frame df in ascending order of a string column with the name col_1 in a case-insensitive manner.
df %>% arrange(str_to_lower(col_1))
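As a quick check, the same call can be tried on a small data frame. This is a minimal sketch: the values in df below and the name df_sorted are made up for illustration and are not part of the original example.

```r
library(dplyr)
library(stringr)

# Made-up data for illustration only (an assumption, not from the original)
df <- tibble(
  col_1 = c("banana", "Cherry", "apple", "Date"),
  col_2 = 1:4)

# Sort by a lower-cased copy of col_1; col_1 itself is left unchanged
df_sorted <- df %>% arrange(str_to_lower(col_1))

# Upper and lower case values are now interleaved alphabetically
df_sorted$col_1
```

The sort key exists only inside the call to arrange(), so the output keeps the original capitalization of col_1.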
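Here is how this works:
We pass a function call, rather than a bare column name, to arrange(), which then sorts the rows by the values that the call returns.
A function used inside arrange() should be vectorized, i.e. it accepts a vector (a column) and returns a vector of the same length.
We use str_to_lower() from stringr to convert all characters to lower case. See Formatting.
A lower-case copy of col_1 is created, used for sorting, then discarded. The original values of col_1 stay as is.

Sort Elements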
Given a vector of strings, we wish to return the sorted vector with digits in a string sorted numerically.
In this example, we wish to summarize the data frame df over groups defined by the column col_1. The summary we are after is to concatenate the sorted values of the column col_2 for each group. For sorting, we wish to have digits in the string values sorted numerically.
df_2 = df %>%
  group_by(col_1) %>%
  summarise(summary = col_2 %>%
              str_sort(numeric = TRUE) %>%
              str_flatten(collapse = "-"))
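To see the difference between sorting digits as text and sorting them numerically, the two approaches can be compared on a single vector. This is a minimal sketch; the vector x and the names plain, natural and joined are made up for illustration.

```r
library(stringr)

# Made-up values for illustration only
x <- c("a9", "a10", "a0", "a11", "a8")

# Base R compares digits character by character, so "a10" lands before "a8"
plain <- sort(x)

# stringr with numeric = TRUE compares runs of digits as numbers
natural <- str_sort(x, numeric = TRUE)

# Concatenate the naturally sorted values into one string
joined <- str_flatten(natural, collapse = "-")
joined
```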
Here is how this works:
The sort() function from base R sorts digits in strings as strings, e.g. 'a10' would be sorted before 'a9'.
We use str_sort() from the stringr package with the argument numeric=TRUE to sort the string values of the column col_2 (for each group) such that digits are sorted numerically.
We use str_flatten() to concatenate the sorted values of the column col_2 for each group. See Collapsing.

Sort Rows
Given a data frame, we wish to sort the rows by the values of a given string column with digits in the string values sorted numerically.
In this example, we wish to sort the rows of the data frame df by the values of the column col_1, with digits in the string values sorted numerically.
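library(dplyr)    # tibble(), arrange(), %>%
library(stringr)  # str_rank(), requires stringr 1.5.0 or later
# Example data: string values that contain digits
df <- tibble(
  col_1 = c('a0', 'a10', 'a11', 'a8', 'a9'),
  col_2 = c(1, 2, 3, 4, 5))
# Sort rows by the natural-order rank of col_1
df_2 = df %>%
  arrange(str_rank(col_1, numeric=TRUE))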
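It can help to look at the ranks themselves before they are handed to arrange(). The sketch below reuses the example data; the intermediate variable ranks is introduced here only for illustration.

```r
library(dplyr)
library(stringr)

df <- tibble(
  col_1 = c("a0", "a10", "a11", "a8", "a9"),
  col_2 = c(1, 2, 3, 4, 5))

# One integer rank per value; "a8" and "a9" rank below "a10" and "a11"
ranks <- str_rank(df$col_1, numeric = TRUE)
ranks

# arrange() orders the rows by those ranks
df_2 <- df %>% arrange(str_rank(col_1, numeric = TRUE))
df_2$col_1
```

Because col_2 was numbered in the original row order, df_2$col_2 shows which original row ended up in each position.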
Here is how this works:
We sort the rows of a data frame with arrange(). See Sorting.
By default, arrange() sorts digits in strings as strings, e.g. 'a10' would be sorted before 'a9'.
To sort digits numerically, we call arrange() with the function str_rank() from stringr and pass the argument numeric=TRUE.
arrange() uses the integer rank values returned by str_rank() to sort the rows of the data frame by the values of the column col_1 so that digits in a string are sorted numerically.

We wish to define the locale to use while sorting a list of strings. Strings may sort differently in different locales.
In this example, we wish to sort the rows of the data frame df by the values of the string column col_1 while taking into account that the characters are in Lithuanian, where 'y' comes between 'i' and 'k'.
df_2 = df %>%
  arrange(str_rank(col_1, locale = "lt"))
Here is how this works:
A convention of the stringr library is that if an operation is locale-sensitive, i.e. it would return different outputs for the same input depending on the locale (e.g. sorting and case setting), it accepts a locale argument to which we can pass the desired locale.
If no locale is passed, the English locale ("en") is assumed.
We use str_rank() from the stringr package to obtain the ranks, as integers, of the values of the column col_1 in the locale "lt" (Lithuanian).
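The effect of the locale argument can be checked on a few single letters. This is a minimal sketch; the vector x and the names sorted_en and sorted_lt are made up for illustration, and the result depends on the collation data of the ICU library behind stringr.

```r
library(stringr)

# Made-up values for illustration only
x <- c("k", "y", "i", "z")

# Default locale ("en"): 'y' sorts near the end of the alphabet
sorted_en <- str_sort(x)

# Lithuanian locale: 'y' is expected to sort between 'i' and 'k'
sorted_lt <- str_sort(x, locale = "lt")

sorted_en
sorted_lt
```

The same locale argument is accepted by str_rank() and str_order(), so the comparison carries over to sorting rows with arrange().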