Factor Sorting

We wish to sort an ordinal categorical variable in its domain order rather than in alphanumeric order. An ordinal categorical variable is one whose values have a natural order, e.g. responses in a survey.

In this example, we have a data frame df with a column col_1, a categorical variable holding the t-shirt sizes XS, S, M, L, and XL. We wish to sort the rows of df in the natural order of the t-shirt sizes, i.e. XS < S < M < L < XL.

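library(dplyr)

# Levels listed in their natural (domain) order
sizes <- c('XS', 'S', 'M', 'L', 'XL')

df_2 <- df %>%
  mutate(
    # Convert col_1 to an ordered factor so that it sorts by level order
    col_1 = factor(col_1,
                   levels = sizes,
                   ordered = TRUE)) %>%
  arrange(col_1)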

Here is how this works:

  • If the column we wish to sort by is already an ordered factor with its levels correctly defined, we can simply apply arrange() and the data frame will be sorted according to the order of the factor's levels.
  • To convert a string column to a factor, we use factor(), passing it the column to convert (here col_1), the levels of the factor in the desired order (here the vector sizes), and the parameter ordered, which makes the factor ordered (ordinal) when ordered=TRUE and unordered (nominal) when ordered=FALSE (here ordered=TRUE). See Factor Operations for more details.
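
To make the difference from alphanumeric sorting concrete, here is a minimal sketch. The sample data frame and its qty column are hypothetical, since the original df is not shown; only factor(), mutate(), arrange(), and pull() from dplyr and base R are used.

```r
library(dplyr)

# Hypothetical sample data (assumption: the real df is not shown)
df <- data.frame(
  col_1 = c('M', 'XL', 'XS', 'L', 'S'),
  qty   = c(3, 1, 4, 2, 5),
  stringsAsFactors = FALSE
)

sizes <- c('XS', 'S', 'M', 'L', 'XL')

# Sorting the raw strings follows alphanumeric order, not size order
alpha_order <- df %>% arrange(col_1) %>% pull(col_1)

# Sorting after conversion to an ordered factor follows the level order
df_2 <- df %>%
  mutate(col_1 = factor(col_1, levels = sizes, ordered = TRUE)) %>%
  arrange(col_1)

size_order <- as.character(df_2$col_1)
```

Here size_order follows XS, S, M, L, XL, while alpha_order does not. Note that any value of col_1 that is not listed in sizes becomes NA when converted with factor(), so the levels vector should cover every value that occurs in the column.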

Alternative: Value Mapping

# Map each size to a number that defines its position in the sort order
sort_func <- function(col) {
  case_when(
    col == 'XS' ~ 1,
    col == 'S' ~ 2,
    col == 'M' ~ 3,
    col == 'L' ~ 4,
    col == 'XL' ~ 5,
    TRUE ~ 6  # any other value sorts last
  )
}

df_2 <- df %>%
  arrange(sort_func(col_1))

Here is how this works:

  • In some situations, converting a variable to a factor is not appropriate, for example when the column should remain a character vector for later steps.
  • In those situations, we can write a function that implements whatever custom sorting logic we need. The function must accept a vector and return a vector of the same length. In this example, the function is sort_func().
  • We then call that function from within arrange() and pass it the sorting column, which here is col_1. The column itself is left unchanged; only the row order is affected.
  • We use case_when() to map each value of col_1 to a number that defines its sorting position; the final TRUE ~ 6 branch places any unlisted value last. See General Operations for coverage of conditional statements in R.
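
The behavior described above can be checked with a small sketch. The sample data is hypothetical and deliberately includes a size, 'XXL', that the mapping does not list, to show the effect of the fallback branch.

```r
library(dplyr)

# Same mapping as above: each size gets a number giving its sort position
sort_func <- function(col) {
  case_when(
    col == 'XS' ~ 1,
    col == 'S' ~ 2,
    col == 'M' ~ 3,
    col == 'L' ~ 4,
    col == 'XL' ~ 5,
    TRUE ~ 6  # fallback: unlisted values sort last
  )
}

# Hypothetical sample data with one unlisted value ('XXL')
df <- data.frame(
  col_1 = c('M', 'XXL', 'XS', 'L', 'S', 'XL'),
  stringsAsFactors = FALSE
)

df_2 <- df %>%
  arrange(sort_func(col_1))

sorted_sizes <- df_2$col_1
```

In this sketch sorted_sizes is still a character vector, with the listed sizes in their natural order and 'XXL' at the end.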