Unique Values

We wish to obtain or learn about the unique values in data.

In particular, this section covers the following:

  • In Get Unique Values, we cover how to obtain the actual unique values.
  • In Count Unique Value, we cover how to obtain the number of possible unique values.
  • In Occurrence Frequency, we cover how to compute the frequency of occurrence of each unique value.
  • In Occurrence Proportion, we cover how to compute the ratio between the number of occurrences of each unique value to the total number of occurrences.

In each of those, we cover two scenarios:

  1. One Column: The unique values a single column (i.e. a vector) can take.
  2. Multiple Columns: The unique combinations a set of columns can take.

Get Unique Values

We wish to obtain unique values.

One Column

We wish to obtain the unique values in one column of a data frame (or in one vector of values).

In this example, we wish to obtain the unique values of the column col_1 of the data frame df.

df %>% pull(col_1) %>% unique()

Here is how this works:

  • We use the function unique() to obtain a vector containing the unique values taken by a column of a data frame (i.e. by a vector of values).
  • In df %>% pull(col_1), we extract the column col_1 from the data frame df obtaining a vector of values. We find that this construct makes for more pleasing chains. The code above is equivalent to unique(df$col_1).

Alternative: via distinct()

df_2 = df %>% distinct(col_1)

Here is how this works:

  • While unique() is applied to a vector, distinct() is applied to a data frame.
  • We pass to distinct() the name of the column whose unique values we wish to obtain.
  • The output of distinct() is a data frame where each row holds one unique value.

Multiple Columns

We wish to obtain the unique combinations of values of a set of columns of a data frame.

df_2 = df %>% distinct(col_1, col_2)

Here is how this works:

  • We use distinct() to obtain the unique combinations.
  • We pass to distinct():
    • The data frame, which here is df and is passed via the pipe %>% operator
    • The names of the columns whose unique combinations we wish to obtain, which here are col_1 and col_2.
  • The output of distinct() is a data frame where each row holds one unique combination of the input columns, which here are col_1 and col_2.

Extension: Unique Rows

df_2 = df %>% distinct()

Here is how this works:

If we do not pass any columns to distinct(), it identifies the unique combinations of all columns.

Count Unique Values

We wish to obtain the number of possible unique values.

One Column

We wish to obtain the number of possible unique values in a column of a data frame (or in one vector of values).

In this example, we have a data frame df and we wish to obtain the number of unique values of the column col_2 per group, where the groups are specified by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>% 
  summarize(col_2_n_distinct = n_distinct(col_2))

Here is how this works:

  • We use the function n_distinct() to compute the number of unique values of the column col_2 for each group.
  • The input to n_distinct() is a vector of values and the output is a single integer value.
  • We call n_distinct() inside summarise(), which is called after group_by() to obtain a summary for each group. See Aggregating.

Extension: Ignore Missing

df %>% pull(col_1) %>% n_distinct(na.rm = TRUE)

Here is how this works:

  • By default, missing values count as 1 unique value. Therefore, if there are one or more missing values, the number of unique values will be one plus the number of actual unique values.
  • To change this behavior and disregard missing values, we set na.rm = FALSE in the call to n_distinct().

Multiple Columns

We wish to obtain the number of possible unique combinations of a set of columns of a data frame.

In this example, we have a data frame df and we wish to obtain the number of unique combinations of the columns col_2 and col_3 per group, where the groups are specified by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>% 
  summarize(col_2_3_n_distinct = n_distinct(col_2, col_3))

Here is how this works:

We can pass multiple columns to the function n_distinct(), which here are col_2 and col_3, and it will return the number of unique combinations of those columns.

Occurrence Frequency

We wish to compute the frequency of occurrence of each unique value.

One Column

We wish to compute the frequency of occurrence of each unique value of a column of a data frame (or a vector of values).

In this example, we wish to obtain the number of occurrences of each unique value of the column col_1.

df_2 = df %>% count(col_1, sort = TRUE)

Here is how this works:

  • We use the function count() to obtain the frequency of occurrence of each unique value.
  • We pass to count():
    • The data frame, which here is df and is passed via the pipe %>% operator
    • The name of the column whose unique value occurrence frequency we wish to obtain, which here is col_1
  • Typically, when we look at unique value counts, we wish to see them sorted in descending order of frequency. To do that, we set the argument sort of count() to sort=TRUE.

Extension: Most Frequent Value

We wish to obtain the most frequent value taken by a column of a data frame.

In this example, we have a data frame df and we wish to obtain the most frequent value of the column col_2 per group where the groups are defined by the value of the column col_1.

mode <- function(.x) {
  .x %>% 
    table() %>% 
    sort(decreasing = TRUE) %>% 
    names() %>% 
    first()
}

df_2 = df %>%
    group_by(col_1) %>%
    summarize(col_2_mode = mode(col_2))

Here is how this works:

  • The most frequent value in a vector of values is often referred to as the mode.
  • There is no function in base R or in the tidyverse to obtain the mode. Therefore, we implement a custom function mode() that takes a vector of values and identifies the most frequent value.
  • We call mode() inside summarise(), which is called after group_by() to obtain a summary for each group. See Aggregating.
  • See Ranking for a coverage of computing the ranks of values in a column.

Multiple Columns

We wish to compute the frequency of occurrence of each unique combination of values of a set of columns of a data frame.

In this example, we wish to obtain the number of occurrences of each unique combination of values of the columns col_1 and col_2.

df_2 = df %>% count(col_1, col_2, sort = TRUE)

Here is how this works:

  • We use the function count() to obtain the frequency of occurrence of each unique combination of values.
  • We pass to count():
    • The data frame, which here is df and is passed via the pipe %>% operator
    • The names of the columns whose unique value occurrence frequency we wish to obtain, which here are col_1 and col_2
  • Typically, when we wish to see frequencies sorted in descending order. To do that, we set the argument sort of count() to sort=TRUE.

Alternative: Traditional Aggregation

df_2 = df %>% 
    group_by(col_1, col_2) %>%
    summarize(n = n()) %>%
    arrange(desc(n))

Here is how this works:

  • We use group_by() to create groups with the columns whose combinations we wish to create a frequency table of. See Aggregating.
  • We then use n() inside summarize() to compute the size of each group. See Length.
  • Finally, we use arrange() to sort in descending order of group size. See Sorting.

Occurrence Proportion

We wish to compute the ratio between the number of occurrences of each unique value to the total number of occurrences.

One Column

We wish to compute the ratio between the number of occurrences of each unique value of a column of a data frame to the length of the column.

In this example, we wish to obtain the ratio between the number of occurrences of each unique value of the column col_1 of the data frame df to the length of the column.

df_2 = df %>% 
    count(col_1) %>% 
    mutate(percent = n/sum(n))

Here is how this works:

  • We use the function count() to obtain the frequency of occurrence of each unique value as described under Occurrence Frequency above.
  • We then use mutate() to compute the ratio between the number of occurrences of each unique value of the column col_1 of the data frame df to the length of the column.
  • The expression that computes the ratio is percent = n/sum(n) where
    • percent will be the name of the output column holding the ratio
    • n is the name of the column generated by count() that holds the number of occurrences of each unique value of the column col_1
    • sum(n) is the total number of occurrences i.e. the number of rows of the data frame df
  • The output df_2 is a data frame with one row for each unique value in col_1. It has three columns: col_1, n, and percent. col_1 holds the unique values, n holds the number of occurrences of the corresponding unique value, and percent holds the proportion of occurrence of the corresponding unique value.

Alternative: via tabyl()

library(janitor)

df_2 = df %>% tabyl(col_1)

Here is how this works:

  • We pass a data frame df to the function tabyl() from the janitor package and specify the column of interest; here col_1.
  • We are using the tabyl() function from the janitor package instead of base R’s table() function because:
    • It returns both the frequency n of each value as well as its proportion percent.
    • It is also a lot more convenient to work with (inspired by the tidyverse design principles), e.g. it accepts a data frame and a column name like count().
  • The output df_2 is a data frame with one row for each unique value in col_1. It has three columns: col_1, n, and percent. col_1 holds the unique values, n holds the number of occurrences of the corresponding unique value, and percent holds the proportion of occurrence of the corresponding unique value.

Multiple Columns

We wish to compute the ratio between the number of occurrences of each unique combination of values of a set of columns of a data frame to the length of the data frame.

In this example, we wish to obtain the ratio between the number of occurrences of each unique combination of the values of the columns col_1 and col_2 of the data frame df to the length of the data frame.

df_2 = df %>% 
    count(col_1, col_2, sort = TRUE) %>%
    mutate(percent = n/sum(n))

Here is how this works:

  • This works similarly to the "One Column" solution above except that we pass to count() the names of the set of columns whose unique combinations we are interested in.
  • See Inspecting Factor by Factor for how to create a cross-table between two columns.
R
I/O