We wish to obtain or learn about the unique values in data.
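In particular, this section covers the following: obtaining the unique values, counting the unique values, computing the frequency of occurrence of each unique value, and computing the proportion of occurrences accounted for by each unique value.
In each of those, we cover two scenarios: one column and multiple columns.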
We wish to obtain unique values.
One Column
We wish to obtain the unique values in one column of a data frame (or in one vector of values).
In this example, we wish to obtain the unique values of the column col_1 of the data frame df.
df %>% pull(col_1) %>% unique()
Here is how this works:
We use the function unique() to obtain a vector containing the unique values taken by a column of a data frame (i.e. by a vector of values).
In df %>% pull(col_1), we extract the column col_1 from the data frame df, obtaining a vector of values. We find that this construct makes for more pleasing chains. The code above is equivalent to unique(df$col_1).
Alternative: via distinct()
df_2 = df %>% distinct(col_1)
Here is how this works:
While unique() is applied to a vector, distinct() is applied to a data frame.
We pass to distinct() the name of the column whose unique values we wish to obtain.
The output of distinct() is a data frame where each row holds one unique value.
Multiple Columns
We wish to obtain the unique combinations of values of a set of columns of a data frame.
df_2 = df %>% distinct(col_1, col_2)
Here is how this works:
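We use distinct() to obtain the unique combinations.
We pass two things to distinct(): the data frame, which here is df and is passed via the pipe %>% operator, and the columns whose unique combinations we wish to obtain, which here are col_1 and col_2.
The output of distinct() is a data frame where each row holds one unique combination of the input columns, which here are col_1 and col_2.
Extension: Unique Rows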
df_2 = df %>% distinct()
Here is how this works:
If we do not pass any columns to distinct(), it identifies the unique combinations of all columns, i.e. the unique rows.
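The three cases above (one column, several columns, all columns) can be sketched on a small data frame. This is only an illustration: the data below is hypothetical, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data for illustration; not part of the original example.
df <- tibble(
  col_1 = c("a", "a", "b", "b", "b"),
  col_2 = c(1, 1, 1, 2, 2),
  col_3 = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# One column: a vector of unique values.
unique_values <- df %>% pull(col_1) %>% unique()

# One column: a data frame with one row per unique value.
unique_col_1 <- df %>% distinct(col_1)

# Multiple columns: one row per unique combination of col_1 and col_2.
unique_pairs <- df %>% distinct(col_1, col_2)

# All columns: one row per unique row of df.
unique_rows <- df %>% distinct()
```

Note that unique() returns a plain vector, while every distinct() call returns a data frame, which is usually more convenient when further data frame operations follow.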
We wish to obtain the number of unique values.
One Column
We wish to obtain the number of unique values in a column of a data frame (or in one vector of values).
In this example, we have a data frame df and we wish to obtain the number of unique values of the column col_2 per group, where the groups are specified by the column col_1.
df_2 = df %>%
  group_by(col_1) %>%
  summarize(col_2_n_distinct = n_distinct(col_2))
Here is how this works:
We use n_distinct() to compute the number of unique values of the column col_2 for each group.
The input to n_distinct() is a vector of values and the output is a single integer value.
We call n_distinct() inside summarize(), which is called after group_by() to obtain a summary for each group. See Aggregating.
Extension: Ignore Missing
df %>% pull(col_1) %>% n_distinct(na.rm = TRUE)
Here is how this works:
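By default (na.rm = FALSE), n_distinct() counts a missing value NA as one of the unique values. To ignore missing values, we set na.rm = TRUE in the call to n_distinct().
Multiple Columns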
We wish to obtain the number of unique combinations of a set of columns of a data frame.
In this example, we have a data frame df and we wish to obtain the number of unique combinations of the columns col_2 and col_3 per group, where the groups are specified by the column col_1.
df_2 = df %>%
  group_by(col_1) %>%
  summarize(col_2_3_n_distinct = n_distinct(col_2, col_3))
Here is how this works:
We can pass multiple columns to the function n_distinct(), which here are col_2 and col_3, and it will return the number of unique combinations of those columns.
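To make the counting behavior concrete, here is a minimal sketch that exercises n_distinct() on a whole column, with missing values ignored, and per group for one and for two columns. The data is hypothetical and chosen so that one value of col_2 is missing; it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data for illustration; not part of the original example.
df <- tibble(
  col_1 = c("a", "a", "a", "b", "b"),
  col_2 = c(1, 1, NA, 2, 3),
  col_3 = c("x", "y", "y", "z", "z")
)

# Number of unique values of col_2 over the whole column; NA counts as a value.
n_all <- df %>% pull(col_2) %>% n_distinct()

# The same count with missing values ignored.
n_non_missing <- df %>% pull(col_2) %>% n_distinct(na.rm = TRUE)

# Per group: unique values of col_2, and unique combinations of col_2 and col_3.
df_2 <- df %>%
  group_by(col_1) %>%
  summarize(
    col_2_n_distinct = n_distinct(col_2),
    col_2_3_n_distinct = n_distinct(col_2, col_3)
  )
```

With this data, n_all is 4 and n_non_missing is 3, because the missing value is counted as a value of its own unless na.rm = TRUE is set.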
We wish to compute the frequency of occurrence of each unique value.
One Column
We wish to compute the frequency of occurrence of each unique value of a column of a data frame (or a vector of values).
In this example, we wish to obtain the number of occurrences of each unique value of the column col_1.
df_2 = df %>% count(col_1, sort = TRUE)
Here is how this works:
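We use count() to obtain the frequency of occurrence of each unique value.
We pass two things to count(): the data frame, which here is df and is passed via the pipe %>% operator, and the column whose unique values we wish to count, which here is col_1.
To sort the output in descending order of frequency, we set the argument sort of count() to sort=TRUE.
Extension: Most Frequent Value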
We wish to obtain the most frequent value taken by a column of a data frame.
In this example, we have a data frame df and we wish to obtain the most frequent value of the column col_2 per group, where the groups are defined by the value of the column col_1.
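# Return the most frequent value of a vector.
# Note: this definition masks base R's mode(), and because it relies on
# names(), the result is returned as a character string.
mode <- function(.x) {
  .x %>%
    table() %>%
    sort(decreasing = TRUE) %>%
    names() %>%
    first()
}

df_2 = df %>%
  group_by(col_1) %>%
  summarize(col_2_mode = mode(col_2))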
Here is how this works:
We define a function mode() that takes a vector of values and identifies the most frequent value.
We call mode() inside summarize(), which is called after group_by() to obtain a summary for each group. See Aggregating.
Multiple Columns
We wish to compute the frequency of occurrence of each unique combination of values of a set of columns of a data frame.
In this example, we wish to obtain the number of occurrences of each unique combination of values of the columns col_1 and col_2.
df_2 = df %>% count(col_1, col_2, sort = TRUE)
Here is how this works:
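We use count() to obtain the frequency of occurrence of each unique combination of values.
We pass two things to count(): the data frame, which here is df and is passed via the pipe %>% operator, and the columns whose unique combinations we wish to count, which here are col_1 and col_2.
To sort the output in descending order of frequency, we set the argument sort of count() to sort=TRUE.
Alternative: Traditional Aggregation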
df_2 = df %>%
  group_by(col_1, col_2) %>%
  summarize(n = n()) %>%
  arrange(desc(n))
Here is how this works:
We use group_by() to create groups from the columns whose combinations we wish to tabulate. See Aggregating.
We use n() inside summarize() to compute the size of each group. See Length.
We use arrange() to sort in descending order of group size. See Sorting.
We wish to compute the ratio of the number of occurrences of each unique value to the total number of occurrences.
One Column
We wish to compute the ratio of the number of occurrences of each unique value of a column of a data frame to the length of the column.
In this example, we wish to obtain the ratio of the number of occurrences of each unique value of the column col_1 of the data frame df to the length of the column.
df_2 = df %>%
  count(col_1) %>%
  mutate(percent = n / sum(n))
Here is how this works:
We use count() to obtain the frequency of occurrence of each unique value as described under Occurrence Frequency above.
We use mutate() to compute the ratio of the number of occurrences of each unique value of the column col_1 of the data frame df to the length of the column. In that call, percent will be the name of the output column holding the ratio, n is the name of the column generated by count() that holds the number of occurrences of each unique value of the column col_1, and sum(n) is the total number of occurrences, i.e. the number of rows of the data frame df.
The output df_2 is a data frame with one row for each unique value in col_1. It has three columns: col_1, n, and percent. col_1 holds the unique values, n holds the number of occurrences of the corresponding unique value, and percent holds the proportion of occurrences of the corresponding unique value.
Alternative: via tabyl()
library(janitor)
df_2 = df %>% tabyl(col_1)
Here is how this works:
We pass the data frame df to the function tabyl() from the janitor package and specify the column of interest; here col_1.
We use the tabyl() function from the janitor package instead of base R’s table() function because it returns the count n of each value as well as its proportion percent, and because it returns a data frame, much like count().
The output df_2 is a data frame with one row for each unique value in col_1. It has three columns: col_1, n, and percent. col_1 holds the unique values, n holds the number of occurrences of the corresponding unique value, and percent holds the proportion of occurrences of the corresponding unique value.
Multiple Columns
We wish to compute the ratio of the number of occurrences of each unique combination of values of a set of columns of a data frame to the length of the data frame.
In this example, we wish to obtain the ratio of the number of occurrences of each unique combination of the values of the columns col_1 and col_2 of the data frame df to the length of the data frame.
df_2 = df %>%
  count(col_1, col_2, sort = TRUE) %>%
  mutate(percent = n / sum(n))
Here is how this works:
We pass to count() the names of the set of columns whose unique combinations we are interested in.
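As a quick check of how the proportions come out, the same pipeline can be run on a small data frame. The data below is hypothetical and only serves as an illustration; it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data for illustration; not part of the original example.
df <- tibble(
  col_1 = c("a", "a", "a", "b"),
  col_2 = c("x", "x", "y", "y")
)

# Count each unique combination, most frequent first, then divide each
# count by the total number of rows to get its proportion.
df_2 <- df %>%
  count(col_1, col_2, sort = TRUE) %>%
  mutate(percent = n / sum(n))
```

Here the combination of "a" and "x" occurs in two of the four rows, so it comes first with a proportion of 0.5. Despite its name, the column percent holds a proportion between 0 and 1, and the proportions sum to 1.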