Summarize Non-Numeric Column

On this page, we cover how to get a summary of a non numeric column so we may gain some insight into the data it holds.

Non-numeric columns include categorical, string, factor, Boolean and date-time columns. For the purpose of high level inspection, they can more or less be summarized in the same way e.g. number of unique values.

Unique Values

We wish to get the unique values that a non-numeric column takes.

df %>% distinct(col_1)

Here is how this works:

  • We pass a data frame df to the function distinct().
  • We also pass to distinct() the name of the column whose unique values we are interested in getting (here col_1).

Alternatively, we could also use df %>% pull(col_1) %>% unique(). All equal, we prefer the use of distinct() because it can works the in the same way for multiple columns to get their unique value combinations.

Frequency

We wish to know how many times does each of the possible values of a particular non-numeric column of interest exist in a data frame. In other words, for how many rows of the data frame does the non-numeric column of interest take each of its possible values.

df %>% count(col_1, sort = TRUE)

Here is how this works:

  • We pass a data frame df to the function count().
  • We also pass to count() the name of the column whose unique values we are interested in getting (here col_1).
  • We set sort = True so the output is sorted in descending order of frequency.
  • If there are missing values (encoded as NA), count() will return a separate category for missing values.
  • In many situations while working with categorical data, eg when filtering, we need to know the exact possible values that the categorical variable can take. A common way to see those exact values is via count().

Proportion

We wish to know the proportion (percentage or density) of the total number of rows (observations) that take each possible value of the non-numeric column (variable).

library(janitor)
df %>% tabyl(col_1)

Here is how this works:

  • We pass a data frame df to the function tabyl() and specify the column of interest; here col_1.
  • We are using the tabyl() function from the janitor package instead of base R’s table() function which returns both the frequency n of each value as well as it’s proportion.

Alternatively, we could use df %>% count(col_2) %>% mutate(percent = n/sum(n)) to obtain the same result. We find tabyl() to be a better approach for many scenarios especially during data inspection.

Frequency of Frequency

We wish to know how many values occur a particular number of times e.g. how many values occur exactly 1 time.

df %>% count(col_1) %>% count(n)

Here is how this works:

  • We execute count() twice.
  • The first execution of count() returns the number of times n each possible value of col_1 occurs in the data frame df.
  • The second execution of count() acts on the frequency column n generated by the first run and returns the frequency of each value of n which is the frequency of frequency that we are after.

Filter Summary

When dealing with real data sets, the column we may wish to summarize with value_counts() is likely to have many possible values some of which are infrequent (rarely occurring) and often not of much value to the task at hand (or of all the value if we are looking for rarely occurring values) but could make the output noisy and harder to process. In such situations, it helps to filter the summary down to the parts that are useful to the task at hand which is what we cover in this section.

By Frequency

We wish to see only values of a non-numeric column that have a certain frequency or higher (or lower perhaps to view possible encoding issues).

In this example, we wish to get values of a variable that exist 10 times or more in the data set.

df %>% tabyl(col_1) %>% filter(n > 10)

Here is how this works:

  • We use tabyl() to obtain the frequency table for col_1 as described above.
  • The output of tabyl() is a data frame which has a column called n holding the number of rows taking each value of the column of interest (here col_1). We can, therefore, use filter() to get only the values of col_1 that occur more than a given number of times (here 10).

By Proportion

We wish to see only values of a non-numeric column that have a certain proportion or higher (or lower perhaps to view possible encoding issues).

In this example, we wish to get values of a non-numeric variable that comprise 10% or more of the data set.

df %>% tabyl(col_1) %>% filter(percent > 0.1)

Here is how this works:

  • We use tabyl() to obtain the frequency table for col_1 as described above.
  • The output of tabyl() is a data frame with a percent column holding the proportion of rows taking each value of the column of interest (here col_1). We can, therefore, use filter() to get only the values of col_1 that have a certain percent or higher (here higher than 10%).

By Value

We wish to get the frequency of only a particular set of a column’s possible values (rather than all of it’s possible values).

df %>% tabyl(col_1) %>% filter(col_1 %in% c('a', 'c'))

Here is how this works:

  • We use tabyl() to obtain the frequency table for col_1 as described above.
  • The output of tabyl() is a data frame with a column named after the input column (here col_1 ). We can, therefore, use filter() to get only the values of col_1 that we are interested in. See Filtering for more on row filtering.

Most Frequent

We wish to limit the output to just the most prevalent (or least prevalent) values of the variable being summarized.

df %>% tabyl(col_1) %>% arrange(desc(n)) %>% slice_head(n=10)

Here is how this works:

  • We use tabyl() to obtain the frequency table for col_1 as described above.
  • We then apply arrange(desc(n)) to arrange the values of col_1 in descending order of their frequency.
  • We then use slice_head(n=10) to return the top 10 most frequently occurring values of col_1.
R
I/O