Summarize Non-Numeric Column

On this page, we cover how to get a summary of a non-numeric column so that we may gain some insight into the data it holds.

Non-numeric columns include categorical, string, factor, Boolean, and date-time columns. For the purpose of high-level inspection, they can more or less be summarized in the same way, e.g. by the number of unique values each takes.

Unique Values

We wish to get the unique values that a non-numeric column takes.

df['col_1'].unique()

Here is how this works:

  • We select the column of interest via df['col_1'].
  • We then apply the function unique() to get the unique values of the selected column.
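
For illustration, here is a minimal, self-contained sketch; the data frame df and the contents of col_1 are made up for demonstration:

import pandas as pd

# Made-up data frame with a small string column
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'c', 'b', 'a']})

df['col_1'].unique()   # array(['a', 'b', 'c'], dtype=object)
df['col_1'].nunique()  # 3, the number of unique values

If all we need is how many unique values there are, nunique() returns that count directly.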

Frequency

We wish to know how many times each of the possible values of a particular non-numeric column of interest occurs in a data frame. In other words, for how many rows of the data frame the non-numeric column of interest takes each of its possible values.

df['col_1'].value_counts()

Here is how this works:

  • We use the value_counts() method of the Pandas Series to return the frequency of occurrence of each value of the variable (here col_1).
  • A convenience of value_counts() is that it automatically sorts the output by frequency, so the most frequent values appear at the top. We can turn this off by setting the argument sort=False.
  • By default, value_counts() drops missing values. If we wish to see a category for missing values, we can set dropna=False.
  • In many situations while working with categorical data, e.g. when filtering, we need to know the exact possible values that the categorical variable can take. A common way to see those exact values is via value_counts().
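
A minimal sketch with made-up data, including a missing value, illustrates the default behavior and the two arguments mentioned above:

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', np.nan])

s.value_counts()              # a: 3, b: 1 (NaN dropped, sorted by frequency)
s.value_counts(dropna=False)  # a: 3, b: 1, NaN: 1
s.value_counts(sort=False)    # preserves the order of the data instead of sorting by frequency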

Proportion

We wish to know the proportion (percentage or density) of the total number of rows (observations) that take each possible value of the non-numeric column (variable).

df['col_1'].value_counts(normalize=True)

Here is how this works:

  • In order to obtain the proportions of the total number of rows that take each possible value of the non-numeric column, we call value_counts() with the argument normalize=True.
  • Note that, by default, value_counts() excludes missing values, so the proportions are computed relative to the number of non-missing rows rather than the total number of rows. To include missing values as a category of their own (and normalize over all rows), we can set the argument dropna=False.

Should we wish to see counts and proportions side by side, we could do this:

import pandas as pd

# Both Series share the same index (the values of col_2), so they align row by row
d = pd.DataFrame()
d['count'] = df['col_2'].value_counts()
d['proportion'] = df['col_2'].value_counts(normalize=True)
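
Alternatively, and this is purely a stylistic choice, the same table can be built in one expression with pd.concat(), using keys to name the columns; this assumes df and col_2 as above:

import pandas as pd

# Concatenate the two summaries column-wise; keys become the column names
pd.concat(
    [df['col_2'].value_counts(),
     df['col_2'].value_counts(normalize=True)],
    axis=1,
    keys=['count', 'proportion'])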

Frequency of Frequency

We wish to know how many values occur a particular number of times, e.g. how many values occur exactly once.

df['col_1'].value_counts().value_counts()

Here is how this works:

  • We execute value_counts() twice.
  • The first execution of value_counts() returns, as a Series, the number of times each possible value of col_1 occurs in the data frame df.
  • The second execution of value_counts() acts on the Series generated by the first and returns the frequency of each of its values, which is the frequency of frequency that we are after.
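
To make the two steps concrete, here is a minimal sketch with made-up data:

import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])

s.value_counts()                 # a: 3, b: 2, c: 1
s.value_counts().value_counts()  # 3: 1, 2: 1, 1: 1 -- one value occurs 3 times, one twice, one once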

Filter Summary

When dealing with real data sets, the column we wish to summarize with value_counts() is likely to have many possible values, some of which are infrequent (rarely occurring). These infrequent values are often of little use to the task at hand (unless we are specifically looking for rare values), yet they make the output noisy and harder to process. In such situations, it helps to filter the summary down to the parts that are useful to the task at hand, which is what we cover in this section.

By Frequency

We wish to see only values of a non-numeric column that have a certain frequency or higher (or lower, perhaps to spot possible encoding issues).

In this example, we wish to get values of a variable that exist 10 times or more in the data set.

df['col_1'].value_counts().loc[lambda x: x >= 10]

Here is how this works:

  • We select the column of interest via df['col_1'].
  • We apply value_counts() as we did above to obtain a vector (Series) with the number of occurrences (frequency) of each value of the selected column (col_1).
  • We use .loc[] to filter the Series produced by value_counts() via a Boolean condition.
  • We use a callable (anonymous function), lambda x: x >= 10, inside loc[] to apply the condition to each value of the input Series; a small demonstration follows below.
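
As a small demonstration with made-up data (using a threshold of 2 so the example stays short):

import pandas as pd

s = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'])

s.value_counts().loc[lambda x: x >= 2]  # keeps a (5) and b (2); c (1) is filtered out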

By Proportion

We wish to see only values of a non-numeric column that have a certain proportion or higher (or lower, perhaps to spot possible encoding issues).

In this example, we wish to get values of a non-numeric variable that comprise 10% or more of the data set.

df['col_1'].value_counts(normalize=True).loc[lambda x: x >= 0.1]

Here is how this works:

This works similarly to the “By Frequency” scenario above, except that we set the argument normalize=True when we call value_counts() so that it generates the proportion (the ratio of the number of rows that take each value to the total number of non-missing rows) instead of the count.
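
Continuing the made-up example from the frequency scenario, this time with a 20% threshold:

import pandas as pd

s = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'])

s.value_counts(normalize=True).loc[lambda x: x >= 0.2]  # keeps a (0.625) and b (0.25); c (0.125) is filtered out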

By Value

We wish to get the frequency of only a particular set of a column’s possible values (rather than all of its possible values).

In this example, we wish to get the number of occurrences of two possible values, ‘a’ and ‘f’, of the column col_2.

df['col_2'].value_counts().loc[['a', 'f']]

Here is how this works:

  • We use value_counts() to compute a frequency distribution. The output is a Series object whose index is the values of the original categorical variable, in this case col_2.
  • We then use loc[] to select a few particular values of the categorical variable (since the index is conveniently set as the values of the original categorical variable).
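
One caveat worth noting: loc[] raises a KeyError if any of the requested values does not occur in the column at all. If that is a possibility, reindex() is a safer alternative since it fills in missing labels instead of raising:

# Returns 0 (rather than raising) for any listed value absent from col_2
df['col_2'].value_counts().reindex(['a', 'f'], fill_value=0)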

Most Frequent

We wish to limit the output to just the most prevalent (or least prevalent) values of the variable being summarized.

df['col_1'].value_counts().head(10)

Here is how this works:

  • We apply value_counts() to the column of interest (col_1) as described above.
  • We then apply head() or tail() to get the top n or bottom n values; in this case, n=10. A small demonstration follows below.
  • Note that value_counts() sorts its output by descending order of frequency by default.
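
For instance, with made-up data and n=2:

import pandas as pd

s = pd.Series(['a'] * 4 + ['b'] * 3 + ['c'] * 2 + ['d'])

s.value_counts().head(2)  # the 2 most frequent values: a (4) and b (3)
s.value_counts().tail(2)  # the 2 least frequent values: c (2) and d (1)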