We wish to obtain or learn about the unique values in data.

In particular, this section covers the following:

- In **Get Unique Values**, we cover how to obtain the actual unique values.
- In **Count Unique Values**, we cover how to obtain the number of possible unique values.
- In **Occurrence Frequency**, we cover how to compute the frequency of occurrence of each unique value.
- In **Occurrence Proportion**, we cover how to compute the ratio between the number of occurrences of each unique value and the total number of occurrences.

In each of those, we cover two scenarios:

- **One Column**: The unique values a single column (i.e. a vector) can take.
- **Multiple Columns**: The unique combinations a set of columns can take.

We wish to obtain unique values.

**One Column**

We wish to obtain the unique values in one column of a data frame (or in one vector of values).

In this example, we wish to obtain the unique values of the column `col_1` of the data frame `df`.

```
df['col_1'].unique()
```

Here is how this works:

We use the method `unique()` to obtain an array containing the unique values taken by a column of a data frame (i.e. by a vector of values).
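As a minimal sketch with a hypothetical data frame, the result is a NumPy array holding the distinct values in order of first appearance:

```python
import pandas as pd

# Hypothetical example data frame.
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'c', 'b']})

# unique() returns a NumPy array of the distinct values
# in order of first appearance.
unique_values = df['col_1'].unique()
print(unique_values)  # ['a' 'b' 'c']
```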

*Alternative: Non-Series Vector*

```
lst = ['a', 'a', 'b', 'b']
pd.unique(lst)
```

Here is how this works:

If we wish to find the unique values in a non-Series array-like object, e.g. a list or an array, we call unique via its package identifier, i.e. `pd.unique()`.

*Alternative: Convert to Set*

In this example, we wish to obtain the unique values of the column `col_2` for each group, where the groups are specified by the column `col_1`.

```
df_2 = df \
    .groupby('col_1') \
    .agg(col_2_unique=('col_2', set))
```

Here is how this works:

By definition, elements of a set are unique. Therefore, we can obtain unique elements by converting an array-like object to a set.
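Outside of a grouped aggregation, the same idea applies to any array-like object. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data frame.
df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b'],
                   'col_2': [1, 2, 2, 3]})

# Converting to a set keeps one copy of each value;
# note that sets are unordered.
unique_set = set(df['col_2'])
print(unique_set)
```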

**Multiple Columns**

We wish to obtain the unique combinations of values of a set of columns of a data frame.

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
     'col_2': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
     'col_3': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})
df_2 = df.loc[:, ['col_1', 'col_2']].drop_duplicates()
```

Here is how this works:

- In `loc[:, ['col_1', 'col_2']]`, we produce a sub-data-frame containing the columns whose value combinations we are interested in obtaining. See Selecting by Name.
- We then call `drop_duplicates()` on the resulting sub data frame to get the unique combinations that the selected columns take in the data. See Duplicates.
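Running the example above yields one row per distinct `<col_1, col_2>` pair; the row indices of the first occurrences are retained:

```python
import pandas as pd

df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
     'col_2': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
     'col_3': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})

# Keep the first row of each distinct <col_1, col_2> pair.
df_2 = df.loc[:, ['col_1', 'col_2']].drop_duplicates()
print(df_2)
#   col_1 col_2
# 0     a     a
# 2     a     b
# 4     b     a
# 6     b     b
```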

*Extension: Unique Rows*

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'a'],
     'col_2': ['a', 'b', 'b', 'a', 'b', 'a']})
df_2 = df.drop_duplicates()
```

Here is how this works:

If we wish to find unique combinations of all columns of the data frame, we simply call `drop_duplicates()` on the data frame. See Duplicates.
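By default, `drop_duplicates()` keeps the first occurrence of each duplicated row; its `keep` argument controls this. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data frame: rows 0 and 1 are duplicates.
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'b'],
     'col_2': ['a', 'a', 'b']})

first = df.drop_duplicates()                # keep='first' is the default
last = df.drop_duplicates(keep='last')      # keep the last occurrence instead
no_dups = df.drop_duplicates(keep=False)    # drop all duplicated rows entirely
```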

We wish to obtain the number of possible unique values.

**One Column**

We wish to obtain the number of possible unique values in a column of a data frame (or in one vector of values).

In this example, we have a data frame `df` and we wish to obtain the number of unique values of the column `col_2` per group, where the groups are specified by the column `col_1`.

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'b', 'b'],
     'col_2': [10, 15, 20, 20]})
df_2 = df.groupby('col_1')['col_2'].nunique()
```

Here is how this works:

- We use the `nunique()` method to compute the number of unique values of the column `col_2` for each group.
- The input to `nunique()` is a Series and the output is a single integer value.
- When called on a grouped Series, as in this case, the output is a Series of integer values.
- By default, `nunique()` excludes missing values from the unique counts. To count missing values as one unique value, we can set the argument `dropna` of `nunique()` to `dropna=False`.
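To illustrate the `dropna` argument, a minimal sketch with a missing value:

```python
import pandas as pd
import numpy as np

# Hypothetical example Series containing a missing value.
s = pd.Series([10, 10, np.nan, 20])

n_default = s.nunique()              # missing values excluded from the count
n_with_na = s.nunique(dropna=False)  # missing values counted as one unique value
print(n_default, n_with_na)  # 2 3
```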

**Multiple Columns**

We wish to obtain the number of possible unique combinations of a set of columns of a data frame.

In this example, we have a data frame `df` and we wish to obtain the number of unique combinations of the columns `col_2` and `col_3` per group, where the groups are specified by the column `col_1`.

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
     'col_2': [10, 10, 15, 20, 20, 20],
     'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df \
    .drop_duplicates(['col_1', 'col_2', 'col_3']) \
    .groupby('col_1', as_index=False) \
    .agg(n_unique=('col_3', 'size'))
```

Here is how this works:

- We use `drop_duplicates()` to retain one instance of each `<col_1, col_2, col_3>` combination.
- We then group by `col_1`, and use `agg()` with the `'size'` aggregation to count the number of combinations of `col_2` and `col_3` for each group.
- Even though `'size'` counts the number of rows, we still need to pass a column name. This feels unnecessary but is just how `agg()` works. See Aggregating.

*Alternative: via Sequential Grouping*

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
     'col_2': [10, 10, 15, 20, 20, 20],
     'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df \
    .groupby(['col_1', 'col_2', 'col_3'], as_index=False) \
    .agg('size') \
    .groupby('col_1', as_index=False) \
    .agg(n_unique=('col_3', 'size'))
```

Here is how this works:

- To compute the number of unique combinations of multiple columns for each group, we carry out two aggregations in sequence.
- In the first aggregation, we group by both the grouping column, which here is `col_1`, and the columns whose unique combinations we wish to count, which here are `col_2` and `col_3`. The purpose of this is to keep just one instance of each `<col_1, col_2, col_3>` combination.
- In the second aggregation, we count the number of combinations of `col_2` and `col_3` for each group specified by the value of `col_1`.

*Alternative: via Counting Unique Combination*

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
     'col_2': [10, 10, 15, 20, 20, 20],
     'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df \
    .groupby('col_1', as_index=False) \
    .apply(lambda x: len(x.drop_duplicates(['col_2', 'col_3']))) \
    .rename(columns={None: 'n_unique'})
```

Here is how this works:

- This is perhaps the most straight forward of the solutions provided here but is a bit more challenging to implement correctly.
- We start by grouping by
`col_1`

then we use`apply()`

to count the number of combinations in each group. `apply()`

gets the sub-data frame corresponding to each group and it then drops duplicates of the combinations of values of`col_2`

and`col_3`

and counts the resulting combinations.- Finally, we set the name of the resulting column form
`None`

to`‘n_unique’`

. See Renaming.

We wish to compute the frequency of occurrence of each unique value.

**One Column**

We wish to compute the frequency of occurrence of each unique value of a column of a data frame (or a vector of values).

In this example, we wish to obtain the number of occurrences of each unique value of the column `col_1`.

```
df['col_1'].value_counts()
```

Here is how this works:

- We use the method `value_counts()` to obtain the number of occurrences of each unique value in the column `col_1`.
- The output of `value_counts()` is a Series where the index holds the unique values and the values are the integer counts of the number of occurrences of the corresponding unique value.
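A minimal sketch of this output structure, with hypothetical data:

```python
import pandas as pd

# Hypothetical example Series.
s = pd.Series(['a', 'b', 'a', 'a', 'b'])

# value_counts() returns a Series: unique values in the index,
# counts as the values, sorted in descending order of count.
counts = s.value_counts()
print(counts)
# a    3
# b    2
```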

*Extension: Most Frequent Value*

```
df_2 = df \
    .groupby('col_1', as_index=False) \
    .agg(col_2_mode=('col_2', pd.Series.mode))
```

Here is how this works:

- The most frequent value in a vector of values is often referred to as the
*mode*. - We use the
`mode()`

method of Pandas Series to identify the most frequent value. - We call
`mode()`

inside`agg()`

, which is called after`groupby()`

to obtain a summary for each group. See Aggregating.
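Note that `mode()` can also be called directly on a Series. It returns a Series rather than a scalar because there may be ties; a sketch with hypothetical data:

```python
import pandas as pd

# A single most frequent value: mode() returns a one-element Series.
s = pd.Series([10, 20, 20, 30])
print(s.mode())

# Tied most frequent values: mode() returns all of them, sorted.
s_tie = pd.Series([10, 10, 20, 20])
print(s_tie.mode())
```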

*Extension: Include Missing*

```
df['col_1'].value_counts(dropna=False)
```

Here is how this works:

- Missing values are excluded by default.
- We can set the `dropna` argument of the `value_counts()` method to `False` to include missing values, i.e. include a single unique value for missing values whose frequency is the number of missing values in the column.

**Multiple Columns**

We wish to compute the frequency of occurrence of each unique combination of values of a set of columns of a data frame.

In this example, we wish to obtain the number of occurrences of each unique combination of values of the columns `col_1` and `col_2`.

```
df_2 = df.value_counts(['col_1', 'col_2']).reset_index()
```

Here is how this works:

- We use the method `value_counts()` to obtain the frequency of occurrence of each unique combination of values.
- We call `value_counts()` on the data frame, which here is `df`, and pass it the names of the columns whose unique value combinations we wish to count, which here are `col_1` and `col_2`.
- The output of `value_counts()` in this case is a Series with a `MultiIndex`. We call `reset_index()` to convert the Series into a data frame where the `MultiIndex` levels are converted to columns. This can be skipped in data inspection scenarios.
- While we do not show it here, if the Series output of `value_counts()` is converted to a data frame, we often need to rename the count column via something like `.rename(columns={0: 'n_unique'})`. See Renaming.
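A possible shortcut for naming the count column is the `name` argument of `Series.reset_index()`; a sketch with hypothetical data (the column name `n` is arbitrary):

```python
import pandas as pd

# Hypothetical example data frame.
df = pd.DataFrame({'col_1': ['a', 'a', 'b'],
                   'col_2': ['x', 'x', 'y']})

# Name the count column directly while converting to a data frame.
df_2 = df.value_counts(['col_1', 'col_2']).reset_index(name='n')
```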

*Alternative: Traditional Aggregation*

```
df_2 = df \
    .groupby(['col_1', 'col_2'], as_index=False) \
    .size() \
    .sort_values(by='size', ascending=False)
```

Here is how this works:

- We use `groupby()` to create groups by the columns whose combinations we wish to create a frequency table of. See Aggregating.
- We then use `size()` on the groups to compute the size of each group. See Length.
- We use `sort_values()` to sort in descending order of group size. See Sorting.

We wish to compute the ratio between the number of occurrences of each unique value to the total number of occurrences.

**One Column**

We wish to compute the ratio between the number of occurrences of each unique value of a column of a data frame to the length of the column.

In this example, we wish to obtain the ratio between the number of occurrences of each unique value of the column `col_1` of the data frame `df` to the length of the column.

```
df['col_1'].value_counts(normalize=True)
```

Here is how this works:

- To obtain the proportion of the total number of rows that take each possible value of the column, we call `value_counts()` with the argument `normalize=True`.
- Note that `value_counts()` excludes missing values by default, so the proportions might not add up to 1. To get around that, we can set the argument `dropna=False`, which will add a single unique value for missing values.
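To make the proportions sum to 1 in the presence of missing values, a sketch combining the two arguments on hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical example Series with one missing value.
s = pd.Series(['a', 'a', 'b', np.nan])

# normalize=True returns proportions; dropna=False counts the
# missing values as one unique value, so the proportions sum to 1.
props = s.value_counts(normalize=True, dropna=False)
print(props.sum())  # 1.0
```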

**Multiple Columns**

We wish to compute the ratio between the number of occurrences of each unique combination of values of a set of columns of a data frame to the length of the data frame.

In this example, we wish to obtain the ratio between the number of occurrences of each unique combination of the values of the columns `col_1` and `col_2` of the data frame `df` to the length of the data frame.

```
df.value_counts(['col_1', 'col_2'], normalize=True)
```

Here is how this works:

This works similarly to the one-column scenario above, except that we pass to `value_counts()` a list of the columns whose value combinations we are interested in, which here is `['col_1', 'col_2']`.
