We wish to obtain or learn about the unique values in data.
In particular, this section covers the following:
- Obtaining the unique values.
- Obtaining the number of possible unique values.
- Computing the frequency of occurrence of each unique value.
- Computing the ratio of occurrences of each unique value to the total number of occurrences.
In each of those, we cover two scenarios: a single column and multiple columns.
We wish to obtain unique values.
One Column
We wish to obtain the unique values in one column of a data frame (or in one vector of values).
In this example, we wish to obtain the unique values of the column col_1 of the data frame df.
df['col_1'].unique()
Here is how this works:
- We use the method unique() to obtain an array containing the unique values taken by a column of a data frame (i.e. by a vector of values).
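As a self-contained sketch of the above (the data frame below is made up for illustration):

```python
import pandas as pd

# Hypothetical data frame; col_1 contains repeated values
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'c', 'b']})

# unique() returns an array of the distinct values,
# in order of first appearance
unique_values = df['col_1'].unique()
print(unique_values)  # ['a' 'b' 'c']
```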
Alternative: Non-Series Vector
lst = ['a', 'a', 'b', 'b']
pd.unique(lst)
Here is how this works:
- If we wish to find the unique values in a non-Series array-like object, e.g. a list or an array, we call unique via its package identifier, i.e. pd.unique().
Alternative: Convert to Set
In this example, we wish to obtain the unique values of the column col_2 for each group, where the groups are specified by the column col_1.
df_2 = df \
.groupby('col_1') \
.agg(col_2_unique=('col_2', set))
Here is how this works:
By definition, elements of a set are unique. Therefore, we can obtain unique elements by converting an array-like object to a set.
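A runnable sketch of the set-based, per-group variant shown above, with an illustrative df:

```python
import pandas as pd

# Hypothetical grouped data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [10, 15, 20, 20]})

# Converting each group's col_2 values to a set keeps one
# instance of each value; group 'a' yields {10, 15}, group 'b' yields {20}
df_2 = df.groupby('col_1').agg(col_2_unique=('col_2', set))
print(df_2)
```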
Multiple Columns
We wish to obtain the unique combinations of values of a set of columns of a data frame.
df = pd.DataFrame(
{'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'col_2': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
'col_3': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})
df_2 = df.loc[:, ['col_1', 'col_2']].drop_duplicates()
Here is how this works:
- Using loc[:, ['col_1', 'col_2']], we produce a sub-data-frame containing the columns whose value combinations we are interested in obtaining. See Selecting by Name.
- We call drop_duplicates() on the resulting sub data frame to get the unique combinations that the selected columns take in the data. See Duplicates.
Extension: Unique Rows
df = pd.DataFrame(
{'col_1': ['a', 'a', 'a', 'b', 'b', 'a'],
'col_2': ['a', 'b', 'b', 'a', 'b', 'a']})
df_2 = df.drop_duplicates()
Here is how this works:
If we wish to find unique combinations of all columns of the data frame, then we simply call drop_duplicates()
on the data frame. See Duplicates.
We wish to obtain the number of possible unique values.
One Column
We wish to obtain the number of possible unique values in a column of a data frame (or in one vector of values).
In this example, we have a data frame df and we wish to obtain the number of unique values of the column col_2 per group, where the groups are specified by the column col_1.
df = pd.DataFrame(
{'col_1': ['a', 'a', 'b', 'b'],
'col_2': [10, 15, 20, 20]})
df_2 = df.groupby('col_1')['col_2'].nunique()
Here is how this works:
- We use the nunique() method to compute the number of unique values of the column col_2 for each group.
- For each group, the input to nunique() is a Series and the output is a single integer value.
- nunique() excludes missing values in the unique counts. To retain missing values, i.e. to let missing values be counted as one unique value, we can set the argument dropna of nunique() to dropna=False.
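A minimal illustration of the dropna behavior (the Series below is made up):

```python
import pandas as pd
import numpy as np

# Hypothetical column containing a missing value
s = pd.Series([10, 15, np.nan, 15])

# By default, missing values are excluded from the count
n_default = s.nunique()              # 2

# With dropna=False, NaN counts as one additional unique value
n_with_na = s.nunique(dropna=False)  # 3
```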
Multiple Columns
We wish to obtain the number of possible unique combinations of a set of columns of a data frame.
In this example, we have a data frame df and we wish to obtain the number of unique combinations of the columns col_2 and col_3 per group, where the groups are specified by the column col_1.
df = pd.DataFrame(
{'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
'col_2': [10, 10, 15, 20, 20, 20],
'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df\
.drop_duplicates(['col_1', 'col_2', 'col_3'])\
.groupby('col_1', as_index=False)\
.agg(n_unique=('col_3', 'size'))
Here is how this works:
- We use drop_duplicates() to retain one instance of each <col_1, col_2, col_3> combination.
- We then group by col_1, and use agg() and size() to count the number of combinations of col_2 and col_3 for each group.
- While size() counts the number of rows, we still need to pass a column name. This is intuitively unnecessary but is just how agg() works. See Aggregating.
Alternative: via Sequential Grouping
df = pd.DataFrame(
{'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
'col_2': [10, 10, 15, 20, 20, 20],
'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df\
.groupby(['col_1', 'col_2', 'col_3'], as_index=False)\
.agg('size')\
.groupby('col_1', as_index=False)\
.agg(n_unique=('col_3', 'size'))
Here is how this works:
- We first group by col_1, and the columns whose unique combinations we wish to count, which here are col_2 and col_3. The purpose of this is to keep just one instance of each <col_1, col_2, col_3> combination.
- We then count the number of rows, i.e. the number of combinations of col_2 and col_3, for each group specified by the value of col_1.
Alternative: via Counting Unique Combinations
df = pd.DataFrame(
{'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
'col_2': [10, 10, 15, 20, 20, 20],
'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df\
.groupby('col_1', as_index=False)\
.apply(lambda x: len(x.drop_duplicates(['col_2', 'col_3'])))\
.rename(columns={None:'n_unique'})
Here is how this works:
- We group by col_1, then we use apply() to count the number of combinations in each group.
- apply() gets the sub-data frame corresponding to each group; it then drops duplicates of the combinations of values of col_2 and col_3 and counts the resulting combinations.
- We rename the resulting column from None to 'n_unique'. See Renaming.
We wish to compute the frequency of occurrence of each unique value.
One Column
We wish to compute the frequency of occurrence of each unique value of a column of a data frame (or a vector of values).
In this example, we wish to obtain the number of occurrences of each unique value of the column col_1.
df['col_1'].value_counts()
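A self-contained sketch of the above (df is made up for illustration):

```python
import pandas as pd

# Hypothetical data frame with repeated values in col_1
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'a', 'b']})

# value_counts() returns a Series: index = unique values,
# values = occurrence counts, sorted by count descending
counts = df['col_1'].value_counts()
print(counts)
# a    3
# b    2
```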
Here is how this works:
- We use value_counts() to obtain the number of occurrences of each unique value in the column col_1.
- The output of value_counts() is a Series where the index holds the unique values and the values are the integer counts of the number of occurrences of the corresponding unique value.
Extension: Most Frequent Value
df_2 = df\
.groupby('col_1', as_index=False)\
.agg(col_2_mode=('col_2', pd.Series.mode))
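A runnable sketch of the above, assuming an illustrative df where one value repeats within each group:

```python
import pandas as pd

# Hypothetical grouped data; within each group one value repeats
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': [10, 10, 15, 20, 25, 25]})

# pd.Series.mode returns the most frequent value of col_2 per group:
# 10 for group 'a', 25 for group 'b'
df_2 = df.groupby('col_1', as_index=False).agg(
    col_2_mode=('col_2', pd.Series.mode))
print(df_2)
```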
Here is how this works:
- We use the mode() method of Pandas Series to identify the most frequent value.
- We apply mode() inside agg(), which is called after groupby() to obtain a summary for each group. See Aggregating.
Extension: Include Missing
df['col_1'].value_counts(dropna=False)
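To illustrate with made-up data: value_counts() excludes missing values by default, and dropna=False counts them as their own entry:

```python
import pandas as pd
import numpy as np

# Hypothetical column with missing values
s = pd.Series(['a', 'a', np.nan, 'b', np.nan])

# dropna=False adds NaN as its own entry in the counts:
# a -> 2, NaN -> 2, b -> 1
counts = s.value_counts(dropna=False)
print(counts)
```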
Here is how this works:
- We set the dropna argument of the value_counts() method to False to include missing values, i.e. to include a single unique value for missing values whose frequency is the number of missing values in the column.
Multiple Columns
We wish to compute the frequency of occurrence of each unique combination of values of a set of columns of a data frame.
In this example, we wish to obtain the number of occurrences of each unique combination of values of the columns col_1 and col_2.
df_2 = df.value_counts(['col_1', 'col_2']).reset_index()
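A self-contained sketch of the above (df is made up for illustration):

```python
import pandas as pd

# Hypothetical data frame with repeated value combinations
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b'],
    'col_2': ['x', 'x', 'y', 'y']})

# Count each unique (col_1, col_2) combination, then flatten
# the MultiIndex into regular columns; (a, x) occurs twice
df_2 = df.value_counts(['col_1', 'col_2']).reset_index()
print(df_2)
```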
Here is how this works:
- We use value_counts() to obtain the frequency of occurrence of each unique combination of values.
- There are two differences between this and the one-column scenario: value_counts() is called on the data frame df itself, and we pass it the list of the columns whose value combinations we are interested in, here ['col_1', 'col_2'].
- The output of value_counts() in this case is a Series with a MultiIndex. We call reset_index() to convert the Series into a data frame where the MultiIndex is converted to columns. This can be ignored in data inspection scenarios.
- When the output of value_counts() is converted to a data frame, we often need to rename the count column via something like .rename(columns={0: 'n_unique'}). See Renaming.
Alternative: Traditional Aggregation
df_2 = df\
.groupby(['col_1', 'col_2'], as_index=False)\
.size()\
.sort_values(by='size', ascending=False)
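A runnable sketch of this aggregation approach, with an illustrative df:

```python
import pandas as pd

# Hypothetical data frame with repeated value combinations
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b'],
    'col_2': ['x', 'x', 'y', 'y']})

# One row per (col_1, col_2) combination, with its frequency
# in the 'size' column, most frequent first
df_2 = (df
        .groupby(['col_1', 'col_2'], as_index=False)
        .size()
        .sort_values(by='size', ascending=False))
print(df_2)
```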
Here is how this works:
- We use groupby() to create groups with the columns whose combinations we wish to create a frequency table of. See Aggregating.
- We call size() on the groups to compute the size of each group. See Length.
- We use sort_values() to sort in descending order of group size. See Sorting.
We wish to compute the ratio of the number of occurrences of each unique value to the total number of occurrences.
One Column
We wish to compute the ratio of the number of occurrences of each unique value of a column of a data frame to the length of the column.
In this example, we wish to obtain the ratio of the number of occurrences of each unique value of the column col_1 of the data frame df to the length of the column.
df['col_1'].value_counts(normalize=True)
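A self-contained sketch (df is made up for illustration):

```python
import pandas as pd

# Hypothetical column; 'a' appears 3 times out of 4
df = pd.DataFrame({'col_1': ['a', 'a', 'a', 'b']})

# normalize=True returns proportions rather than raw counts:
# a -> 0.75, b -> 0.25
ratios = df['col_1'].value_counts(normalize=True)
print(ratios)
```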
Here is how this works:
- We call value_counts() with the argument normalize=True.
- By default, value_counts() excludes missing values, so the proportions might not add up to 1. To get around that, we can set the argument dropna=False, which will add a single unique value for missing values.
Unique Combinations
We wish to compute the ratio of the number of occurrences of each unique combination of values of a set of columns of a data frame to the length of the data frame.
In this example, we wish to obtain the ratio of the number of occurrences of each unique combination of the values of the columns col_1 and col_2 of the data frame df to the length of the data frame.
df.value_counts(['col_1', 'col_2'], normalize=True)
Here is how this works:
- This works similarly to the one-column scenario above, except that we pass to value_counts() a list of the columns whose value combinations we are interested in, which here is ['col_1', 'col_2'].
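A runnable sketch of this (df is made up for illustration):

```python
import pandas as pd

# Hypothetical data frame with repeated value combinations
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b'],
    'col_2': ['x', 'x', 'y', 'y']})

# Proportion of rows taken by each (col_1, col_2) combination:
# (a, x) -> 0.50, (a, y) -> 0.25, (b, y) -> 0.25
ratios = df.value_counts(['col_1', 'col_2'], normalize=True)
print(ratios)
```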