We wish to obtain or learn about the unique values in data.

In particular, this section covers the following:

- In **Get Unique Values**, we cover how to obtain the actual unique values.
- In **Count Unique Values**, we cover how to obtain the number of possible unique values.
- In **Occurrence Frequency**, we cover how to compute the frequency of occurrence of each unique value.
- In **Occurrence Proportion**, we cover how to compute the ratio between the number of occurrences of each unique value and the total number of occurrences.

In each of those, we cover two scenarios:

- **One Column**: The unique values a single column (i.e. a vector) can take.
- **Multiple Columns**: The unique combinations a set of columns can take.

We wish to obtain unique values.

**One Column**

We wish to obtain the unique values in one column of a data frame (or in one vector of values).

In this example, we wish to obtain the unique values of the column `col_1` of the data frame `df`.

```
df['col_1'].unique()
```

Here is how this works:

We use the method `unique()` to obtain an array containing the unique values taken by a column of a data frame (i.e. by a vector of values).
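As a minimal sketch with a hypothetical data frame, the result is a NumPy array holding the distinct values in order of first appearance:

```python
import pandas as pd

# Hypothetical example data frame.
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'c', 'b']})

# unique() returns a NumPy array of the distinct values
# in order of first appearance.
unique_values = df['col_1'].unique()
print(unique_values)  # ['a' 'b' 'c']
```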

*Alternative: Non-Series Vector*

```
lst = ['a', 'a', 'b', 'b']
pd.unique(lst)
```

Here is how this works:

If we wish to find the unique values in a non-Series array-like object, e.g. a list or an array, we call unique via its package identifier, i.e. `pd.unique()`.

*Alternative: Convert to Set*

In this example, we wish to obtain the unique values of the column `col_2` for each group, where the groups are specified by the column `col_1`.

```
df_2 = df \
    .groupby('col_1') \
    .agg(col_2_unique=('col_2', set))
```

Here is how this works:

By definition, elements of a set are unique. Therefore, we can obtain unique elements by converting an array-like object to a set.
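Outside of a grouped aggregation, the same idea applies to any array-like object. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data frame.
df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b'],
                   'col_2': [1, 2, 2, 3]})

# Converting to a set keeps one copy of each value;
# note that sets are unordered.
unique_set = set(df['col_2'])
print(unique_set)
```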

**Multiple Columns**

We wish to obtain the unique combinations of values of a set of columns of a data frame.

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
     'col_2': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
     'col_3': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})
df_2 = df.loc[:, ['col_1', 'col_2']].drop_duplicates()
```

Here is how this works:

- In `loc[:, ['col_1', 'col_2']]`, we produce a sub-data-frame containing the columns whose value combinations we are interested in obtaining. See Selecting by Name.
- We then call `drop_duplicates()` on the resulting sub data frame to get the unique combinations that the selected columns take in the data. See Duplicates.
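Running the example above yields one row per distinct `<col_1, col_2>` pair; the row indices of the first occurrences are retained:

```python
import pandas as pd

df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
     'col_2': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
     'col_3': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})

# Keep the first row of each distinct <col_1, col_2> pair.
df_2 = df.loc[:, ['col_1', 'col_2']].drop_duplicates()
print(df_2)
#   col_1 col_2
# 0     a     a
# 2     a     b
# 4     b     a
# 6     b     b
```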

*Extension: Unique Rows*

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'a'],
     'col_2': ['a', 'b', 'b', 'a', 'b', 'a']})
df_2 = df.drop_duplicates()
```

Here is how this works:

If we wish to find unique combinations of all columns of the data frame, we simply call `drop_duplicates()` on the data frame. See Duplicates.
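By default, `drop_duplicates()` keeps the first occurrence of each duplicated row; its `keep` argument controls this. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data frame: rows 0 and 1 are duplicates.
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'b'],
     'col_2': ['a', 'a', 'b']})

first = df.drop_duplicates()                # keep='first' is the default
last = df.drop_duplicates(keep='last')      # keep the last occurrence instead
no_dups = df.drop_duplicates(keep=False)    # drop all duplicated rows entirely
```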

We wish to obtain the number of possible unique values.

**One Column**

We wish to obtain the number of possible unique values in a column of a data frame (or in one vector of values).

In this example, we have a data frame `df` and we wish to obtain the number of unique values of the column `col_2` per group, where the groups are specified by the column `col_1`.

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'b', 'b'],
     'col_2': [10, 15, 20, 20]})
df_2 = df.groupby('col_1')['col_2'].nunique()
```

Here is how this works:

- We use the `nunique()` method to compute the number of unique values of the column `col_2` for each group.
- The input to `nunique()` is a Series and the output is a single integer value.
- When called on a grouped Series, as in this case, the output is a Series of integer values.
- By default, `nunique()` excludes missing values from the unique counts. To count missing values as one unique value, we can set the argument `dropna` of `nunique()` to `dropna=False`.
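To illustrate the `dropna` argument, a minimal sketch with a missing value:

```python
import pandas as pd
import numpy as np

# Hypothetical example Series containing a missing value.
s = pd.Series([10, 10, np.nan, 20])

n_default = s.nunique()              # missing values excluded from the count
n_with_na = s.nunique(dropna=False)  # missing values counted as one unique value
print(n_default, n_with_na)  # 2 3
```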

**Multiple Columns**

We wish to obtain the number of possible unique combinations of a set of columns of a data frame.

In this example, we have a data frame `df` and we wish to obtain the number of unique combinations of the columns `col_2` and `col_3` per group, where the groups are specified by the column `col_1`.

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
     'col_2': [10, 10, 15, 20, 20, 20],
     'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df \
    .drop_duplicates(['col_1', 'col_2', 'col_3']) \
    .groupby('col_1', as_index=False) \
    .agg(n_unique=('col_3', 'size'))
```

Here is how this works:

- We use `drop_duplicates()` to retain one instance of each `<col_1, col_2, col_3>` combination.
- We then group by `col_1`, and use `agg()` with the `'size'` aggregation to count the number of combinations of `col_2` and `col_3` for each group.
- Even though `'size'` counts the number of rows, we still need to pass a column name. This feels unnecessary but is just how `agg()` works. See Aggregating.

*Alternative: via Sequential Grouping*

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
     'col_2': [10, 10, 15, 20, 20, 20],
     'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df \
    .groupby(['col_1', 'col_2', 'col_3'], as_index=False) \
    .agg('size') \
    .groupby('col_1', as_index=False) \
    .agg(n_unique=('col_3', 'size'))
```

Here is how this works:

- To compute the number of unique combinations of multiple columns for each group, we carry out two aggregations in sequence.
- In the first aggregation, we group by both the grouping column, which here is `col_1`, and the columns whose unique combinations we wish to count, which here are `col_2` and `col_3`. The purpose of this is to keep just one instance of each `<col_1, col_2, col_3>` combination.
- In the second aggregation, we count the number of combinations of `col_2` and `col_3` for each group specified by the value of `col_1`.

*Alternative: via Counting Unique Combination*

```
df = pd.DataFrame(
    {'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
     'col_2': [10, 10, 15, 20, 20, 20],
     'col_3': [1, 1, 2, 3, 4, 5]})
df_2 = df \
    .groupby('col_1', as_index=False) \
    .apply(lambda x: len(x.drop_duplicates(['col_2', 'col_3']))) \
    .rename(columns={None: 'n_unique'})
```

Here is how this works:

- This is perhaps the most straight forward of the solutions provided here but is a bit more challenging to implement correctly.
- We start by grouping by
`col_1`

then we use`apply()`

to count the number of combinations in each group. `apply()`

gets the sub-data frame corresponding to each group and it then drops duplicates of the combinations of values of`col_2`

and`col_3`

and counts the resulting combinations.- Finally, we set the name of the resulting column form
`None`

to`‘n_unique’`

. See Renaming.

We wish to compute the frequency of occurrence of each unique value.

**One Column**

We wish to compute the frequency of occurrence of each unique value of a column of a data frame (or a vector of values).

In this example, we wish to obtain the number of occurrences of each unique value of the column `col_1`.

```
df['col_1'].value_counts()
```

Here is how this works:

- We use the method `value_counts()` to obtain the number of occurrences of each unique value in the column `col_1`.
- The output of `value_counts()` is a Series where the index holds the unique values and the values are the integer counts of the number of occurrences of the corresponding unique value.
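A minimal sketch of this output structure, with hypothetical data:

```python
import pandas as pd

# Hypothetical example Series.
s = pd.Series(['a', 'b', 'a', 'a', 'b'])

# value_counts() returns a Series: unique values in the index,
# counts as the values, sorted in descending order of count.
counts = s.value_counts()
print(counts)
# a    3
# b    2
```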

*Extension: Most Frequent Value*

```
df_2 = df \
    .groupby('col_1', as_index=False) \
    .agg(col_2_mode=('col_2', pd.Series.mode))
```

Here is how this works:

- The most frequent value in a vector of values is often referred to as the
*mode*. - We use the
`mode()`

method of Pandas Series to identify the most frequent value. - We call
`mode()`

inside`agg()`

, which is called after`groupby()`

to obtain a summary for each group. See Aggregating.
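Note that `mode()` can also be called directly on a Series. It returns a Series rather than a scalar because there may be ties; a sketch with hypothetical data:

```python
import pandas as pd

# A single most frequent value: mode() returns a one-element Series.
s = pd.Series([10, 20, 20, 30])
print(s.mode())

# Tied most frequent values: mode() returns all of them, sorted.
s_tie = pd.Series([10, 10, 20, 20])
print(s_tie.mode())
```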

*Extension: Include Missing*

```
df['col_1'].value_counts(dropna=False)
```

Here is how this works:

- Missing values are excluded by default.
- We can set the `dropna` argument of the `value_counts()` method to `False` to include missing values, i.e. include a single unique value for missing values whose frequency is the number of missing values in the column.

**Multiple Columns**

We wish to compute the frequency of occurrence of each unique combination of values of a set of columns of a data frame.

In this example, we wish to obtain the number of occurrences of each unique combination of values of the columns `col_1` and `col_2`.

```
df_2 = df.value_counts(['col_1', 'col_2']).reset_index()
```

Here is how this works:

- We use the method `value_counts()` to obtain the frequency of occurrence of each unique combination of values.
- We call `value_counts()` on the data frame, which here is `df`, and pass it the names of the columns whose unique value combinations we wish to count, which here are `col_1` and `col_2`.
- The output of `value_counts()` in this case is a Series with a `MultiIndex`. We call `reset_index()` to convert the Series into a data frame where the `MultiIndex` levels are converted to columns. This can be skipped in data inspection scenarios.
- While we do not show it here, if the Series output of `value_counts()` is converted to a data frame, we often need to rename the count column via something like `.rename(columns={0: 'n_unique'})`. See Renaming.
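A possible shortcut for naming the count column is the `name` argument of `Series.reset_index()`; a sketch with hypothetical data (the column name `n` is arbitrary):

```python
import pandas as pd

# Hypothetical example data frame.
df = pd.DataFrame({'col_1': ['a', 'a', 'b'],
                   'col_2': ['x', 'x', 'y']})

# Name the count column directly while converting to a data frame.
df_2 = df.value_counts(['col_1', 'col_2']).reset_index(name='n')
```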

*Alternative: Traditional Aggregation*

```
df_2 = df \
    .groupby(['col_1', 'col_2'], as_index=False) \
    .size() \
    .sort_values(by='size', ascending=False)
```

Here is how this works:

- We use `groupby()` to create groups by the columns whose combinations we wish to create a frequency table of. See Aggregating.
- We then use `size()` on the groups to compute the size of each group. See Length.
- We use `sort_values()` to sort in descending order of group size. See Sorting.

We wish to compute the ratio between the number of occurrences of each unique value to the total number of occurrences.

**One Column**

We wish to compute the ratio between the number of occurrences of each unique value of a column of a data frame to the length of the column.

In this example, we wish to obtain the ratio between the number of occurrences of each unique value of the column `col_1` of the data frame `df` to the length of the column.

```
df['col_1'].value_counts(normalize=True)
```

Here is how this works:

- To obtain the proportion of the total number of rows that take each possible value of the column, we call `value_counts()` with the argument `normalize=True`.
- Note that `value_counts()` excludes missing values by default, so the proportions might not add up to 1. To get around that, we can set the argument `dropna=False`, which will add a single unique value for missing values.
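To make the proportions sum to 1 in the presence of missing values, a sketch combining the two arguments on hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical example Series with one missing value.
s = pd.Series(['a', 'a', 'b', np.nan])

# normalize=True returns proportions; dropna=False counts the
# missing values as one unique value, so the proportions sum to 1.
props = s.value_counts(normalize=True, dropna=False)
print(props.sum())  # 1.0
```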

**Multiple Columns**

We wish to compute the ratio between the number of occurrences of each unique combination of values of a set of columns of a data frame to the length of the data frame.

In this example, we wish to obtain the ratio between the number of occurrences of each unique combination of the values of the columns `col_1` and `col_2` of the data frame `df` to the length of the data frame.

```
df.value_counts(['col_1', 'col_2'], normalize=True)
```

Here is how this works:

This works similarly to the one-column scenario above, except that we pass to `value_counts()` a list of the columns whose value combinations we are interested in, which here is `['col_1', 'col_2']`.
