For each group in a grouped data frame, we wish to filter rows by the rank of the values of a specific column.
This section is organized to cover the most common scenarios as follows:
We wish to return the row (or rows) where a particular column has its largest value for each group.
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its max value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.loc[x['col_2'] == x['col_2'].max()])
.reset_index(drop=True))
Here is how this works:
We use apply() to execute the filtering logic on each of the groups created by df.groupby('col_1').
Inside each group x, we use loc[] to keep the rows where ‘col_2’ is equal to the max value of ‘col_2’ for the group, which is computed via x['col_2'].max().
We add .reset_index(drop=True) to the chain to reset the index. Please omit it if your situation requires keeping the original index.
Alternatively:
df_2 = (df
.loc[df['col_2'] == (df
.groupby('col_1')['col_2']
.transform('max'))]
.reset_index(drop=True))
Here is how this works:
We use transform() to create a Series where each value is the maximum value of the column ‘col_2’ for the row’s group. See Grouped Transformation.
We compare this Series of group max values with ‘col_2’, which returns True for rows where ‘col_2’ is equal to the max value for its group.
The matching rows are then returned by loc[].
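As a quick illustration of the transform() approach, here is a minimal sketch with hypothetical data (the column names follow the example above). Note that ties for the group max are all kept:

```python
import pandas as pd

# Hypothetical sample data: two groups 'a' and 'b' in col_1
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b'],
    'col_2': [1, 3, 3, 5, 2],
})

# Keep the row(s) where col_2 equals the group max of col_2
df_2 = (df
        .loc[df['col_2'] == (df
                             .groupby('col_1')['col_2']
                             .transform('max'))]
        .reset_index(drop=True))
print(df_2)  # both rows with value 3 in group 'a', and the row with 5 in 'b'
```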
We wish to return the row (or rows) where a particular column has its smallest value for each group.
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its smallest value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.loc[x['col_2'] == x['col_2'].min()])
.reset_index(drop=True))
Here is how this works:
This works similarly to the “Largest” scenario described above.
Alternatively:
df_2 = (df
.loc[df['col_2'] == (df
.groupby('col_1')['col_2']
.transform('min'))]
.reset_index(drop=True))
Here is how this works:
This works similarly to the “Largest” scenario described above.
We wish to return the row (or rows) where a particular column has its nth largest or smallest value for each group.
Smallest
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its third smallest value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(method='dense')
.eq(3)])
.reset_index(drop=True))
Here is how this works:
We use apply() to execute the filtering logic on each of the groups created by df.groupby('col_1').
rank() returns a ranking of the rows by the values of a given column. We set method='dense' to perform a type of ranking called dense ranking. Dense ranking is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
x['col_2'].rank(method='dense').eq(3) evaluates to True for rows where the rank of the value of ‘col_2’ is 3, i.e. where ‘col_2’ takes its third smallest value.
Largest
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its third largest value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.eq(3)])
.reset_index(drop=True))
Here is how this works:
This works similarly to the “nth / Smallest” scenario above except that we set the argument ascending of rank() to False so that the ranking starts at the largest value, i.e. rank 1 is assigned to the largest value rather than the smallest.
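To make the ranking behavior concrete, here is a minimal sketch with hypothetical data. Because dense ranking lets tied values share a rank, "third largest" here means the third largest distinct value in each group:

```python
import pandas as pd

# Hypothetical data; in group 'a' the distinct values are 5, 4, 1
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': [5, 5, 4, 1, 9, 8, 7],
})

# Third largest distinct value per group (descending dense rank equal to 3)
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x
               .loc[x['col_2']
                    .rank(ascending=False, method='dense')
                    .eq(3)])
        .reset_index(drop=True))
print(df_2['col_2'].tolist())  # group 'a' contributes 1, group 'b' contributes 7
```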
We wish to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.
Count
We have a data frame that is grouped by one or more columns. For each group, we wish to return the rows for which a particular column takes its n largest values.
In this example, we wish to return the rows from each group where the column ‘col_2’ has its 3 largest values. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.le(3)])
.reset_index(drop=True))
Here is how this works:
With x['col_2'].rank(ascending=False, method='dense') we compute the dense ranking of the values of the column ‘col_2’ in descending order, i.e. rank 1 is assigned to the largest value (the default, ascending=True, assigns rank 1 to the smallest value). Dense ranking is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
le() is a convenience function for carrying out a less than or equal comparison. See Conditional Operations. We use .le(3) to return any row where the rank of the value of the column ‘col_2’ is less than or equal to 3.
We add .reset_index(drop=True) to the chain to reset the index. Please omit it if your situation requires keeping the original index.
Alternatively:
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.nlargest(3, 'col_2'))
.reset_index(drop=True))
Here is how this works:
nlargest() returns the rows of a data frame where a particular column (or combination of columns) has its highest values. The number of rows to return is specified by the argument n.
A grouped data frame (a DataFrameGroupBy object) has no nlargest() method. Therefore, we use apply() to apply nlargest() to each group (sub data frame).
With nlargest(3, 'col_2') we specify that we wish to return the 3 rows where the column ‘col_2’ takes its highest values (hence n=3).
nlargest() applies to numeric data types only, not to object or categorical data types. For instance, should ‘col_2’ be a string (object) data type, we would get the error: Column 'col_2' has dtype object, cannot use method 'nlargest' with this dtype.
What happens if more than one row has the same value for the column of interest, ‘col_2’? By default, nlargest() will return the n rows where the column of interest has its highest values, some of which may share the same value, so the number of unique values returned may be smaller than n. If the cut-off occurs within a set of duplicates, the rows that appear first are prioritized (the default is keep='first'). To prioritize the rows that appear last instead, we can set keep='last'.
If we wish to return the rows corresponding to the n largest unique values (which is what we want in this scenario), then the solution above that uses dense ranking is the way to go. Example input: 2, 2, 2, 3, 3, 4. Example output: 3, 3, 4 for n = 2.
If we wish to return exactly n rows, accepting that some of the returned rows may share a value for the column(s) of interest and that some duplicate values may be dropped, then we can use nlargest() with keep='first' or keep='last'. Example input: 2, 3, 3, 4. Example output: 4, 3 for n = 2.
If we wish to return the rows corresponding to the n largest values while keeping all duplicates, then we can use nlargest() with keep='all'. Example input: 2, 3, 3, 4. Example output: 4, 3, 3 for n = 2.
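The effect of keep on ties at the cut-off can be sketched with a small hypothetical Series:

```python
import pandas as pd

# Hypothetical values with a tie (3, 3) straddling the n=2 cut-off
s = pd.Series([2, 3, 3, 4])

print(s.nlargest(2).tolist())              # keep='first' (default) -> [4, 3]
print(s.nlargest(2, keep='all').tolist())  # all tied rows kept -> [4, 3, 3]
```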
Proportion
In this example, we wish to return the rows from each group where the column ‘col_2’ takes a value within the top 20% of the unique values of ‘col_2’ for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.le(x['col_2'].nunique() * 0.2)])
.reset_index(drop=True))
Here is how this works:
nunique() returns the number of unique values that the column ‘col_2’ takes. See Uniqueness.
x['col_2'].nunique() * 0.2 tells us how many values make up 20% of the total number of unique values of the column ‘col_2’ for the group.
The condition inside the lambda function returns True for rows where the dense rank of the value of ‘col_2’ (for the group) is within the top 20% of the ranked values of ‘col_2’.
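As a minimal sketch with hypothetical data (a single group with 10 unique values, so the top 20% corresponds to the top 2 ranks):

```python
import pandas as pd

# Hypothetical data: group 'a' has 10 unique values 1..10
df = pd.DataFrame({
    'col_1': ['a'] * 10,
    'col_2': list(range(1, 11)),
})

# Keep rows whose descending dense rank is within 20% of the unique count
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x
               .loc[x['col_2']
                    .rank(ascending=False, method='dense')
                    .le(x['col_2'].nunique() * 0.2)])
        .reset_index(drop=True))
print(df_2['col_2'].tolist())  # top 20% of 10 unique values -> [9, 10]
```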
We wish to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.
In this example, we wish to return the rows from each group where the column ‘col_2’ has its 3 smallest values. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(method='dense')
.le(3)])
.reset_index(drop=True))
Here is how this works:
This works similarly to the “Largest / Count” scenario above except that we rank in ascending order, so a row is returned where the dense rank of the value of ‘col_2’ is within the first three (smallest) values (ascending=True is the default).
Alternatively:
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.nsmallest(3, 'col_2', keep='all'))
.reset_index(drop=True))
Here is how this works:
We use nsmallest() to extract the rows with the smallest values from each group. It works like nlargest() (described above) but for the smallest values; here we set keep='all' to keep all duplicates at the cut-off.
For each group, we wish to return the rows where a particular column takes a value whose rank falls in a given range of value ranks.
In this example, we wish to return rows that take the second through to the fifth ranked possible values of the column ‘col_2’, where the largest value takes the first rank. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.between(2, 5)])
.reset_index(drop=True))
Here is how this works:
We use apply() to execute the filtering logic on each of the groups created by df.groupby('col_1').
rank() returns a ranking of the rows by the values of a given column. We set method='dense' to perform a type of ranking called dense ranking. Dense ranking is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
We set ascending=False so that the ranking starts at the largest value, i.e. rank 1 is the largest not the smallest value.
We use between() to check if a value (the output of rank()) is within a given range (2 and 5 inclusive). See Numerical Operations.
We add .reset_index(drop=True) to the chain to reset the index. Please omit it if your situation requires keeping the original index.
For each group, we wish to return the rows where a particular column takes a value whose rank is one of a set of value ranks.
In this example, for each group, we wish to return rows where the column ‘col_2’ takes values that are ranked first, third or fifth among the possible values of the column ‘col_2’, where the largest value takes the first rank. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.isin([1, 3, 5])])
.reset_index(drop=True))
Here is how this works:
We use isin() to check whether the rank of the value of ‘col_2’ (computed over the group) for a row is in the list of our target value ranks, [1, 3, 5].
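A minimal sketch with hypothetical data (a single group with six distinct values, ranked in descending order):

```python
import pandas as pd

# Hypothetical data: one group 'a' with six distinct values
df = pd.DataFrame({
    'col_1': ['a'] * 6,
    'col_2': [10, 20, 30, 40, 50, 60],
})

# Keep rows whose descending dense rank is 1, 3 or 5
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x
               .loc[x['col_2']
                    .rank(ascending=False, method='dense')
                    .isin([1, 3, 5])])
        .reset_index(drop=True))
print(df_2['col_2'].tolist())  # ranks 5, 3, 1 in row order -> [20, 40, 60]
```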