For each group in a grouped data frame, we wish to filter rows by the rank of the values of a specific column.

This section is organized to cover the most common scenarios as follows:

- **Largest**: return the row (or rows) where a particular column has its largest value for each group.
- **Smallest**: return the row (or rows) where a particular column has its smallest value for each group.
- **nth**: return the row (or rows) where a particular column has its nth largest or smallest value for each group.
- **Largest Values**: return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.
- **Smallest Values**: return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.
- **Range**: return the rows where a particular column takes a value whose rank falls in a given range of value ranks for each group.
- **List**: return the rows where a particular column takes a value whose rank falls in a given list of value ranks for each group.

**Largest**

We wish to return the row (or rows) where a particular column has its largest value for each group.

In this example, we wish to return the row (or rows) of each group where the column `'col_2'` has its maximum value for the group. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.loc[x['col_2'] == x['col_2'].max()])
.reset_index(drop=True))
```

Here is how this works:

- We use `apply()` to execute the filtering logic on each of the groups created by `df.groupby('col_1')`.
- For each group, we compare the values of `'col_2'` to the maximum value of `'col_2'` for the group, which is computed via `x['col_2'].max()`.
- We get a patchy index, i.e. rows retain their original index values. To get a fresh index, we add `.reset_index(drop=True)` to the chain. Omit it if your situation requires keeping the original index.
- See By Value for a more detailed description.
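To make this concrete, here is a minimal sketch on a small, invented data frame (the column names `col_1` and `col_2` follow the example above):

```python
import pandas as pd

# Small illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 3, 2, 5, 5],
})

df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2'] == x['col_2'].max()])
        .reset_index(drop=True))

# Each group keeps every row holding its maximum, so ties are preserved:
# group 'a' keeps the row with 3; group 'b' keeps both rows with 5.
```

Note that because the comparison keeps all rows equal to the maximum, a tie yields more than one row per group.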

Alternatively,

```
df_2 = (df
.loc[df['col_2'] == (df
.groupby('col_1')['col_2']
.transform('max'))]
.reset_index(drop=True))
```

Here is how this works:

- We use `transform()` to create a `Series` where each value is the maximum value of the column `col_2` for the row's group. See Grouped Transformation.
- We compare that `Series` of group maximum values with `col_2`, which returns `True` where `col_2` is equal to the maximum value for the group. The matching rows are then returned by `loc[]`.
- See By Value for a more detailed description.
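As a quick sketch of the `transform()` approach, again with an invented data frame:

```python
import pandas as pd

# Small illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 3, 2, 5, 5],
})

# transform('max') broadcasts each group's maximum back to every row,
# so the comparison happens row-by-row over the whole data frame.
group_max = df.groupby('col_1')['col_2'].transform('max')
df_2 = df.loc[df['col_2'] == group_max].reset_index(drop=True)
```

A practical advantage of this variant is that it avoids `apply()` and keeps the rows in their original order.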

**Smallest**

We wish to return the row (or rows) where a particular column has its smallest value for each group.

In this example, we wish to return the row (or rows) of each group where the column `'col_2'` has its smallest value for the group. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.loc[x['col_2'] == x['col_2'].min()])
.reset_index(drop=True))
```

Here is how this works:

This works similarly to the “Largest” scenario described above.

Alternatively,

```
df_2 = (df
.loc[df['col_2'] == (df
.groupby('col_1')['col_2']
.transform('min'))]
.reset_index(drop=True))
```

Here is how this works:

This works similarly to the “Largest” scenario described above.
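A minimal sketch of the `transform()` variant for the smallest value (data frame invented for this sketch):

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 3, 2, 2, 5],
})

# Keep the rows holding each group's minimum; ties are preserved
# (group 'b' has two rows with the value 2).
df_2 = (df
        .loc[df['col_2'] == df.groupby('col_1')['col_2'].transform('min')]
        .reset_index(drop=True))
```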

**nth**

We wish to return the row (or rows) where a particular column has its nth largest or smallest value for each group.

**Smallest**

In this example, we wish to return the row (or rows) of each group where the column `'col_2'` has its third smallest value for the group. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(method='dense')
.eq(3)])
.reset_index(drop=True))
```

Here is how this works:

- We use `apply()` to execute the filtering logic on each of the groups created by `df.groupby('col_1')`.
- `rank()` returns a ranking of the rows by the values of a given column. We set `method='dense'` to perform a type of ranking called dense ranking. Dense ranking differs from regular ranking in that it leaves no gaps between ranks. See Ranking.
- The expression `x['col_2'].rank(method='dense').eq(3)` evaluates to `True` for rows where the dense rank of the value of `'col_2'` is 3, i.e. rows holding the third smallest distinct value.
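A minimal sketch with an invented data frame; note the duplicated smallest value in group 'a', which dense ranking counts only once:

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 1, 2, 3, 4, 5, 6],
})

# Dense ranking gives group 'a' the ranks 1, 1, 2, 3, so the third
# smallest *distinct* value is 3; in group 'b' it is 6.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2'].rank(method='dense').eq(3)])
        .reset_index(drop=True))
```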

**Largest**

In this example, we wish to return the row (or rows) of each group where the column `'col_2'` has its third largest value for the group. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.eq(3)])
.reset_index(drop=True))
```

Here is how this works:

This works similarly to the "nth / Smallest" scenario above, except that we set the argument `ascending` of `rank()` to `False` so that the ranking starts at the largest value, i.e. rank 1 is assigned to the largest value, not the smallest.
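A minimal sketch of the descending case (data frame invented for this sketch):

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'a'],
    'col_2': [5, 4, 3, 2],
})

# With ascending=False, rank 1 goes to the largest value, so eq(3)
# picks the third largest value, here 3.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2']
                               .rank(ascending=False, method='dense')
                               .eq(3)])
        .reset_index(drop=True))
```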

**Largest Values**

We wish to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.

**Count**

We have a data frame that is grouped by one or more columns. For each group, we wish to return the rows for which a particular column takes its n largest values.

In this example, we wish to return the rows from each group where the column `'col_2'` has its 3 largest values. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.le(3)])
.reset_index(drop=True))
```

Here is how this works:

- In `x['col_2'].rank(ascending=False, method='dense')` we compute the dense ranking of the values of the column `'col_2'` in descending order, i.e. rank 1 is assigned to the largest value (the default `ascending=True` assigns rank 1 to the smallest value). Dense ranking differs from regular ranking in that it leaves no gaps between ranks. See Ranking.
- `le()` is a convenience function for carrying out a less-than-or-equal comparison. See Conditional Operations.
- We compare the dense rank given to each row with the value 3 via `le(3)` to return any row where the rank of the value of the column `'col_2'` is less than or equal to 3.
- We get a patchy index, i.e. rows retain their original index values. To get a fresh index, we add `.reset_index(drop=True)` to the chain. Omit it if your situation requires keeping the original index.
- See the "nth" scenario above for more details.
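A minimal sketch with an invented data frame containing a duplicated value, to show that dense ranking keeps every row holding one of the 3 largest distinct values:

```python
import pandas as pd

# Illustrative data frame (invented for this sketch); 4 is duplicated.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'a', 'a'],
    'col_2': [4, 4, 5, 6, 1],
})

# Descending dense ranks are 3, 3, 2, 1, 4, so le(3) keeps every row
# holding one of the 3 largest distinct values: 6, 5, and both 4s.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2']
                               .rank(ascending=False, method='dense')
                               .le(3)])
        .reset_index(drop=True))
```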

*Alternatively:*

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.nlargest(3, 'col_2'))
.reset_index(drop=True))
```

Here is how this works:

- `nlargest()` returns the rows of a data frame where a particular column (or combination of columns) has its highest values. The number of rows to return is specified by the argument `n`.
- A grouped data frame (a `DataFrameGroupBy` object) has no `nlargest()` method. Therefore, we use `apply()` to apply `nlargest()` to each group (sub data frame).
- In `nlargest(3, 'col_2')` we specify that we wish to return the 3 rows where the column `'col_2'` takes its highest values (hence `n=3`).
- Note that `nlargest()` applies to numeric data types only, not to object or categorical data types. For instance, should `'col_2'` be of a string (object) data type, we would get the error `Column 'col_2' has dtype object, cannot use method 'nlargest' with this dtype`.
- How do duplicates, i.e. multiple rows where the column of interest (here `'col_2'`) has the same value, affect the result?
  - If there are no duplicates in the values of the column of interest, `nlargest()` returns `n` rows, each with a unique value for the column of interest.
  - If there are duplicates, `nlargest()` returns the `n` rows where the column of interest has its highest values, some of which may share the same value, so the number of unique values returned will be smaller than `n`. If the cutoff occurs within a set of duplicates, the rows that appear first are prioritized (the default is `keep='first'`).
  - We can instead prioritize the last rows within a set of duplicate values, dropping the rest, by setting `keep='last'`.
- What to do in the case of duplicates?
  - If we wish to get all the rows where the column of interest takes its `n` largest unique values (which is what we want in this scenario), the solution above that uses dense ranking is the way to go. Example input: 2, 2, 2, 3, 3, 4. Example output for `n = 2`: 4, 3, 3.
  - If we wish to return exactly `n` rows, accepting both that some of the returned rows may share a value for the column of interest and that some duplicate values may be dropped at the cutoff, we can use `nlargest()` with `keep='first'` or `keep='last'`. Example input: 2, 3, 3, 4. Example output for `n = 2`: 4, 3.
  - If we require that whenever a row with a given value is included, all rows with that same value are included (i.e. no duplicates are dropped at the cutoff), accepting that we may get more than `n` rows, we can use `nlargest()` with `keep='all'`. Example input: 2, 3, 3, 4. Example output for `n = 2`: 4, 3, 3.
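The duplicate-handling behaviors of `nlargest()` can be sketched on a single-group data frame (invented for this sketch) whose column of interest holds the values 2, 3, 3, 4:

```python
import pandas as pd

# One group; 'col_2' holds the values 2, 3, 3, 4.
df = pd.DataFrame({'col_1': ['a'] * 4, 'col_2': [2, 3, 3, 4]})

grouped = df.groupby('col_1')

# keep='first' (the default): exactly n rows; the cutoff falls inside
# the pair of 3s and the earlier row wins.
top_first = grouped.apply(lambda x: x.nlargest(2, 'col_2'))

# keep='all': no duplicate is dropped at the cutoff, so we may get
# more than n rows.
top_all = grouped.apply(lambda x: x.nlargest(2, 'col_2', keep='all'))
```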

**Proportion**

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.le(x['col_2'].nunique() * 0.2)])
.reset_index(drop=True))
```

Here is how this works:

- `nunique()` returns the number of unique values that the column `'col_2'` takes. See Uniqueness.
- `x['col_2'].nunique() * 0.2` tells us how many values make up 20% of the total number of distinct values of the column `'col_2'`.
- The `lambda` function returns `True` for rows where the dense rank of the value of `'col_2'` (for the group) is within the top 20% of the ranked values of `'col_2'`.
- See the description of the "Count" scenario above for more details.
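A minimal sketch with an invented single-group data frame of 10 distinct values, so the top 20% covers exactly 2 distinct values:

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a'] * 10,
    'col_2': list(range(1, 11)),  # 10 distinct values
})

# nunique() is 10, so the threshold is 10 * 0.2 = 2 and only the rows
# with descending dense ranks 1 and 2 (values 10 and 9) are kept.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2']
                               .rank(ascending=False, method='dense')
                               .le(x['col_2'].nunique() * 0.2)])
        .reset_index(drop=True))
```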

**Smallest Values**

We wish to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.

In this example, we wish to return the rows from each group where the column `'col_2'` has its 3 smallest values. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(method='dense')
.le(3)])
.reset_index(drop=True))
```

Here is how this works:

- For each group, we return the rows where the dense rank of the value of the column `'col_2'` is within the first three ranks.
- Since we are looking for the rows with the n smallest values, ranking in ascending order is appropriate (`ascending=True` is the default).
- See the description of the "Count" scenario under "Largest Values" above for more details.
- We can return a proportion rather than a count. See "Proportion" under "Largest Values" above.

*Alternatively:*

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.nsmallest(3, 'col_2', keep='all'))
.reset_index(drop=True))
```

Here is how this works:

- We use `nsmallest()` to extract the rows with the smallest values of `'col_2'` from each group.
- See the description of the alternative solution of the "Count" scenario under "Largest Values" above for more details.
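A minimal sketch with an invented data frame; group 'a' has a tie at the cutoff (which `keep='all'` retains) and group 'b' has fewer than 3 rows:

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'a', 'b', 'b'],
    'col_2': [5, 1, 2, 2, 9, 7],
})

# nsmallest() sorts each group's rows ascending by 'col_2';
# keep='all' retains every row tied at the cutoff, and groups with
# fewer than 3 rows simply return all their rows.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.nsmallest(3, 'col_2', keep='all'))
        .reset_index(drop=True))
```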

**Range**

For each group, we wish to return the rows where a particular column takes a value whose rank falls in a given range of value ranks.

In this example, we wish to return the rows that take the second through fifth ranked values of the column `'col_2'`, where the largest value takes the first rank. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.between(2, 5)])
.reset_index(drop=True))
```

Here is how this works:

- We use `apply()` to execute the filtering logic on each of the groups created by `df.groupby('col_1')`.
- `rank()` returns a ranking of the rows by the values of a given column. We set `method='dense'` to perform dense ranking, which leaves no gaps between ranks. See Ranking.
- We set `ascending=False` so that the ranking starts at the largest value, i.e. rank 1 is assigned to the largest value, not the smallest.
- We use the function `between()` to check whether a value (the output of `rank()`) falls within a given range (2 to 5 inclusive). See Numerical Operations.
- We get a patchy index, i.e. rows retain their original index values. To get a fresh index, we add `.reset_index(drop=True)` to the chain. Omit it if your situation requires keeping the original index.
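A minimal sketch with an invented single-group data frame:

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a'] * 6,
    'col_2': [10, 20, 30, 40, 50, 60],
})

# Descending dense ranks run 60 -> 1, 50 -> 2, ..., 10 -> 6, so
# between(2, 5) keeps the second through fifth largest values.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2']
                               .rank(ascending=False, method='dense')
                               .between(2, 5)])
        .reset_index(drop=True))
```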

**List**

For each group, we wish to return the rows where a particular column takes a value whose rank is one of a set of value ranks.

In this example, for each group, we wish to return the rows where the column `'col_2'` takes values that are ranked first, third, or fifth among the possible values of `'col_2'`, where the largest value takes the first rank. The groups are defined by the column `'col_1'`.

```
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.isin([1, 3, 5])])
.reset_index(drop=True))
```

Here is how this works:

- We use `isin()` to check whether the rank of the value of `'col_2'` (computed over the group) for a row is in the list of our target value ranks, `[1, 3, 5]`.
- See the description of the "Range" scenario above for more details.
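A minimal sketch with an invented single-group data frame:

```python
import pandas as pd

# Illustrative data frame (invented for this sketch)
df = pd.DataFrame({
    'col_1': ['a'] * 5,
    'col_2': [10, 20, 30, 40, 50],
})

# Descending dense ranks: 50 -> 1, 40 -> 2, 30 -> 3, 20 -> 4, 10 -> 5;
# isin([1, 3, 5]) keeps the first, third, and fifth largest values.
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x.loc[x['col_2']
                               .rank(ascending=False, method='dense')
                               .isin([1, 3, 5])])
        .reset_index(drop=True))
```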
