For each group in a grouped data frame, we wish to filter rows by the rank of the values of a specific column.
This section is organized to cover the most common scenarios as follows:
We wish to return the row (or rows) where a particular column has its largest value for each group.
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its max value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.loc[x['col_2'] == x['col_2'].max()])
.reset_index(drop=True))
Here is how this works:
We use apply() to execute the filtering logic on each of the groups created by df.groupby('col_1').
Inside each group x, we use loc[] to keep the rows where ‘col_2’ is equal to the max value of ‘col_2’ for the group, which is computed via x['col_2'].max().
We add .reset_index(drop=True) to the chain to reset the index. Please omit it if your situation requires keeping the original index.
Alternatively:
df_2 = (df
.loc[df['col_2'] == (df
.groupby('col_1')['col_2']
.transform('max'))]
.reset_index(drop=True))
Here is how this works:
We use transform() to create a Series where each value is the maximum value of the column ‘col_2’ for the row’s group. See Grouped Transformation.
We compare this Series of group max values with ‘col_2’, which returns True for rows where ‘col_2’ is equal to the max value for its group.
The matching rows are then returned by loc[].
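As a quick illustration of the transform() approach, here is a minimal sketch with hypothetical data (the column names follow the example above). Note that ties for the group max are all kept:

```python
import pandas as pd

# Hypothetical sample data: two groups 'a' and 'b' in col_1
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b'],
    'col_2': [1, 3, 3, 5, 2],
})

# Keep the row(s) where col_2 equals the group max of col_2
df_2 = (df
        .loc[df['col_2'] == (df
                             .groupby('col_1')['col_2']
                             .transform('max'))]
        .reset_index(drop=True))
print(df_2)  # both rows with value 3 in group 'a', and the row with 5 in 'b'
```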
We wish to return the row (or rows) where a particular column has its smallest value for each group.
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its smallest value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.loc[x['col_2'] == x['col_2'].min()])
.reset_index(drop=True))
Here is how this works:
This works similarly to the “Largest” scenario described above.
Alternatively:
df_2 = (df
.loc[df['col_2'] == (df
.groupby('col_1')['col_2']
.transform('min'))]
.reset_index(drop=True))
Here is how this works:
This works similarly to the “Largest” scenario described above.
We wish to return the row (or rows) where a particular column has its nth largest or smallest value for each group.
Smallest
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its third smallest value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(method='dense')
.eq(3)])
.reset_index(drop=True))
Here is how this works:
We use apply() to execute the filtering logic on each of the groups created by df.groupby('col_1').
rank() returns a ranking of the rows by the values of a given column. We set method='dense' to perform a type of ranking called dense ranking. Dense ranking is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
x['col_2'].rank(method='dense').eq(3) evaluates to True for rows where the rank of the value of ‘col_2’ is 3, i.e. where ‘col_2’ takes its third smallest value.
Largest
In this example, we wish to return the row (or rows) of each group where the column ‘col_2’ has its third largest value for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.eq(3)])
.reset_index(drop=True))
Here is how this works:
This works similarly to the “nth / Smallest” scenario above except that we set the argument ascending of rank() to False so that the ranking starts at the largest value, i.e. rank 1 is assigned to the largest value rather than the smallest.
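To make the ranking behavior concrete, here is a minimal sketch with hypothetical data. Because dense ranking lets tied values share a rank, "third largest" here means the third largest distinct value in each group:

```python
import pandas as pd

# Hypothetical data; in group 'a' the distinct values are 5, 4, 1
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': [5, 5, 4, 1, 9, 8, 7],
})

# Third largest distinct value per group (descending dense rank equal to 3)
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x
               .loc[x['col_2']
                    .rank(ascending=False, method='dense')
                    .eq(3)])
        .reset_index(drop=True))
print(df_2['col_2'].tolist())  # group 'a' contributes 1, group 'b' contributes 7
```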
We wish to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.
Count
We have a data frame that is grouped by one or more columns. For each group, we wish to return the rows for which a particular column takes its n largest values.
In this example, we wish to return the rows from each group where the column ‘col_2’ has its 3 largest values. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.le(3)])
.reset_index(drop=True))
Here is how this works:
With x['col_2'].rank(ascending=False, method='dense') we compute the dense ranking of the values of the column ‘col_2’ in descending order, i.e. rank 1 is assigned to the largest value (the default, ascending=True, assigns rank 1 to the smallest value). Dense ranking is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
le() is a convenience function for carrying out a less than or equal comparison. See Conditional Operations. We use .le(3) to return any row where the rank of the value of the column ‘col_2’ is less than or equal to 3.
We add .reset_index(drop=True) to the chain to reset the index. Please omit it if your situation requires keeping the original index.
Alternatively:
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.nlargest(3, 'col_2'))
.reset_index(drop=True))
Here is how this works:
nlargest() returns the rows of a data frame where a particular column (or combination of columns) has its highest values. The number of rows to return is specified by the argument n.
A grouped data frame (a DataFrameGroupBy object) has no nlargest() method. Therefore, we use apply() to apply nlargest() to each group (sub data frame).
With nlargest(3, 'col_2') we specify that we wish to return the 3 rows where the column ‘col_2’ takes its highest values (hence n=3).
nlargest() applies to numeric data types only, not to object or categorical data types. For instance, should ‘col_2’ be a string (object) data type, we would get the error: Column 'col_2' has dtype object, cannot use method 'nlargest' with this dtype.
What happens if more than one row has the same value for the column of interest, ‘col_2’? By default, nlargest() will return the n rows where the column of interest has its highest values, some of which may share the same value, so the number of unique values returned may be smaller than n. If the cut-off occurs within a set of duplicates, the rows that appear first are prioritized (the default is keep='first'). To prioritize the rows that appear last instead, we can set keep='last'.
If we wish to return the rows corresponding to the n largest unique values (which is what we want in this scenario), then the solution above that uses dense ranking is the way to go. Example input: 2, 2, 2, 3, 3, 4. Example output: 3, 3, 4 for n = 2.
If we wish to return exactly n rows, accepting that some of the returned rows may share a value for the column(s) of interest and that some duplicate values may be dropped, then we can use nlargest() with keep='first' or keep='last'. Example input: 2, 3, 3, 4. Example output: 4, 3 for n = 2.
If we wish to return the rows corresponding to the n largest values while keeping all duplicates, then we can use nlargest() with keep='all'. Example input: 2, 3, 3, 4. Example output: 4, 3, 3 for n = 2.
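The effect of keep on ties at the cut-off can be sketched with a small hypothetical Series:

```python
import pandas as pd

# Hypothetical values with a tie (3, 3) straddling the n=2 cut-off
s = pd.Series([2, 3, 3, 4])

print(s.nlargest(2).tolist())              # keep='first' (default) -> [4, 3]
print(s.nlargest(2, keep='all').tolist())  # all tied rows kept -> [4, 3, 3]
```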
Proportion
In this example, we wish to return the rows from each group where the column ‘col_2’ takes a value within the top 20% of the unique values of ‘col_2’ for the group. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.le(x['col_2'].nunique() * 0.2)])
.reset_index(drop=True))
Here is how this works:
nunique() returns the number of unique values that the column ‘col_2’ takes. See Uniqueness.
x['col_2'].nunique() * 0.2 tells us how many values make up 20% of the total number of unique values of the column ‘col_2’ for the group.
The condition inside the lambda function returns True for rows where the dense rank of the value of ‘col_2’ (for the group) is within the top 20% of the ranked values of ‘col_2’.
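As a minimal sketch with hypothetical data (a single group with 10 unique values, so the top 20% corresponds to the top 2 ranks):

```python
import pandas as pd

# Hypothetical data: group 'a' has 10 unique values 1..10
df = pd.DataFrame({
    'col_1': ['a'] * 10,
    'col_2': list(range(1, 11)),
})

# Keep rows whose descending dense rank is within 20% of the unique count
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x
               .loc[x['col_2']
                    .rank(ascending=False, method='dense')
                    .le(x['col_2'].nunique() * 0.2)])
        .reset_index(drop=True))
print(df_2['col_2'].tolist())  # top 20% of 10 unique values -> [9, 10]
```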
We wish to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.
In this example, we wish to return the rows from each group where the column ‘col_2’ has its 3 smallest values. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(method='dense')
.le(3)])
.reset_index(drop=True))
Here is how this works:
This works similarly to the “Largest / Count” scenario above except that we rank in ascending order, so a row is returned where the dense rank of the value of ‘col_2’ is within the first three (smallest) values (ascending=True is the default).
Alternatively:
df_2 = (df
.groupby('col_1')
.apply(lambda x: x.nsmallest(3, 'col_2', keep='all'))
.reset_index(drop=True))
Here is how this works:
We use nsmallest() to extract the rows with the smallest values from each group. It works like nlargest() (described above) but for the smallest values; here we set keep='all' to keep all duplicates at the cut-off.
For each group, we wish to return the rows where a particular column takes a value whose rank falls in a given range of value ranks.
In this example, we wish to return rows that take the second through to the fifth ranked possible values of the column ‘col_2’, where the largest value takes the first rank. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.between(2, 5)])
.reset_index(drop=True))
Here is how this works:
We use apply() to execute the filtering logic on each of the groups created by df.groupby('col_1').
rank() returns a ranking of the rows by the values of a given column. We set method='dense' to perform a type of ranking called dense ranking. Dense ranking is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
We set ascending=False so that the ranking starts at the largest value, i.e. rank 1 is the largest not the smallest value.
We use between() to check if a value (the output of rank()) is within a given range (2 and 5 inclusive). See Numerical Operations.
We add .reset_index(drop=True) to the chain to reset the index. Please omit it if your situation requires keeping the original index.
For each group, we wish to return the rows where a particular column takes a value whose rank is one of a set of value ranks.
In this example, for each group, we wish to return rows where the column ‘col_2’ takes values that are ranked first, third or fifth among the possible values of the column ‘col_2’, where the largest value takes the first rank. The groups are defined by the column ‘col_1’.
df_2 = (df
.groupby('col_1')
.apply(lambda x: x
.loc[x['col_2']
.rank(ascending=False, method='dense')
.isin([1, 3, 5])])
.reset_index(drop=True))
Here is how this works:
We use isin() to check whether the rank of the value of ‘col_2’ (computed over the group) for a row is in the list of our target value ranks, [1, 3, 5].
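A minimal sketch with hypothetical data (a single group with six distinct values, ranked in descending order):

```python
import pandas as pd

# Hypothetical data: one group 'a' with six distinct values
df = pd.DataFrame({
    'col_1': ['a'] * 6,
    'col_2': [10, 20, 30, 40, 50, 60],
})

# Keep rows whose descending dense rank is 1, 3 or 5
df_2 = (df
        .groupby('col_1')
        .apply(lambda x: x
               .loc[x['col_2']
                    .rank(ascending=False, method='dense')
                    .isin([1, 3, 5])])
        .reset_index(drop=True))
print(df_2['col_2'].tolist())  # ranks 5, 3, 1 in row order -> [20, 40, 60]
```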