Grouped Sorting

We wish to sort in a grouping context. There are two common scenarios which we cover here:

  1. Sorting Groups: we sort the rows of a data frame by a property of the group they belong to (e.g. in descending order of group size), where the groups are defined by another column of the data frame.
  2. Sorting Within Groups: we sort the rows of each group separately, so that sorting happens within groups.

Sorting Groups

We wish to sort the rows of a data frame in order of a property of the group they belong to where the groups are defined by another column of the data frame.

In this example, we wish to sort the rows of a data frame df in descending order of the size of the group they belong to where the groups are defined by the column col_1.

(df
 .assign(group_size=df.groupby('col_1').transform('size'))
 .sort_values('group_size', ascending=False))

Here is how this works:

  • We use assign() to add a new column called group_size to the data frame df.
  • We use transform() to compute the ‘size’ of each group of the grouped data frame created by df.groupby('col_1') and to return that value for every row of the group. In place of size, we can use any other group attribute, e.g. the sum of a particular column, as the sorting quantity. See Grouped Transformation for more details.
  • Finally, we use sort_values() to sort the data frame df in descending order (hence ascending=False) of the value of ‘group_size’.
  • The resulting data frame will be sorted in descending order of the size of the group each row belongs to.
  • Note that the resulting data frame is not grouped (it is a DataFrame, not a DataFrameGroupBy). We cover how to add a column to a grouped data frame in Grouped Transformation.
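The steps above can be tried on a small data frame. The following is a minimal sketch, not a definitive implementation: the toy data is hypothetical, the column is selected explicitly before transform() so that a single Series is returned, and kind='stable' is an optional addition that keeps the original row order among rows whose groups tie on the sorting quantity.

```python
import pandas as pd

# Hypothetical toy data: col_1 defines the groups, col_2 holds a numeric value.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'b', 'c', 'c', 'c'],
    'col_2': [10, 1, 2, 3, 4, 5],
})

# Sort rows by the size of the group they belong to, largest group first.
by_size = (df
 .assign(group_size=df.groupby('col_1')['col_1'].transform('size'))
 .sort_values('group_size', ascending=False, kind='stable'))

# Same pattern with a different group property: the per-group sum of col_2.
by_sum = (df
 .assign(group_sum=df.groupby('col_1')['col_2'].transform('sum'))
 .sort_values('group_sum', ascending=False, kind='stable'))

print(by_size)
print(by_sum)
```

With this data, group 'c' (three rows) comes first when sorting by size, while sorting by the sum of col_2 places 'c' first, then 'a', then 'b'.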

Sorting Within Groups

We wish to sort a grouped data frame such that sorting happens within groups. In pandas, a grouped data frame (a DataFrameGroupBy object) cannot be sorted directly, so in this section we cover how to sort within groups.

In this example, we wish to group the data frame df by the column 'col_1' and then sort each group by the values of the column 'col_2' in ascending order.

(df
 .groupby('col_1', group_keys=False)
 .apply(lambda x: x.sort_values('col_2', ignore_index=True)))

Here is how this works:

  • So far we have been using the function sort_values() for nearly all data frame sorting scenarios. As it turns out, if we try to apply sort_values() to a grouped data frame, we get an error: 'DataFrameGroupBy' object has no attribute 'sort_values'.
  • We use apply() to sort the rows of each group individually. apply() passes each group as a data frame to the lambda function within.
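  • We apply sort_values() to each of those group data frames to sort in ascending order of ‘col_2’. We pass ignore_index=True so that each sorted group gets a fresh index starting at 0; since group_keys=False, the index labels of the combined result therefore repeat across groups.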
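To make the effect on the rows and on the index concrete, here is a minimal sketch on hypothetical toy data. As an assumption for compatibility across pandas versions, the columns are selected explicitly after groupby() so that the grouping column is kept in the result; the final reset_index(drop=True) is an optional extra step for when a unique index is needed.

```python
import pandas as pd

# Hypothetical toy data: col_1 defines the groups, col_2 is the sorting column.
df = pd.DataFrame({
    'col_1': ['b', 'a', 'b', 'a', 'b'],
    'col_2': [3, 2, 1, 5, 2],
})

# Sort the rows of each group by col_2; groups appear in key order ('a', then 'b').
sorted_within = (df
 .groupby('col_1', group_keys=False)[['col_1', 'col_2']]
 .apply(lambda x: x.sort_values('col_2', ignore_index=True)))

print(sorted_within)

# The index restarts at 0 inside every group; reset it to get unique labels.
tidy = sorted_within.reset_index(drop=True)
print(tidy)
```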

Alternatively, we can sort first and then group:

(df
 .sort_values('col_2', ignore_index=True)
 .groupby('col_1')
 .apply(print))

Here is how this works:

  • Grouping a data frame via groupby() preserves the order of the rows within each group. Therefore,
    • We can first sort by the sorting column of interest, which is done here via sort_values('col_2'),
    • and then group, which is done here via groupby('col_1').
  • A DataFrameGroupBy object can’t be displayed like a regular DataFrame object, so we use apply(print) to print each group (and verify that the rows within each group are indeed sorted).
  • If we have the flexibility to sort before grouping, we can follow this solution, which is simpler than using apply() to sort individual groups as we did above.
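As a rough check of the sort-then-group approach, the sketch below iterates over the grouped object instead of printing it and confirms that each group is in ascending order of col_2. The toy data and the names groups and all_sorted are hypothetical; grouping is by col_1, as in the example stated for this section.

```python
import pandas as pd

# Hypothetical toy data: col_1 defines the groups, col_2 is the sorting column.
df = pd.DataFrame({
    'col_1': ['b', 'a', 'b', 'a', 'b'],
    'col_2': [3, 2, 1, 5, 2],
})

# Sort first, then group; rows keep their sorted order inside each group.
grouped = (df
 .sort_values('col_2', ignore_index=True)
 .groupby('col_1'))

# Iterating over a DataFrameGroupBy yields (key, sub data frame) pairs.
groups = {key: group['col_2'].tolist() for key, group in grouped}
print(groups)

# Every group should be in ascending order of col_2.
all_sorted = all(group['col_2'].is_monotonic_increasing for _, group in grouped)
print(all_sorted)
```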