Column Selection

In this page, we show how to identify the columns to which we will apply data aggregation operations.

We will cover the following scenarios:

  • All Columns where we cover how to apply a data aggregation operation to each column in a data frame.
  • Explicit Selection where we cover how to apply a data aggregation operation to each of a set of explicitly selected columns of a data frame e.g. selecting columns by spelling out their names.
  • Implicit Selection where we cover how to apply a data aggregation operation to each of a set of implicitly selected columns of a data frame e.g. selecting columns whose names contain a certain substring.
  • Exclude Columns where we cover how to apply a data aggregation operation to each column of a data frame except for a set of excluded columns.

This section is complemented by

  • Function Specification where we cover how to specify one or more data aggregation operations to apply to the selected set of columns.
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data aggregation operations.

Note: In this page, we demonstrate data aggregation in a context where the data frame is grouped by one column.

  • We can drop the call to groupby() to aggregate over the entire data frame without grouping, in which case the output will be a Series. See Basic Aggregation.
  • We can also group by multiple columns. See Grouping. Both variations are illustrated in the sketch below.
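
To make these two variations concrete, here is a minimal sketch using a small hypothetical data frame (the column names and values are made up for illustration):

import pandas as pd

# a small hypothetical data frame for illustration
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': ['x', 'x', 'y', 'z'],
    'col_3': [1, 2, 2, 3],
})

# without groupby(): aggregate over the entire data frame; the output is a Series
s = df.agg('nunique')   # col_1 -> 2, col_2 -> 3, col_3 -> 3

# group by multiple columns by passing a list of column names to groupby()
df_2 = df.groupby(['col_1', 'col_2']).agg('nunique')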

All Columns

We wish to apply a data aggregation operation to each column of a data frame.

In this example, we have a data frame df and we wish to count the number of unique values in each column for each group where the groups are defined by the values of the column col_1.

df_2 = df.groupby('col_1').agg('nunique')

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • We call agg() on the entire grouped data frame (DataFrameGroupBy object) produced by df.groupby('col_1') which means that the data aggregation operation passed to agg(), in this case nunique(), is going to be applied to each column in each group separately.
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and a column for each column in the original df data frame containing the number of unique values in that column for the corresponding group of rows, as illustrated in the sketch below.
  • The output columns have the names of the original columns. See Output Naming for how to specify the names of the output columns.
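
Here is a minimal, self-contained sketch of this behavior; the data frame below is hypothetical and is only meant to make the expected output concrete:

import pandas as pd

# a small hypothetical data frame for illustration
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': ['p', 'q', 'p', 'p'],
    'col_3': [1, 1, 2, 3],
})

# count the unique values of every other column within each group of col_1
df_2 = df.groupby('col_1').agg('nunique')

# expected output (col_1 becomes the index):
#        col_2  col_3
# col_1
# a          2      1
# b          1      2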

Alternative: Via Pandas Convenience Function

df_2 = df.groupby('col_1').nunique()

Here is how this works:

  • This code performs the same data aggregation operation as the primary solution above (the sketch below verifies that the two forms match). However, we use the Pandas convenience data aggregation method nunique() instead of agg().
  • In Pandas, the output of groupby() (a DataFrameGroupBy object) has a set of convenience methods to carry out the most common data aggregation operations which can be used instead of agg(). One such method is nunique(). Pandas applies the method to each column for each group behind the scenes. See Common Aggregation Operations.
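
As a quick check on the same hypothetical data frame used in the previous sketch, the convenience method and the agg() call produce identical results:

import pandas as pd

# the same hypothetical data frame as in the previous sketch
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': ['p', 'q', 'p', 'p'],
    'col_3': [1, 1, 2, 3],
})

# the convenience method nunique() and agg('nunique') return the same data frame
assert df.groupby('col_1').nunique().equals(df.groupby('col_1').agg('nunique'))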

Explicit Selection

We wish to apply a data aggregation operation to each of a set of explicitly selected columns.

In this example, we have a data frame df and we wish to count the number of unique values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('nunique'))

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • In ['col_2', 'col_4'], we use the bracket operator to select the desired columns by name. It is worth noting that we can’t use the selector loc[] on the output of groupby(). See Basic Selection for a coverage of explicit column selection scenarios.
  • The subsequent call to agg() applies the function nunique() to the selected columns (col_2 and col_4) for each group.
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and two columns col_2 and col_4 containing the number of unique values in the corresponding column for the corresponding group of rows, as illustrated in the sketch below.
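
Here is a minimal sketch of explicit selection on a hypothetical four-column data frame (the names and values are made up for illustration):

import pandas as pd

# a small hypothetical data frame for illustration
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 2],
    'col_3': ['x', 'y', 'y'],
    'col_4': [10.0, 10.0, 20.0],
})

# aggregate only the explicitly selected columns col_2 and col_4
df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('nunique'))

# expected output:
#        col_2  col_4
# col_1
# a          2      1
# b          1      1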

Alternative: Select then Group

df_2 = (df
        .loc[:, ['col_1', 'col_2', 'col_4']]
        .groupby('col_1')
        .nunique())

Here is how this works:

  • This works similarly to the primary solution above except that we select via loc[] then group via groupby(). We must include the grouping column, here col_1, in our selection.
  • When we call nunique(), the columns available for aggregation are the selected non-grouping columns, i.e. col_2 and col_4, as shown in the sketch below.
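
The following sketch (again on a hypothetical data frame) shows the select-then-group order and why the grouping column must be part of the selection:

import pandas as pd

# a small hypothetical data frame for illustration
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 2],
    'col_3': ['x', 'y', 'y'],
    'col_4': [10.0, 10.0, 20.0],
})

# the grouping column col_1 must be part of the selection ...
df_2 = (df
        .loc[:, ['col_1', 'col_2', 'col_4']]
        .groupby('col_1')
        .nunique())

# ... otherwise groupby() cannot find it in the selected frame
# df.loc[:, ['col_2', 'col_4']].groupby('col_1')   # raises KeyError: 'col_1'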

Implicit Selection

We wish to apply a data aggregation operation to each column in a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we have a data frame df and we wish to compute the sum of values of all numerical columns for each group, where the groups are defined by the values of the column col_1.

selected_cols = df.select_dtypes('number').columns
df_2 = (df
        .groupby('col_1')[selected_cols]
        .agg('sum'))

Here is how this works:

  • In df.select_dtypes('number').columns, we select all columns whose data type is numeric (i.e. integer or floating point). See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and criteria satisfied by the column’s data.
  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • In [selected_cols], we use the bracket operator to select the columns that were identified earlier for each group.
  • The subsequent call to agg() applies the function sum() to each of the selected columns for each group.
  • The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each numeric column in the original df data frame containing the sum of the values in that column for the corresponding group of rows, as illustrated in the sketch below.
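
Here is a minimal sketch of implicit selection by data type on a hypothetical data frame that mixes numeric and string columns:

import pandas as pd

# a small hypothetical data frame with a mix of data types
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': ['x', 'y', 'z'],
    'col_4': [0.5, 1.5, 2.5],
})

# implicitly select the numeric columns, then sum them for each group
selected_cols = df.select_dtypes('number').columns   # ['col_2', 'col_4']
df_2 = (df
        .groupby('col_1')[selected_cols]
        .agg('sum'))

# expected output:
#        col_2  col_4
# col_1
# a          3    2.0
# b          3    2.5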

Alternative: Via Pandas Convenience Function

selected_cols = df.select_dtypes('number').columns
df_2 = df.groupby('col_1')[selected_cols].sum()

Here is how this works:

  • This code performs the same data aggregation operation as the primary solution above. However, we use the Pandas convenience data aggregation method sum() instead of agg().
  • In Pandas, the output of groupby() (a DataFrameGroupBy object) has a set of convenience methods to carry out the most common data aggregation operations which can be used instead of agg(). One such method is sum(). Pandas applies the method to each column for each group behind the scenes. See Common Aggregation Operations.

Alternative: Select then Group

In this example, we have a data frame df and we wish to compute the sum of values of columns whose names start with ‘x_’ for each group, where the groups are defined by the values of the column x_col_1.

df_2 = (df
        .loc[:, df.columns.str.startswith('x_')]
        .groupby('x_col_1')
        .sum())

Here is how this works:

  • This works similarly to the primary solution above except that we select via loc[] then group via groupby(). In this example, the grouping column happens to be returned by the same implicit column selection operation.
  • When we call sum(), the columns available for aggregation are the selected non-grouping columns, i.e. the columns whose names start with ‘x_’ other than the grouping column x_col_1, as shown in the sketch below.
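
A minimal sketch of this approach on a hypothetical data frame where the grouping column and the columns to aggregate all share the ‘x_’ name prefix:

import pandas as pd

# a small hypothetical data frame where some column names start with 'x_'
df = pd.DataFrame({
    'x_col_1': ['a', 'a', 'b'],
    'x_col_2': [1, 2, 3],
    'x_col_3': [0.5, 1.5, 2.5],
    'other': ['p', 'q', 'r'],
})

# keep only the 'x_' columns (including the grouping column), then group and sum
df_2 = (df
        .loc[:, df.columns.str.startswith('x_')]
        .groupby('x_col_1')
        .sum())

# expected output:
#          x_col_2  x_col_3
# x_col_1
# a              3      2.0
# b              3      2.5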

Exclude Columns

We wish to apply a data aggregation operation to all but a set of excluded columns.

In this example, we have a data frame df and for each group, we wish to count the number of unique values of all columns except col_2 and col_4, where the groups are defined by the values of the column col_1.

selected_cols = \
    df.columns.difference(['col_1', 'col_2', 'col_4'])
df_2 = (df
        .groupby('col_1')[selected_cols]
        .agg('nunique'))

Here is how this works:

  • In df.columns.difference(['col_1', 'col_2', 'col_4']), we obtain the names of all columns except those we wish to exclude (col_2 and col_4). We also exclude the grouping column col_1 so that it is not selected again for aggregation. See Exclude Columns for a coverage of column exclusion scenarios.
  • The rest of the code works as described in Explicit Selection above.
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and a column for each column in the original data frame df excluding col_1, col_2, and col_4, containing the number of unique values in that column for the corresponding group of rows, as illustrated in the sketch below.
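
Here is a minimal sketch of column exclusion on a hypothetical five-column data frame:

import pandas as pd

# a small hypothetical data frame for illustration
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 2],
    'col_3': ['x', 'y', 'y'],
    'col_4': [10.0, 10.0, 20.0],
    'col_5': [True, False, False],
})

# all columns except the excluded columns and the grouping column
selected_cols = \
    df.columns.difference(['col_1', 'col_2', 'col_4'])   # ['col_3', 'col_5']
df_2 = (df
        .groupby('col_1')[selected_cols]
        .agg('nunique'))

# expected output:
#        col_3  col_5
# col_1
# a          2      2
# b          1      1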