Column Selection for Implicit Renaming

In Function Specification for Implicit Renaming, we cover how to specify the functions to be applied to the current column names to generate the desired names. In this section, we show how to select the column(s) to be renamed when we only wish to rename a subset of columns.

We will cover the following scenarios:

  • All Columns, where we cover how to apply a function to rename all columns.
  • Explicit Selection, where we cover how to apply a function to rename each of a set of explicitly selected columns of a data frame, e.g. selecting columns by spelling out their names or positions.
  • Implicit Selection, where we cover how to apply a function to rename each of a set of implicitly selected columns of a data frame, e.g. selecting columns whose names contain a certain substring.
  • Exclude Columns, where we cover how to apply a function to rename each column of a data frame except for a set of excluded columns.

For more detailed coverage of column selection, see Selection.

All Columns

We wish to rename all columns of a data frame by applying a function to their current names to generate the desired names.

In this example, we wish to convert the names of all columns to lowercase.

df_2 = df.rename(columns=str.lower)

Here is how this works:

  • The columns argument of rename() accepts a function to which the existing names of all columns are passed one by one.
  • We pass to rename() the function itself (without calling it) that we wish to apply to each column name, which here is the built-in Python string method str.lower().
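
As a minimal runnable sketch of the above, the snippet below builds a small data frame and lowercases all of its column names. The data frame and its column names are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical data frame; the column names are made up for illustration.
df = pd.DataFrame({'ID': [1, 2], 'First-Name': ['a', 'b'], 'AGE': [30, 40]})

# rename() calls str.lower once per column name and returns a new data frame.
df_2 = df.rename(columns=str.lower)

print(list(df_2.columns))  # lowercased names
print(list(df.columns))    # the original data frame is left unchanged
```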

Explicit Selection

We wish to rename some columns of a data frame by applying a function to their current names to generate the desired names.

In this example, we wish to rename the columns at positions 1 and 2 (positions are zero-based, so these are the second and third columns) by lowering the case of their current names and replacing dash separators ‘-’ with underscores ‘_’.

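# Names of the columns at zero-based positions 1 and 2.
selected_cols = df.columns[[1, 2]]
# Lowercase each selected name and replace dashes with underscores.
renamed_cols = selected_cols.str.lower().str.replace('-', '_')
# Map old names to new names; columns not in the mapping keep their names.
df_2 = df\
    .rename(columns=dict(zip(selected_cols, renamed_cols)))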

Here is how this works:

  • The approach we adopt here has three steps:
    • Obtain the names of the selected columns as an Index of strings, which we do here via df.columns[[1, 2]]. See Basic Selection.
    • Apply the renaming logic to that Index, which we do here via the string methods str.lower() and str.replace(). See Function Specification for Implicit Renaming and String Operations.
    • Use rename() to map from the original names (selected_cols) to the modified names (renamed_cols). See Map Names.
  • We use the expression dict(zip(selected_cols, renamed_cols)) to convert the two sequences of names into a dictionary whose keys are the current column names and whose values are the desired column names. This is one of the formats that rename() accepts.
  • The output df_2 is a copy of the input data frame with the renaming logic applied only to the selected columns.
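
The steps above can be sketched end to end as follows. The data frame is a hypothetical example, and the intermediate variable mapping is introduced here only to make the dictionary passed to rename() visible.

```python
import pandas as pd

# Hypothetical data frame; the column names are made up for illustration.
df = pd.DataFrame({
    'ID': [1, 2],
    'First-Name': ['a', 'b'],
    'Last-Name': ['c', 'd'],
    'AGE': [30, 40],
})

# Select by zero-based position: the second and third columns.
selected_cols = df.columns[[1, 2]]

# Apply the renaming logic to the selected names only.
renamed_cols = selected_cols.str.lower().str.replace('-', '_')

# Build the old-name -> new-name dictionary that rename() accepts.
mapping = dict(zip(selected_cols, renamed_cols))
df_2 = df.rename(columns=mapping)

print(mapping)
print(list(df_2.columns))
```

Columns that do not appear in the dictionary, here ID and AGE, keep their current names.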

Extension: Custom Function

def m_rename(df, select_fn, rename_fn):
    selected_cols = select_fn(df)
    renamed_cols = rename_fn(selected_cols)
    return df.rename(columns=dict(zip(selected_cols, renamed_cols)))

df_2 = df \
    .pipe(m_rename,
          lambda x: x.columns[[1, 2]],
          lambda x: x.str.lower().str.replace('-', '_'))

Here is how this works:

  • In case we need to run this multiple times, a good idea is to wrap the renaming logic into a function and use the Pandas pipe() method to rename selected columns implicitly in a chained manner.
  • The custom function m_rename() expects the following:
    • df A data frame whose columns (or some of them) are to be renamed.
    • select_fn A function or lambda function that can be applied to the data frame to obtain the names of the columns to be renamed as strings.
    • rename_fn A function or lambda function that can be applied to the column names returned by select_fn to obtain the desired column names.
  • Renaming works as described in the primary solution above.
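
To illustrate why wrapping the logic in a function can pay off, here is a sketch that chains two m_rename() calls with different selection and renaming functions. The data frame is hypothetical, and the second call (selecting all-uppercase names via str.isupper()) is an assumption added for illustration rather than part of the example above.

```python
import pandas as pd

def m_rename(df, select_fn, rename_fn):
    selected_cols = select_fn(df)
    renamed_cols = rename_fn(selected_cols)
    return df.rename(columns=dict(zip(selected_cols, renamed_cols)))

# Hypothetical data frame; the column names are made up for illustration.
df = pd.DataFrame({
    'ID': [1, 2],
    'First-Name': ['a', 'b'],
    'Last-Name': ['c', 'd'],
    'AGE': [30, 40],
})

df_2 = (
    df
    # First pass: columns at positions 1 and 2, as in the example above.
    .pipe(m_rename,
          lambda x: x.columns[[1, 2]],
          lambda x: x.str.lower().str.replace('-', '_'))
    # Second pass (illustrative): columns whose names are all uppercase.
    .pipe(m_rename,
          lambda x: x.columns[x.columns.str.isupper()],
          lambda x: x.str.lower())
)

print(list(df_2.columns))
```

Each pipe() call receives the data frame returned by the previous step, so the second selection function sees the names produced by the first pass.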

Implicit Selection

We wish to rename a subset of the columns of the data frame. We wish to select that subset of columns implicitly; i.e. we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we wish to replace the suffix ‘_num’ with the suffix ‘_int’ for all columns whose data type is integer.

selected_cols = df.select_dtypes('integer').columns
renamed_cols = selected_cols.str.replace('_num', '_int')
df_2 = df.rename(columns=dict(zip(selected_cols, renamed_cols)))

Here is how this works:

  • We follow the same approach described in Explicit Selection above.
  • In df.select_dtypes('integer').columns, we obtain the names of the columns whose data type is integer. See Implicit Selection for coverage of the most common scenarios of implicit column selection, including selection by name pattern, by data type, and by criteria satisfied by the column’s data.
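
The following sketch runs the approach on a hypothetical data frame. It also contrasts the plain replacement used above, which rewrites every occurrence of ‘_num’ in a name, with a regular expression anchored at the end of the name, which rewrites only the suffix. The anchored variant and the variable names plain and suffix_only are assumptions added for illustration; for names where ‘_num’ occurs only as a suffix, both give the same result.

```python
import pandas as pd

# Hypothetical data frame; the column names are made up for illustration.
df = pd.DataFrame({
    'id_num': [1, 2],            # integer
    'score_num': [0.5, 1.5],     # float, so not selected
    'n_num_items_num': [3, 4],   # integer, with '_num' also mid-name
    'label': ['a', 'b'],
})

# Names of the columns whose data type is integer.
selected_cols = df.select_dtypes('integer').columns

# Plain replacement: every occurrence of '_num' is rewritten.
plain = selected_cols.str.replace('_num', '_int')

# Anchored regular expression: only a trailing '_num' is rewritten.
suffix_only = selected_cols.str.replace('_num$', '_int', regex=True)

df_2 = df.rename(columns=dict(zip(selected_cols, suffix_only)))

print(list(plain))
print(list(suffix_only))
print(list(df_2.columns))
```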

Extension: Custom Function

def m_rename(df, select_fn, rename_fn):
    selected_cols = select_fn(df)
    renamed_cols = rename_fn(selected_cols)
    return df.rename(columns=dict(zip(selected_cols, renamed_cols)))

df_2 = df \
    .pipe(m_rename,
          lambda x: x.select_dtypes('integer').columns,
          lambda x: x.str.replace('_num', '_int'))

Here is how this works:

See “Extension: Custom Function” under Explicit Selection above.

Exclude Columns

We wish to apply a function to rename all but a set of columns.

In this example, we wish to add the prefix ‘attr_’ to all columns except columns whose current name includes the string ‘_id’.

selected_cols = df.columns[~df.columns.str.contains('_id')]
renamed_cols = selected_cols.map('attr_{}'.format)
df_2 = df.rename(columns=dict(zip(selected_cols, renamed_cols)))

Here is how this works:

  • We follow the same approach described in Explicit Selection above.
  • In df.columns[~df.columns.str.contains('_id')], we obtain the names of all columns except those whose name contains the substring ‘_id’. See Exclude Columns for coverage of column exclusion scenarios, all of which can be used for implicit renaming.
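
A runnable sketch of the exclusion approach on a hypothetical data frame is shown below. As a possible alternative, it also passes a single function to rename() that leaves excluded names untouched; that variant is an assumption added for comparison, not the approach described above.

```python
import pandas as pd

# Hypothetical data frame; the column names are made up for illustration.
df = pd.DataFrame({
    'user_id': [1, 2],
    'age': [30, 40],
    'city': ['x', 'y'],
    'order_id': [7, 8],
})

# All column names except those containing '_id'.
selected_cols = df.columns[~df.columns.str.contains('_id')]

# Add the prefix to each selected name.
renamed_cols = selected_cols.map('attr_{}'.format)

df_2 = df.rename(columns=dict(zip(selected_cols, renamed_cols)))

# Alternative sketch: one function that returns excluded names unchanged.
df_3 = df.rename(columns=lambda c: c if '_id' in c else f'attr_{c}')

print(list(df_2.columns))
print(list(df_3.columns))
```

The dictionary-based version keeps selection and renaming as separate steps, which matches the other scenarios in this section; the single-function version is shorter when the exclusion rule is a simple test on the name.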

Extension: Custom Function

def m_rename(df, select_fn, rename_fn):
    selected_cols = select_fn(df)
    renamed_cols = rename_fn(selected_cols)
    return df.rename(columns=dict(zip(selected_cols, renamed_cols)))

df_2 = df \
    .pipe(m_rename,
          lambda x: x.columns[~x.columns.str.contains('_id')],
          lambda x: x.map('attr_{}'.format))

Here is how this works:

See “Extension: Custom Function” under Explicit Selection above.
