Common Transformation Scenarios

The most common data transformations involve applying a vectorized function (i.e. one that operates on an entire column at a time) to one or more columns of a data frame to output a new column.

We will cover three common scenarios for creating new columns:

  • Scalars where we create a new column entirely composed of the same value. This may be a starting point of a larger transformation or to identify rows of a data frame before appending rows from multiple data frames.
  • Operations where we create a new column by applying arithmetic or logical operations on existing columns. For instance, creating a new column that is the ratio of two existing columns.
  • Functions where we create a new column by applying one or more functions to existing columns. For instance, creating a new column that is the log of an existing column.

Scalars

We wish to create a new column that is entirely composed of the same value.

In this example, we create a new column col_3 that has the value 2022 for all rows of the data frame df.

df_2 = df\
  .assign(col_3 = 2022)

Here is how this works:

  • The preferred approach to data transformation in Pandas is via the data frame method assign().
  • assign() acts on a data frame, which in this case is df, and expects a data transformation expression, which in this case is col_3 = 2022.
  • Let’s take a closer look at the data transformation expression col_3 = 2022:
    • col_3 is the name of the output column unquoted (without any quotes).
    • = is the assignment operator assigning the output of the expression on the right side to the variable on the left side.
    • 2022 is the data transformation to be carried out which here is simply to assign the same scaler value to each row.
  • The output of assign() is a copy of the data frame with the new column(s) added or with existing data frames overwritten if the transformed column name is the same as any existing column names.
  • We can assign a scaler value of any data type including numeric, string, logical, categorical, and datetime.

Alternatively:

df['col_3'] = 2022

Here is how this works:

  • df['col_3'] will add a new column col_3 to the data frame df or overwrite an existing column if a column named col_3 already exists.
  • Each value of the column df['col_3'] will be set to the scaler value 2022.
  • Note that this modifies the original data frame and is not appropriate when we wish to chain multiple data transformation operations.
  • assign() is the preferred approach to data manipulation because it doesn't modify the original data frame and can be used as part of a chain of data manipulation operations.

Operations

We wish to create a new column by applying arithmetic or logical operations on existing columns.

In this example, we create a new column col_3 that is the ratio of two existing columns col_1 and col_2.

df_2 = df\
  .assign(
     col_3 = df['col_1'] / df['col_2']
   )

Here is how this works:

  • col_3: The name of the output column that will be created (or overwritten if a column by the same name exists in the data frame) which is specified unquoted (without any quotes).
  • df['col_1'] / df['col_2']: The data transformation to be carried out which is to divide the value of the column col_1 for each row by the value of the column col_2 of the same row. Unlike the output column name, the input column name(s) must be referred to through the data frame name and must be quoted.
  • assign() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on assign().
  • While there are operations that work for different data types, most often it is arithmetic operations (such as +, -, *, and /) or comparison operations (such as >, <, and ==) applied to numeric data which we cover in Numeric Operations.

Alternatively:

df['col_3'] = df['col_1'] / df['col_2']

Here is how this works:

  • df['col_3'] will add a new column col_3 to the data frame df or overwrite an existing column if a column named col_3 already exists.
  • The output of the expression df['col_1'] / df['col_2'] will be assigned to the column df['col_3'].
  • Note that this modifies the original data frame and is not appropriate when we wish to chain multiple data transformation operations.
  • assign() is the preferred approach to data manipulation because it doesn't modify the original data frame and can be used as part of a chain of data manipulation operations.

Functions

Built-In Functions

We wish to create a new column by applying one or more built-in functions on existing columns.

In this example, we create a new column col_3 that is the log of an existing column col_1.

df_2 = df\
  .assign(
     col_3 = np.log10(df['col_1'])
   )

Here is how this works:

  • col_3 is the name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)
  • np.log10(x['col_1']) is the data transformation to be carried out which is to apply the NumPy function np.log10() to take the log base 10 of each value of the column col_1.
  • assign() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on assign().
  • We cover functions for manipulating different data types in their respective operations sections; Numeric , String , Logical , Factor, and Datetime Operations.

Alternatively:

df['col_3'] = np.log10(df['col_1'])

Here is how this works:

See the description of the alternative solution under “Operations” above.

Custom Functions

We wish to create a new column by applying one or more custom functions on existing columns.

In this example, we create a new column col_3 via a custom function that accepts a numerical column and returns a column of the same size that is the ratio of each value of the input column to the sum of all values of that column.

def m_fun(var):
    var_2 = var / sum(var)
    return var_2

df_2 = df\
    .assign(col_3 = m_fun(df['col_1']))

Here is how this works:

  • We create a custom function m_fun() which accepts a column passed to its var argument.
  • In var / sum(var), inside the custom function, we compute the ratio of each value of the input column to the sum of all values of the input column.
  • We can call the custom function as part of a data transformation expression inside assign() just like we would a built-in function (see “Built-In Functions” above).

Alternatively:

def m_fun(var):
    var_2 = var / sum(var)
    return var_2


df['col_3'] = m_fun(df['col_1'])

Here is how this works:

See the description of the alternative solution under “Operations” above.

PYTHON
I/O