Common Transformation Scenarios

The most common data transformations involve applying a vectorized function (i.e. one that operates on an entire column at a time) to one or more columns of a data frame to output a new column.

We will cover three common scenarios for creating new columns:

Scalars where we create a new column entirely composed of the same value. This may be a starting point of a larger transformation or to identify rows of a data frame before appending rows from multiple data frames.
Operations where we create a new column by applying arithmetic or logical operations on existing columns. For instance, creating a new column that is the ratio of two existing columns.
Functions where we create a new column by applying one or more functions to existing columns. For instance, creating a new column that is the log of an existing column.

Scalars

We wish to create a new column that is entirely composed of the same value.

In this example, we create a new column col_3 that has the value 2022 for all rows of the data frame df.

df_2 = df\
  .assign(col_3 = 2022)

Here is how this works:

The preferred approach to data transformation in Pandas is via the data frame method assign().
assign() acts on a data frame, which in this case is df, and expects a data transformation expression, which in this case is col_3 = 2022.
Let’s take a closer look at the data transformation expression col_3 = 2022:
- col_3 is the name of the output column unquoted (without any quotes).
- = is the assignment operator assigning the output of the expression on the right side to the variable on the left side.
- 2022 is the data transformation to be carried out which here is simply to assign the same scaler value to each row.
The output of assign() is a copy of the data frame with the new column(s) added or with existing data frames overwritten if the transformed column name is the same as any existing column names.
We can assign a scaler value of any data type including numeric, string, logical, categorical, and datetime.

Alternatively:

df['col_3'] = 2022

Here is how this works:

df['col_3'] will add a new column col_3 to the data frame df or overwrite an existing column if a column named col_3 already exists.
Each value of the column df['col_3'] will be set to the scaler value 2022.
Note that this modifies the original data frame and is not appropriate when we wish to chain multiple data transformation operations.
assign() is the preferred approach to data manipulation because it doesn't modify the original data frame and can be used as part of a chain of data manipulation operations.

Operations

We wish to create a new column by applying arithmetic or logical operations on existing columns.

In this example, we create a new column col_3 that is the ratio of two existing columns col_1 and col_2.

df_2 = df\
  .assign(
     col_3 = df['col_1'] / df['col_2']
   )

Here is how this works:

col_3: The name of the output column that will be created (or overwritten if a column by the same name exists in the data frame) which is specified unquoted (without any quotes).
df['col_1'] / df['col_2']: The data transformation to be carried out which is to divide the value of the column col_1 for each row by the value of the column col_2 of the same row. Unlike the output column name, the input column name(s) must be referred to through the data frame name and must be quoted.
assign() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on assign().
While there are operations that work for different data types, most often it is arithmetic operations (such as +, -, *, and /) or comparison operations (such as >, <, and ==) applied to numeric data which we cover in Numeric Operations.

Alternatively:

df['col_3'] = df['col_1'] / df['col_2']

Here is how this works:

df['col_3'] will add a new column col_3 to the data frame df or overwrite an existing column if a column named col_3 already exists.
The output of the expression df['col_1'] / df['col_2'] will be assigned to the column df['col_3'].
Note that this modifies the original data frame and is not appropriate when we wish to chain multiple data transformation operations.
assign() is the preferred approach to data manipulation because it doesn't modify the original data frame and can be used as part of a chain of data manipulation operations.

Functions

Built-In Functions

We wish to create a new column by applying one or more built-in functions on existing columns.

In this example, we create a new column col_3 that is the log of an existing column col_1.

df_2 = df\
  .assign(
     col_3 = np.log10(df['col_1'])
   )

Here is how this works:

col_3 is the name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)
np.log10(x['col_1']) is the data transformation to be carried out which is to apply the NumPy function np.log10() to take the log base 10 of each value of the column col_1.
assign() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on assign().
We cover functions for manipulating different data types in their respective operations sections; Numeric , String , Logical , Factor, and Datetime Operations.

Alternatively:

df['col_3'] = np.log10(df['col_1'])

Here is how this works:

See the description of the alternative solution under “Operations” above.

Custom Functions

We wish to create a new column by applying one or more custom functions on existing columns.

In this example, we create a new column col_3 via a custom function that accepts a numerical column and returns a column of the same size that is the ratio of each value of the input column to the sum of all values of that column.

def m_fun(var):
    var_2 = var / sum(var)
    return var_2

df_2 = df\
    .assign(col_3 = m_fun(df['col_1']))

Here is how this works:

We create a custom function m_fun() which accepts a column passed to its var argument.
In var / sum(var), inside the custom function, we compute the ratio of each value of the input column to the sum of all values of the input column.
We can call the custom function as part of a data transformation expression inside assign() just like we would a built-in function (see “Built-In Functions” above).

Alternatively:

def m_fun(var):
    var_2 = var / sum(var)
    return var_2


df['col_3'] = m_fun(df['col_1'])

Here is how this works:

See the description of the alternative solution under “Operations” above.

Optima.io Reference beta

Common Transformation Scenarios

Scalars

Operations

Functions