Conditional Transformation

We wish to carry out data transformation operations conditionally. In other words, the data transformation operation we wish to carry out is not the same for all rows, rather it is different based on one or more conditions.

We will cover two scenarios:

  • One Condition where we cover conditional transformation scenarios involving one condition.
  • Multiple Conditions where we cover conditional transformation scenarios involving multiple conditions.

One Condition

We wish to perform a data transformation operation that involves one condition.

In this example, we wish to create a new numerical column col_3 whose value is a function of the value of the numerical column col_1. The value of col_3 should be 50% higher than the corresponding value of col_1 if the value of col_1 is greater than 10. Otherwise, the value of col_3 should be just 20% higher than the corresponding value of col_1.

df_2 = df\
    .assign(
        col_3 = lambda x:
            np.where(x['col_1'] > 10,
                     x['col_1'] * 1.5,
                     x['col_1'] * 1.2))

Here is how this works:

  • In order to carry out a data transformation operation in a conditional manner, we can use where() from NumPy inside assign().
  • We use a lambda function, so we may refer to columns created earlier in the same assign() statement (more precisely; in the same data manipulation chain). Should that not be a concern, we can drop the lambda and refer to columns directly (see the first alternative solution below).
  • Most commonly, we pass to where() three arguments as follows:
    • A logical condition which in this example is x['col_1'] > 10.
    • A data transformation expression that provides the value to assign to rows for which the condition evaluates to True , which in this example is x['col_1'] * 1.5.
    • A data transformation expression that provides the value to assign to rows for which the condition evaluates to False , which in this example is x['col_1'] * 1.2.
  • To recap, the output column col_3 will take the value produced by x['col_1'] * 1.5 for rows where the condition x['col_1'] > 10 is True, and the values produced by x['col_1'] * 1.2 elsewhere.
  • See General Operations for a coverage of where() and other conditional statements.

Alternatively:

df['col_3'] = np.where(df['col_1'] > 10,
                       df['col_1'] * 1.5,
                       df['col_1'] * 1.2))

Here is how this works:

  • df['col_3'] will add a new column col_3 to the data frame df or overwrite an existing column if a column named col_3 already exists.
  • The output of the where() expression (see description above) will be assigned to the column df['col_3'].
  • Note that this modifies the original data frame. assign() is the preferred approach to data manipulation because it doesn't modify the original data frame. In addition, assign() can be used as part of a chain of data manipulation operations.
df.loc[df['col_1'] > 10, 'col_3'] = df['col_1'] * 1.5
df.loc[df['col_1'] < 10, 'col_3'] = df['col_1'] * 1.2

Here is how this works:

  • In general, we can modify a subset of a data frame by assigning to loc[].
  • We perform a data transformation operation that involves one condition with loc[] in two steps.
  • In both steps, the second argument to loc[] specifies the column that we wish to create (or overwrite) which here is 'col_3'. When assigning to loc[], we can specify columns that do not exist in the current data frame.
  • In the first expression, the first argument to loc[] is the True condition df['col_1'] > 10 and to that we assign the data transformation expression df['col_1'] * 1.5.
  • In the second expression, the first argument to loc[] is the False condition df['col_1'] < 10 and to that we assign the data transformation expression df['col_1'] * 1.2.
df['col_3'] = df['col_1']\
    .apply(lambda x: x * 1.5 if x > 10 else x * 1.2)

Here is how this works:

  • apply() applies a lambda function to each value of col_1.
  • Inside the lambda function we use a one line if-else statement (which is called a Python ternary operator).
  • The if-else statement returns col_1 * 1.5 if col_1 > 10 otherwise returns col_1 * 1.2.

Multiple Conditions

We wish to perform a data transformation operation that involves multiple conditions.

In this example, we wish to create a new numerical column col_3 whose value is a function of the values of the columns col_1 and col_2 and where the function that maps col_1 and col_2 to col_3 is different for different ranges of values of col_1 and col_2.

df_2 = df\
    .assign(
        col_3 = lambda x: np.select(
            [(x['col_1'] > 20) & (x['col_2'] > 2),
             (x['col_1'] > 15) & (x['col_2'] > 1),
             (x['col_1'] > 10) & (x['col_2'] > 0)],
            [x['col_1'] * 1.5,
             x['col_1'] * 1.2,
             x['col_1'] * 1.1],
            x['col_1']))

Here is how this works:

  • In order to carry out a data transformation operation in a conditional manner involving multiple conditions, we can use select() from NumPy inside assign().
  • While possible, it can get quite messy to use where() (or any of the other one condition solutions shown under “One Condition” above) to perform a conditional data transformation operation that involves multiple conditions. The recommended approach is to use select().
  • select() expects the following
    • A list of conditions
    • A corresponding list of expressions to execute or values to return
    • A default expression to execute or value to return should none of the conditions evaluate to True
  • We use a lambda function, so we may refer to columns created earlier in the same assign() statement. Should that not be a concern, we can drop the lambda and refer to columns directly (see the alternative solution below).
  • See General Operations for a coverage of select() and other conditional statements.

Alternatively:

df['col_3'] = np.select(
    [(df['col_1'] > 20) & (df['col_2'] > 2),
     (df['col_1'] > 15) & (df['col_2'] > 1),
     (df['col_1'] > 10) & (df['col_2'] > 0)],
    [df['col_1'] * 1.5, 
     df['col_1'] * 1.2,
     df['col_1'] * 1.1],
    df['col_1'])

Here is how this works:

  • While not generally recommended, if we don’t mind modifying the original data frame df, we can assign the output of the select() function directly to df[’col_3’].
  • See the first alternative solution under “One Condition” above.
PYTHON
I/O