We wish to carry out data transformation operations conditionally. In other words, the data transformation operation we wish to carry out is not the same for all rows, rather it is different based on one or more conditions.
We will cover two scenarios:
We wish to perform a data transformation operation that involves one condition.
In this example, we wish to create a new numerical column col_3
whose value is a function of the value of the numerical column col_1
. The value of col_3
should be 50% higher than the corresponding value of col_1
if the value of col_1
is greater than 10. Otherwise, the value of col_3
should be just 20% higher than the corresponding value of col_1
.
df_2 = df\
.assign(
col_3 = lambda x:
np.where(x['col_1'] > 10,
x['col_1'] * 1.5,
x['col_1'] * 1.2))
Here is how this works:
where()
from NumPy
inside assign()
.lambda
function, so we may refer to columns created earlier in the same assign()
statement (more precisely; in the same data manipulation chain). Should that not be a concern, we can drop the lambda
and refer to columns directly (see the first alternative solution below).where()
three arguments as follows:x['col_1'] > 10
.True
, which in this example is x['col_1'] * 1.5
.False
, which in this example is x['col_1'] * 1.2
.col_3
will take the value produced by x['col_1'] * 1.5
for rows where the condition x['col_1'] > 10
is True
, and the values produced by x['col_1'] * 1.2
elsewhere.where()
and other conditional statements.Alternatively:
df['col_3'] = np.where(df['col_1'] > 10,
df['col_1'] * 1.5,
df['col_1'] * 1.2))
Here is how this works:
df['col_3']
will add a new column col_3
to the data frame df
or overwrite an existing column if a column named col_3
already exists.where()
expression (see description above) will be assigned to the column df['col_3']
.assign()
is the preferred approach to data manipulation because it doesn't modify the original data frame. In addition, assign()
can be used as part of a chain of data manipulation operations.df.loc[df['col_1'] > 10, 'col_3'] = df['col_1'] * 1.5
df.loc[df['col_1'] < 10, 'col_3'] = df['col_1'] * 1.2
Here is how this works:
loc[]
.loc[]
in two steps.loc[]
specifies the column that we wish to create (or overwrite) which here is 'col_3'
. When assigning to loc[]
, we can specify columns that do not exist in the current data frame.loc[]
is the True
condition df['col_1'] > 10
and to that we assign the data transformation expression df['col_1'] * 1.5
.loc[]
is the False
condition df['col_1'] < 10
and to that we assign the data transformation expression df['col_1'] * 1.2
.df['col_3'] = df['col_1']\
.apply(lambda x: x * 1.5 if x > 10 else x * 1.2)
Here is how this works:
apply()
applies a lambda
function to each value of col_1
.lambda
function we use a one line if-else statement (which is called a Python ternary operator).col_1 * 1.5
if col_1 > 10
otherwise returns col_1 * 1.2
.We wish to perform a data transformation operation that involves multiple conditions.
In this example, we wish to create a new numerical column col_3
whose value is a function of the values of the columns col_1
and col_2
and where the function that maps col_1
and col_2
to col_3
is different for different ranges of values of col_1
and col_2
.
df_2 = df\
.assign(
col_3 = lambda x: np.select(
[(x['col_1'] > 20) & (x['col_2'] > 2),
(x['col_1'] > 15) & (x['col_2'] > 1),
(x['col_1'] > 10) & (x['col_2'] > 0)],
[x['col_1'] * 1.5,
x['col_1'] * 1.2,
x['col_1'] * 1.1],
x['col_1']))
Here is how this works:
select()
from NumPy
inside assign()
.where()
(or any of the other one condition solutions shown under “One Condition” above) to perform a conditional data transformation operation that involves multiple conditions. The recommended approach is to use select()
.select()
expects the followingTrue
lambda
function, so we may refer to columns created earlier in the same assign()
statement. Should that not be a concern, we can drop the lambda
and refer to columns directly (see the alternative solution below).select()
and other conditional statements.Alternatively:
df['col_3'] = np.select(
[(df['col_1'] > 20) & (df['col_2'] > 2),
(df['col_1'] > 15) & (df['col_2'] > 1),
(df['col_1'] > 10) & (df['col_2'] > 0)],
[df['col_1'] * 1.5,
df['col_1'] * 1.2,
df['col_1'] * 1.1],
df['col_1'])
Here is how this works:
df
, we can assign the output of the select()
function directly to df[’col_3’]
.