The most common data transformations involve applying a vectorized function (i.e. one that operates on an entire column at a time) to one or more columns of a data frame to output a new column.
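To make the idea of a vectorized function concrete, here is a minimal sketch. The data frame and its values are hypothetical, chosen only for illustration, and np.sqrt() simply stands in for any vectorized function.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': [1.0, 4.0, 9.0]})

# Vectorized: a single call receives the whole column and returns a column.
vectorized = np.sqrt(df['col_1'])

# Non-vectorized equivalent: math.sqrt() handles one value at a time,
# so we have to loop over the column ourselves.
looped = pd.Series([math.sqrt(v) for v in df['col_1']], index=df.index)
```

Both approaches produce the same values. The vectorized form is shorter and is the style used throughout this section.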
We will cover four common scenarios for creating new columns: scalars, operations, built-in functions, and custom functions.
Scalars
We wish to create a new column that is entirely composed of the same value.
In this example, we create a new column col_3 that has the value 2022 for all rows of the data frame df.
df_2 = df\
    .assign(col_3 = 2022)
Here is how this works:
assign() acts on a data frame, which in this case is df, and expects a data transformation expression, which in this case is col_3 = 2022.
In the expression col_3 = 2022:
col_3 is the name of the output column, written unquoted (without any quotes).
= assigns the output of the expression on the right side to the column named on the left side.
2022 is the data transformation to be carried out, which here is simply to assign the same scalar value to each row.
The output of assign() is a copy of the data frame with the new column(s) added, or with existing columns overwritten if the transformed column name is the same as any existing column name.
Alternatively:
df['col_3'] = 2022
Here is how this works:
Assigning to df['col_3'] adds a new column col_3 to the data frame df, or overwrites the existing column if a column named col_3 already exists.
Every row of df['col_3'] is set to the scalar value 2022.
assign() is the preferred approach to data manipulation because it doesn't modify the original data frame and can be used as part of a chain of data manipulation operations.
Operations
We wish to create a new column by applying arithmetic or logical operations on existing columns.
In this example, we create a new column col_3 that is the ratio of two existing columns col_1 and col_2.
df_2 = df\
    .assign(
        col_3 = df['col_1'] / df['col_2']
    )
Here is how this works:
col_3: The name of the output column that will be created (or overwritten if a column by the same name exists in the data frame), which is specified unquoted (without any quotes).
df['col_1'] / df['col_2']: The data transformation to be carried out, which is to divide the value of the column col_1 for each row by the value of the column col_2 of the same row. Unlike the output column name, the input column name(s) must be referred to through the data frame name and must be quoted.
assign() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on assign().
The operations can be arithmetic operations (such as +, -, *, and /) or comparison operations (such as >, <, and ==) applied to numeric data, which we cover in Numeric Operations.
Alternatively:
df['col_3'] = df['col_1'] / df['col_2']
Here is how this works:
Assigning to df['col_3'] adds a new column col_3 to the data frame df, or overwrites the existing column if a column named col_3 already exists.
The output of df['col_1'] / df['col_2'] is assigned to the column df['col_3'].
assign() is the preferred approach to data manipulation because it doesn't modify the original data frame and can be used as part of a chain of data manipulation operations.
Built-In Functions
We wish to create a new column by applying one or more built-in functions on existing columns.
In this example, we create a new column col_3 that is the base-10 log of an existing column col_1.
df_2 = df\
    .assign(
        col_3 = np.log10(df['col_1'])
    )
Here is how this works:
col_3 is the name of the output column that will be created (or overwritten if a column by the same name exists in the data frame).
np.log10(df['col_1']) is the data transformation to be carried out, which is to apply the NumPy function np.log10() to take the log base 10 of each value of the column col_1.
assign() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on assign().
Alternatively:
df['col_3'] = np.log10(df['col_1'])
Here is how this works:
See the description of the alternative solution under “Operations” above.
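The claim that assign() leaves the original data frame untouched and fits into a chain can be checked with a small sketch. The data below is hypothetical. Passing a callable (here a lambda) to assign() is not used in the examples above; it is shown because it lets a later step in a chain refer to a column created by an earlier step.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': [1.0, 10.0, 100.0], 'col_2': [2.0, 4.0, 5.0]})

# Chained assign() calls: each call returns a new data frame.
# The lambda receives the intermediate data frame, so it can see col_3,
# which does not exist in df itself.
df_2 = df\
    .assign(col_3 = np.log10(df['col_1']))\
    .assign(col_4 = lambda x: x['col_3'] * 2)

# df still has only its original columns.
original_columns = list(df.columns)

# The bracket form changes the data frame it is applied to,
# so we work on a copy here to keep df intact.
df_inplace = df.copy()
df_inplace['col_3'] = np.log10(df_inplace['col_1'])
```

After running this, df_2 has four columns while df still has two, and df_inplace has gained col_3 directly.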
Custom Functions
We wish to create a new column by applying one or more custom functions on existing columns.
In this example, we create a new column col_3 via a custom function that accepts a numerical column and returns a column of the same size that is the ratio of each value of the input column to the sum of all values of that column.
def m_fun(var):
    # Divide each value by the sum of all values in the column.
    var_2 = var / sum(var)
    return var_2

df_2 = df\
    .assign(col_3 = m_fun(df['col_1']))
Here is how this works:
We define a custom function m_fun() which accepts a column passed to its var argument.
In var / sum(var), inside the custom function, we compute the ratio of each value of the input column to the sum of all values of the input column.
We call the custom function inside assign() just like we would a built-in function (see “Built-In Functions” above).
Alternatively:
def m_fun(var):
    # Divide each value by the sum of all values in the column.
    var_2 = var / sum(var)
    return var_2

df['col_3'] = m_fun(df['col_1'])
Here is how this works:
See the description of the alternative solution under “Operations” above.
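As a quick check of the custom function, here is a minimal sketch on hypothetical data, with values made up so that the arithmetic is easy to verify by hand.

```python
import pandas as pd


def m_fun(var):
    # Divide each value by the sum of all values in the column.
    var_2 = var / sum(var)
    return var_2


# Hypothetical data, for illustration only; the values sum to 8.
df = pd.DataFrame({'col_1': [1.0, 3.0, 4.0]})

df_2 = df\
    .assign(col_3 = m_fun(df['col_1']))

# col_3 holds 1/8, 3/8 and 4/8, i.e. 0.125, 0.375 and 0.5.
shares = df_2['col_3'].tolist()
```

Because m_fun() receives the whole column, var.sum() could be used in place of sum(var). The two agree on complete data but differ when values are missing: the built-in sum() propagates NaN, while the pandas sum() method skips missing values by default.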