Non-Vectorized Transformation

Most data transformation involves operating on columns with vectorized functions, i.e. functions that accept a vector, perform an operation on each element of that vector, and return a vector of the same size as the input, eliminating the need for a loop. There are times, though, when we need to operate on rows in a non-vectorized manner, e.g. if we wish to obtain the mean value of some columns for each row.
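As a minimal sketch of the distinction, the snippet below contrasts a vectorized call on a column with a row-by-row call. The sample values are made up for illustration, and np.sqrt() simply stands in for any vectorized function.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1.0, 4.0, 9.0], 'col_2': [3.0, 6.0, 11.0]})

# Vectorized: np.sqrt() receives the whole column and returns
# a result of the same length, with no explicit loop.
col_1_sqrt = np.sqrt(df['col_1'])

# Non-vectorized: the lambda is called once per row (axis=1).
row_means = df.apply(
    lambda x: np.mean([x['col_1'], x['col_2']]), axis=1)

print(col_1_sqrt.tolist())
print(row_means.tolist())
```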

In this example, we have a data frame df with two numerical columns 'col_1' and 'col_2', and we wish to create a new column 'col_3' where each value is the mean of the values of the columns 'col_1' and 'col_2' for the same row. We also wish to create a column 'col_4' where each value is a string of the format "col_1 | col_2".

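import numpy as np

df_2 = df \
  .assign(
    # Row-wise mean of col_1 and col_2.
    col_3 = df.apply(
      lambda x: np.mean([x['col_1'], x['col_2']]), axis=1),
    # Row-wise string of the form "col_1 | col_2".
    col_4 = df.apply(
      lambda x: '{} | {}'.format(x['col_1'], x['col_2']), axis=1))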

Here is how this works:

  • To operate in a non-vectorized manner, i.e. to operate on rows, we use apply() and set axis=1.
  • The apply() function executes a function along an axis of the data frame, i.e. along columns or rows:
    • axis=0 (the default) instructs apply() to operate on columns, one at a time.
    • axis=1 instructs apply() to operate on rows, one at a time.
  • We use lambda functions to which apply() passes the rows of the data frame df one at a time. In the body of the lambda function, we can refer to the value of any column for that row, e.g. x['col_1'].
  • In lambda x: np.mean([x['col_1'], x['col_2']]), the mean of the values of the columns 'col_1' and 'col_2' is calculated for each row.
  • In lambda x: '{} | {}'.format(x['col_1'], x['col_2']), a string is constructed where the values of the columns 'col_1' and 'col_2' are inserted in the template '{} | {}'. See String Operations for more on format().
  • Depending on the signature of the function we wish to call, we may need to wrap the inputs in a list [] or pass them as individual function arguments. For instance, here we wrap the column value references in [] for np.mean() but pass them directly as arguments to format() because of their signatures:
    • The signature of np.mean() is np.mean(a, ...), i.e. it expects, as its first argument, a single array-like object holding the numerical values to be averaged.
    • The signature of format(), on the other hand, is format(self, *args, **kwargs), where *args accepts any number of individual values passed as positional arguments.
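The points above can be sketched on a small data frame. The sample values and the use of sum() are assumptions made for illustration; the aim is only to show what apply() passes along each axis and how the two call signatures differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [10, 20, 30]})

# axis=0 (the default): the function is called once per column,
# so the result has one entry per column.
col_sums = df.apply(lambda s: s.sum())

# axis=1: the function is called once per row,
# so the result has one entry per row.
row_sums = df.apply(lambda s: s.sum(), axis=1)

# np.mean() expects a single array-like object as its first argument,
# hence the list around the two values.
m = np.mean([1, 10])

# str.format() accepts any number of individual positional values.
s = '{} | {}'.format(1, 10)

print(col_sums.to_dict(), row_sums.tolist(), m, s)
```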

Alternatively:

df_2 = df\
  .assign(
    col_3 = df[['col_1', 'col_2']].apply(np.mean, axis=1),
    col_4 = df[['col_1', 'col_2']].apply(
      lambda x: '{} | {}'.format(x['col_1'], x['col_2']), axis=1))

Here is how this works:

  • In df[['col_1', 'col_2']], we use the bracket operator [] to subset the input data frame df into a data frame that only includes the columns that we wish to make available to the function inside apply().
  • We then call apply() on the resulting sub data frame while setting axis=1 so that apply() acts on rows one at a time.
  • As described above, the way we structure the function call inside apply() depends on the function's signature:
    • In apply(np.mean, axis=1), each row is passed (as a Series) to the first argument of the function mean(). Since mean() expects an array-like object, this works and the mean for each row is returned as desired.
    • In apply(lambda x: '{} | {}'.format(x['col_1'], x['col_2']), axis=1), we spell out the columns instead of letting apply() pass the entire row to the first argument of format() because, as described above, format() expects individual values passed to its arguments.

Should we not wish to use assign(), we could:

df_2 = df.copy()
df_2['col_3'] = df_2[['col_1', 'col_2']].apply(np.mean, axis=1)
df_2['col_4'] = df_2[['col_1', 'col_2']].apply(
    lambda x: '{} | {}'.format(x['col_1'], x['col_2']), axis=1)

Here is how this works:

  • We first create a deep copy of the data frame df via df.copy().
  • We then carry out the row-wise data transformation operations as described above.
  • assign() is the preferred approach to data manipulation because it handles copying the original data frame and can be used as part of a chain of data manipulation operations.
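A short sketch, using made-up sample values, of how the two approaches compare: both leave the original data frame df untouched and produce the same result. The names via_assign and via_copy are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1, 2], 'col_2': [3, 6]})

# assign() returns a new data frame and leaves df as it was.
via_assign = df.assign(
    col_3=df[['col_1', 'col_2']].apply(np.mean, axis=1))

# Without assign(), we copy explicitly and then add the column in place.
via_copy = df.copy()
via_copy['col_3'] = via_copy[['col_1', 'col_2']].apply(np.mean, axis=1)

print(list(df.columns))             # original columns only
print(via_assign.equals(via_copy))  # the two results match
```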

And since, in this example, both operations act on the same columns 'col_1' and 'col_2', we could:

df_2 = df.copy()
df_2[['col_3', 'col_4']] = df_2[['col_1', 'col_2']]\
    .apply([np.mean,
            lambda x: '{} | {}'.format(*x)],
           axis=1)

Here is how this works:

  • We can pass multiple functions to apply(), each of which will be carried out on each row of the calling data frame. See Implicit Transformation.
  • Since the columns in the calling data frame df_2[['col_1', 'col_2']] are in the same order as the arguments of format(), we can use Python's iterable unpacking (the * operator) to spare us some typing. What happens is that *x unpacks x and passes the two individual elements (the values of 'col_1' and 'col_2' for the current row) to the function as its two arguments.
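The unpacking step on its own can be sketched as follows. The sample values are made up, and this snippet only illustrates *x with a single function; it does not reproduce the multi-function call above.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1, 2], 'col_2': [10, 20]})

# A single row is a Series holding the values of col_1 and col_2 in order.
row = df.iloc[0]

# Spelling out the columns and unpacking the row give the same string,
# because the column order matches the order of the placeholders.
explicit = '{} | {}'.format(row['col_1'], row['col_2'])
unpacked = '{} | {}'.format(*row)

# The same unpacking applied to every row.
labels = df[['col_1', 'col_2']].apply(
    lambda x: '{} | {}'.format(*x), axis=1)

print(explicit, unpacked, labels.tolist())
```

Note that unpacking relies on column order, so selecting the columns explicitly before calling apply() keeps the result predictable.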