Combining

In this section, we cover the five common string combining scenarios that we come across in data manipulation contexts:

  1. Concatenating: Append two or more strings to each other.
  2. Collapsing: Concatenate all elements of a string vector into one string. A common application is to aggregate the values of a string column into a single string.
  3. Interpolating: Insert the value of one or more string variables into a string template.
  4. Repeating: Concatenate a string with itself a given number of times.
  5. Padding a string with a given character up to a given width.

Concatenating

We wish to concatenate the values of one column to the corresponding values of another column.

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 with a comma as a separator into a new column col_3.

df_2 = df.assign(
  col_3 = df['col_1'].str.cat(df['col_2'], sep=', '))

Here is how this works:

  • We use the str.cat() method from the str accessor set of string manipulation methods of Pandas Series to concatenate the values in the column col_1 with the corresponding values in the column col_2 with a separator ', ‘ in between.
  • The str.cat() method takes two arguments:
    • The first is the other column or columns to be concatenated with the column on which the method is called (in this case, col_2)
    • A named argument sep that specifies the separator to insert between the values in the concatenated columns; which in this case is sep = ', ‘.
  • The output is the original data frame df with an additional columncol_3 holding the concatenation of col_1 and col_2 with a comma and space (, ) as a separator.

Extension: Concatenate Fixed String

We wish to concatenate a fixed string to each value of a column of a data frame.

In this example, we wish to append a question mark to each value in the column col_1.

df_2 = df.assign(
  col_2 = df['col_1'] + '?')

Here is how this works:

  • We use the core Python string concatenation operator '+' to concatenate a fixes string to each value of the column col_1.
  • The output is the original data frame df with an additional columncol_2 holding the concatenation of '?' to each element of the column col_1.

Extension: More than 2 Columns

We wish to concatenate the values of multiple columns; i.e. concatenate the values of those columns for each row.

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 and col_3 with a comma as a separator into a new column col_4.

df_2 = df.assign(
    col_4 = df['col_1'].str.cat(
        [df['col_2'], df['col_3']], 
        sep=', '))

Here is how this works:

  • This works similarly to the code in the primary solution above except that, since we wish to concatenate the values of more than two columns, we pass to str.cat() a list of the columns whose values we wish to concatenate; which in this case is [df['col_2'], df['col_3'].
  • The output is the original data frame df with an additional columncol_4 holding the concatenation of col_1, col_2 and col_3 with a comma and space , as a separator.

Extension: Handling Missing Values

We wish to concatenate the values of two columns and substitute missing values with a given token (string).

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 with a comma as a separator into a new column col_3. We wish to substitute missing values with ‘-’.

df_2 = df.assign(
    col_3 = df['col_1'].str.cat(
        df['col_2'],
        sep=', ',
        na_rep='-'))

Here is how this works:

  • If we run str.cat() on columns that have missing values, any concatenation operation where any of the values being concatenated is a missing value would have returned a missing value as the result of the operation.
  • One way to deal with this it to replace missing values with some particular string (which can be an empty string). In this example, we wish to use ‘-’ as a replacement for missing values.
  • Conveniently, we can pass to str.cat() an additional parameter to its na_rep argument specifying the string to use wherever any of the values being concatenated is a missing value. In this case, we set na_rep='-'.
  • See Missing Values for a general coverage of working with missing values.

Collapsing

We wish to concatenate all values of a string column into one string with an optional separator.

In this example, we wish to concatenate the values of the column col_2 into a single string for each group where the groups are defined by the values of the column col_1.

df_2 = (df
    .groupby('col_1')
    .agg(col_3=('col_2', lambda x: x.str.cat(sep=', '))))

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • If we do not pass any additional columns to str.cat(), it will collapse the values of the column it is called on into a single string literal.
  • In this case, we collapse the values of the column col_2 for each group into a single string literal with ', ' as a separator.
  • We use a lambda function, so we may call the str.cat() method on the values of the col_2 for each group from within agg().
  • Note: This df.groupby('col_1')['col_2'].str.cat(sep=', ') does not work because str.cat() is not defined for SeriesGroupBy objects. Hence, our use of a lambda function inside agg() to act on the sub-Series that is the values of col_2 for the current group.
  • For dealing with missing values, see “Extension: Handling Missing Values” under Concatenating above.

Extension: Collapse Entire Column

col_2_str = df['col_2'].str.cat(sep=', ')

Here is how this works:

  • As described above, if we do not pass any additional columns to str.cat(), it will collapse the values of the column it is called on into a single string literal.
  • In this example, we reduce the entire column col_2 of the data frame df into a single string literal col_2_str.

Interpolating

We wish to create a new string column by inserting the value of one or more string variables into a string template.

In this example, we wish to create a new column col_3 from the values of the columns col_1 and col_2 where each value follows the template '{col_1} vs {col_2}' for the corresponding rows.

df_2 = df.assign(
    col_3 = df.apply(
        lambda x: '{} vs {}'.format(x['col_1'], x['col_2']),
        axis=1))

Here is how this works:

  • There is no built-in Pandas Series method for interpolating into a string template. Therefore, we use the Python format() function.
  • The string format() function is not vectorized, therefore we need to use apply() with axis=1 to apply format() in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and uses the python function format() to insert the values of the columns col_1 and col_2 for the current row into the template '{} vs {}'.
  • The format() function allows us to include placeholders denoted by {} in a string, which are replaced with the values passed to the format() function. format() can also work with named placeholders such as {col} which are replaced with the corresponding keyword argument passed to the format() function.

Repeating

We wish to create a new string column by repeating the values in an existing string column a set number of times.

In this example, we wish to create a new column col_3 where each value is the repeating (concatenation to itself) of the corresponding value in the col_2 3 times.

df_2 = df.assign(
  col_3 = df['col_2'].str.repeat(3))

Here is how this works:

  • We use the str.repeat() method, from the str accessor set of string manipulation methods of Pandas Series, to create a new Series where each value in the original Series is repeated (concatenated to itself) a particular number of times.
  • The str.repeat() method takes an integer argument, which in this case is 3, specifying the number of times each value of the calling string column will be repeated.
  • The output is the original data frame df with an additional columncol_3 holding the values from the column col_2 each repeated 3 times.

Padding

We wish to pad a string with a specified character for up to a given width.

In this example, we wish to create a new column col_3 that is a padding of the values of the column col_2 to get each to 4 characters.

df_2 = df.assign(
    col_3 = df['col_2'].str.pad(
        width=4,
        side='left',
        fillchar='*'))

Here is how this works:

  • We use the str.pad() method, from the str accessor set of string manipulation methods of Pandas Series, to pad the values of the column col_2 by adding the character * to the left so each string value is 4 characters long.
  • The str.pad() method takes three arguments:
    • width which is the desired length of the output padded string and which in this case is 4.
    • side is the side of the string on which the padding should be applied and takes one of three values 'left', 'right', and 'both' and which in this case is left.
    • fillchar which specified the character to use for padding the input string to match the given width and which in this case is '*'.
  • The output is the original data frame df with an additional columncol_3 holding the values from the column col_2 with padding applied to the left side of each value.
PYTHON
I/O