In this section, we cover the five common string combining scenarios that we come across in data manipulation contexts:
We wish to concatenate the values of one column to the corresponding values of another column.
In this example, we wish to concatenate the values of the column col_1
to the corresponding values of the column col_2
with a comma as a separator into a new column col_3
.
df_2 = df.assign(
col_3 = df['col_1'].str.cat(df['col_2'], sep=', '))
Here is how this works:
str.cat()
method from the str
accessor set of string manipulation methods of Pandas Series
to concatenate the values in the column col_1
with the corresponding values in the column col_2
with a separator ', ‘
in between.str.cat()
method takes two arguments:col_2
)sep
that specifies the separator to insert between the values in the concatenated columns; which in this case is sep = ', ‘
.df
with an additional columncol_3
holding the concatenation of col_1
and col_2
with a comma and space (,
) as a separator.Extension: Concatenate Fixed String
We wish to concatenate a fixed string to each value of a column of a data frame.
In this example, we wish to append a question mark to each value in the column col_1
.
df_2 = df.assign(
col_2 = df['col_1'] + '?')
Here is how this works:
df
with an additional columncol_2
holding the concatenation of '?'
to each element of the column col_1
.Extension: More than 2 Columns
We wish to concatenate the values of multiple columns; i.e. concatenate the values of those columns for each row.
In this example, we wish to concatenate the values of the column col_1
to the corresponding values of the column col_2
and col_3
with a comma as a separator into a new column col_4
.
df_2 = df.assign(
col_4 = df['col_1'].str.cat(
[df['col_2'], df['col_3']],
sep=', '))
Here is how this works:
str.cat()
a list of the columns whose values we wish to concatenate; which in this case is [df['col_2'], df['col_3']
.df
with an additional columncol_4
holding the concatenation of col_1,
col_2
and col_3
with a comma and space ,
as a separator.Extension: Handling Missing Values
We wish to concatenate the values of two columns and substitute missing values with a given token (string).
In this example, we wish to concatenate the values of the column col_1
to the corresponding values of the column col_2
with a comma as a separator into a new column col_3
. We wish to substitute missing values with ‘-’
.
df_2 = df.assign(
col_3 = df['col_1'].str.cat(
df['col_2'],
sep=', ',
na_rep='-'))
Here is how this works:
str.cat()
on columns that have missing values, any concatenation operation where any of the values being concatenated is a missing value would have returned a missing value as the result of the operation.‘-’
as a replacement for missing values.str.cat()
an additional parameter to its na_rep
argument specifying the string to use wherever any of the values being concatenated is a missing value. In this case, we set na_rep='-'
.We wish to concatenate all values of a string column into one string with an optional separator.
In this example, we wish to concatenate the values of the column col_2
into a single string for each group where the groups are defined by the values of the column col_1
.
df_2 = (df
.groupby('col_1')
.agg(col_3=('col_2', lambda x: x.str.cat(sep=', '))))
Here is how this works:
groupby()
method groups the rows of the data frame df
by the values in the column col_1
. The data aggregation operations applied by the subsequent call to agg()
are, therefore, applied to each group separately. See Summary Table.str.cat()
, it will collapse the values of the column it is called on into a single string literal.col_2
for each group into a single string literal with ', '
as a separator.str.cat()
method on the values of the col_2
for each group from within agg()
.df.groupby('col_1')['col_2'].str.cat(sep=', ')
does not work because str.cat()
is not defined for SeriesGroupBy
objects. Hence, our use of a lambda function inside agg()
to act on the sub-Series that is the values of col_2
for the current group.Extension: Collapse Entire Column
col_2_str = df['col_2'].str.cat(sep=', ')
Here is how this works:
str.cat()
, it will collapse the values of the column it is called on into a single string literal.col_2
of the data frame df
into a single string literal col_2_str
.We wish to create a new string column by inserting the value of one or more string variables into a string template.
In this example, we wish to create a new column col_3
from the values of the columns col_1
and col_2
where each value follows the template '{col_1} vs {col_2}'
for the corresponding rows.
df_2 = df.assign(
col_3 = df.apply(
lambda x: '{} vs {}'.format(x['col_1'], x['col_2']),
axis=1))
Here is how this works:
Series
method for interpolating into a string template. Therefore, we use the Python format()
function.format()
function is not vectorized, therefore we need to use apply()
with axis=1
to apply format()
in a row wise manner to each row of the data frame df
. See Non-Vectorized Transformation.lambda
function passed to apply()
takes a row of the data frame (represented by the variable x
) and uses the python function format()
to insert the values of the columns col_1
and col_2
for the current row into the template '{} vs {}'
.format()
function allows us to include placeholders denoted by {}
in a string, which are replaced with the values passed to the format()
function. format()
can also work with named placeholders such as {col}
which are replaced with the corresponding keyword argument passed to the format()
function.We wish to create a new string column by repeating the values in an existing string column a set number of times.
In this example, we wish to create a new column col_3
where each value is the repeating (concatenation to itself) of the corresponding value in the col_2
3 times.
df_2 = df.assign(
col_3 = df['col_2'].str.repeat(3))
Here is how this works:
str.repeat()
method, from the str
accessor set of string manipulation methods of Pandas Series
, to create a new Series
where each value in the original Series is repeated (concatenated to itself) a particular number of times.str.repeat()
method takes an integer argument, which in this case is 3, specifying the number of times each value of the calling string column will be repeated.df
with an additional columncol_3
holding the values from the column col_2
each repeated 3 times.We wish to pad a string with a specified character for up to a given width.
In this example, we wish to create a new column col_3
that is a padding of the values of the column col_2 to get each to 4 characters.
df_2 = df.assign(
col_3 = df['col_2'].str.pad(
width=4,
side='left',
fillchar='*'))
Here is how this works:
str.pad()
method, from the str
accessor set of string manipulation methods of Pandas Series
, to pad the values of the column col_2
by adding the character *
to the left so each string value is 4
characters long.str.pad()
method takes three arguments:width
which is the desired length of the output padded string and which in this case is 4
.side
is the side of the string on which the padding should be applied and takes one of three values 'left'
, 'right'
, and 'both'
and which in this case is left
.fillchar
which specified the character to use for padding the input string to match the given width
and which in this case is '*'
.df
with an additional columncol_3
holding the values from the column col_2
with padding applied to the left side of each value.