Appending

We wish to combine multiple data frames into a single data frame vertically where rows are appended to each other, or horizontally, where columns are combined side by side. Concatenating rows together is referred to as Union in SQL.

This section is organized as follows:

  • Append Rows where we cover how to append two or more data frames vertically. i.e. concatenate rows while aligning common columns.
  • Append Columns where we cover how to append two or more data frames horizontally. This is similar to Joining but can only be done based on the index. We recommend using the methods covered in Joining to append data frames horizontally.

Append Rows

We wish to append two data frames together vertically where rows are appended to each other and columns with the same name in both data frames are aligned based on their names. This is also referred to as Concatenating or Union in SQL.

In this example, we wish to append df_1 and df_2 and return a single data frame with all rows and columns from both data frames.

df = pd.concat([df_1, df_2], ignore_index=True)

Here is how this works:

  • We use the concat() Pandas function to append rows from df_1 and df_2 vertically.
  • We pass a list of the data frames we wish to append to concat(), in this example [df_1, df_2].
  • By default, concat() will keep the original index. We set ignore_index=True to reset the index in case the original index is not meaningful.
  • Columns present in one data frame only will be filled with NA for rows coming from the other data frame.
  • The output data frame df will contain all the rows from both df_1 and df_2, appended one after the other and aligned based on column names. If a column only exists in one data frame, it will be added to the left of the output data frame and filled with NA for the rows coming from the other data frame.

Extension: Data Frame Identifier

We wish to create a new column of identifiers to link each row to its original data frame.

In this example, we wish to append df_1 and df_2 vertically and add a new column named source to the output data frame df.

df = pd.concat([df_1, df_2], keys=['df_1', 'df_2'])
df['source'] = df.index.get_level_values(0)

Here is how this works:

  • We use concat() similarly to the Append Rows solution above.
  • We pass a list of keys to identify each data frame in the same order they are passed to concat(). In this example keys=['df_1', 'df_2']. This will create a MultiIndex where level 0 is the keys and level 1 is the original index.
  • In df['source'] = df.index.get_level_values(0) we convert level 0 of the MultiIndex to a column named source.
  • The output data frame df is similar to the Append Rows solution above with a new column named source that links each row to the original data frame using the keys df_1 and df_2.

Extension: Chain Concat

wish to apply the concat() function in a chainable manner.

In this example, we wish to append df_1 and df_2 and return a single data frame with all rows from both data frames and rename col_5 from df_2 to col_4.

df = df_1. \
    pipe(lambda x: pd.concat([x, df_2], ignore_index=True)). \
    rename(columns={'col_5': 'col_4'})

Here is how it works:

  • We use the pipe() function from Pandas to chain together operations (like the %>% operation in R).
  • We use a lambda function to pass the df_1 (as x) and df_2 to concat() and then we rename col_5 using rename(). See Renaming.
  • The output is similar to the Append Rows scenario above with col_5 renamed to col_4.

Append Columns

We wish to append two or more data frames horizontally. This is similar to joining but can only be done based on the index. We recommend using the methods covered in Joining to append data frames horizontally.

In this example, we wish to append df_1 and df_2 horizontally and align rows based on the index of the data frames.

df = pd.concat([df_1, df_2], axis='columns', ignore_index=True)

Here is how this works:

  • We use the concat() Pandas function to append rows from df_1 and df_2 horizontally.
  • We pass a list of data frames we wish to append to concat() in this example [df_1, df_2].
  • By default, concat() works on rows (axis='rows'). We set the axis argument to axis='columns' to append columns based on the index.
  • By default, concat() will keep the original index. We set ignore_index=True to reset the index in case the original index is not meaningful.
  • The output data frame df will contain all the rows from both df_1 and df_2, appended horizontally and aligned based on the index. All columns from both data frames will be in the output data frame.
  • We recommend using methods covered in Joining to perform column appending.
PYTHON
I/O