This note is not complete. It is being updated.
This note is updated frequently without notice!

Open this html file Open In Colab

Sometimes, we wanna couple multiple dataframes together. In this note, I use df as DataFrame, s as Series.

Libraries

import pandas as pd
import numpy as np

Coupling dfs with merge()

There are 4 types of merging, like in SQL.

  • Inner: only includes elements that appear in both dataframes with a common key.
  • Outer: includes all data from both dataframes.
  • Left: includes all of the rows from the “left” dataframe along with any rows from the “right” dataframe with a common key; the result retains all columns from both of the original dataframes.
  • Right: includes all of the rows from the “right” dataframe along with any rows from the “left” dataframe with a common key; the result retains all columns from both of the original dataframes.

Types of merge Types of merge, source.

On the same column name,

# left
df_left = pd.merge(left=df1, right=df2, how='left', on='Col_1', suffixes=('_df1', '_df2'))
# right
df_right = pd.merge(left=df1, right=df2, how='right', on='Col_1', suffixes=('_df1', '_df2'))

display_side_by_side(df1, df2, df_left, df_right)

Merge left and right on the same column name

# inner (defaut)
df_inner = pd.merge(left=df1, right=df2, on='Col_1', suffixes=('_df1', '_df2'))
# outer
df_outer = pd.merge(left=df1, right=df2, how='outer', on='Col_1', suffixes=('_df1', '_df2'))

display_side_by_side(df1, df2, df_inner, df_outer)

Merge inner and outer on the same column name

On the different column names,

# left
df_left = pd.merge(left=df1, right=df2, how='left', left_on='Col_1', right_on='Col_X', suffixes=('_df1', '_df2'))

display_side_by_side(df1, df2, df_left)

Merge left on different column names

The result keeps both Col_1 and Col_X while in the case of the same column name, there is only 1 column. Other words, in this case, we only want to keep Col_1 and don’t need Col_X. How to do that?

df_left = df1.set_index('Col_1').join(df2.set_index('Col_X'), how="left", lsuffix="_df1", rsuffix="_df2").reset_index()

display_side_by_side(df1, df2, df_left)

Merge left on different column names and keep one column on the result

Concatenate dfs with concat()

# axis=0 (default)
df_concat_0 = pd.concat([df1, df2]) # the same columns
df_concat_1 = pd.concat([df1, df2], axis=1) # the same rows

df_concat_0_idx = pd.concat([df1, df2], ignore_index=True)
# ignore_index=True prevent duplicating indexes 

display_side_by_side(df1, df2, df_concat_0, df_concat_1, df_concat_0_idx)

Concatenate 2 dataframes

Combine 2 dataframes with missing values

We consider a situation in that we need to combine 2 dfs containing missing values in each. The missing values will be filled by taking from the others. For example, the value of C in the left df can be fulfilled by the value of C in the right df.

Example

We can apply rules with df.combine.

Notice an error?

Everything on this site is published on Github. Just summit a suggested change or email me directly (don't forget to include the URL containing the bug), I will fix it.