This note is updated frequently without notice!

In this note, df denotes a DataFrame and s denotes a Series.


import pandas as pd # import pandas package
import numpy as np

Other tasks

Deal with columns

Remove or Keep some

Removing columns,

df.drop('New', axis=1, inplace=True) # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)

Only keep some,

kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]

Choose all columns except some,
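A minimal sketch of two ways this could be done. The frame and its column names (a, b, c) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical sample frame; the column names are for illustration only
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Option 1: drop returns a new frame without the listed columns
all_but_b = df.drop(columns=['b'])

# Option 2: Index.difference keeps every column not in the list
# (the resulting labels are sorted, so the original order may change)
all_but_b_2 = df[df.columns.difference(['b'])]

print(list(all_but_b.columns))
print(list(all_but_b_2.columns))
```

Option 1 preserves the original column order, which is usually what we want when the order matters.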


Rename columns

In this part, we use the dataframe df below.

   Name  Ages  Marks    Place
0  John    10      8  Ben Tre
1   Thi    20      9    Paris

# implicitly (replace all column names at once, in order)
df.columns = ['Surname', 'Years', 'Grade', 'Location']

# explicitly (map old names to new names)
df.rename(columns={
  'Name': 'Surname',
  'Ages': 'Years',
}, inplace=True)

We can also use the explicit method to rename a single column (here, in a dataframe named data).

data.rename(columns={'gdp':'log(gdp)'}, inplace=True)
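As a runnable sketch of the explicit method, the snippet below rebuilds the sample dataframe from this part and renames two columns without inplace=True. The variable name renamed is an assumption made for this example.

```python
import pandas as pd

# Rebuild the sample dataframe used in this part
df = pd.DataFrame({
    'Name': ['John', 'Thi'],
    'Ages': [10, 20],
    'Marks': [8, 9],
    'Place': ['Ben Tre', 'Paris'],
})

# Without inplace=True, rename returns a new frame and leaves df untouched
renamed = df.rename(columns={'Name': 'Surname', 'Ages': 'Years'})

print(list(renamed.columns))
print(list(df.columns))
```

Columns that are not listed in the mapping keep their names.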

Make index

Check if a column has unique values (so that it can be an index)

df['col'].is_unique # True if yes
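A small sketch of this check, assuming a hypothetical frame with an id column (all values distinct) and a group column (with repeats):

```python
import pandas as pd

# Hypothetical data: 'id' has distinct values, 'group' has duplicates
df = pd.DataFrame({'id': [1, 2, 3], 'group': ['a', 'a', 'b']})

id_ok = df['id'].is_unique        # no repeated values
group_ok = df['group'].is_unique  # 'a' appears twice

print(id_ok, group_ok)
```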

Transform an index into a normal column,
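One way to do this is reset_index. The sketch below uses a hypothetical frame whose index is named Name.

```python
import pandas as pd

# Hypothetical frame whose index holds the names
df = pd.DataFrame(
    {'Ages': [10, 20]},
    index=pd.Index(['John', 'Thi'], name='Name'),
)

# reset_index moves the index into a regular column
# and replaces it with the default 0..n-1 index
df_flat = df.reset_index()

print(df_flat)
```

If the old index is not needed at all, reset_index(drop=True) discards it instead of turning it into a column.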


Make a column (or several columns) the index,

df.set_index(['col1', 'col2'])
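Note that set_index returns a new DataFrame unless inplace=True is given, and that passing two columns produces a MultiIndex. A minimal sketch, where the val column and its data are made up for illustration:

```python
import pandas as pd

# Hypothetical data; 'val' is an illustrative extra column
df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': [1, 2, 1],
    'val': [10, 20, 30],
})

# The result must be assigned, since df itself is not modified
indexed = df.set_index(['col1', 'col2'])

# Rows are now addressed by a (col1, col2) tuple
value = indexed.loc[('a', 2), 'val']

print(indexed)
print(value)
```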

Deal with missing values NaN

Drop if NaN

# Drop any rows which have any NaNs
df.dropna()

# Drop columns that have any NaNs
df.dropna(axis=1)

# Only keep columns which have at least 90% non-NaNs
df.dropna(thresh=int(df.shape[0] * .9), axis=1)
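To make the effect of each call concrete, here is a sketch on a small hypothetical frame. It uses a 75% threshold instead of 90% only because the example has four rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 4 rows, columns with 0, 1 and 3 missing values
df = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'one_nan': [1.0, np.nan, 3.0, 4.0],
    'mostly_nan': [np.nan, np.nan, np.nan, 4.0],
})

# Keep only rows without any NaN (only the last row qualifies)
rows_clean = df.dropna()

# Keep only columns without any NaN
cols_clean = df.dropna(axis=1)

# thresh is the minimum number of non-NaN values a column needs to be kept:
# int(4 * 0.75) = 3, so 'mostly_nan' (1 non-NaN value) is dropped
cols_thresh = df.dropna(thresh=int(df.shape[0] * 0.75), axis=1)

print(rows_clean)
print(list(cols_clean.columns))
print(list(cols_thresh.columns))
```

None of these calls modifies df; assign the result if it should be kept.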

Fill NaN with others

See the pandas documentation of fillna for other options.

# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')

# Fill NaN with 99
df['col'] = df['col'].fillna(99)

# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())
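The same ideas on a small Series s, as a sketch with made-up values. The forward fill at the end is an extra option not listed above, shown only as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# Fill with a constant
filled_const = s.fillna(99)

# Fill with the mean; mean() skips NaN, so it is (1 + 3) / 2 = 2
filled_mean = s.fillna(s.mean())

# Alternative: propagate the previous valid value forward
filled_ffill = s.ffill()

print(filled_const.tolist())
print(filled_mean.tolist())
print(filled_ffill.tolist())
```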

Do with conditions

# np.where(condition, value_if_true, value_if_false)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar') # example
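A self-contained sketch of the pattern, assuming a hypothetical numeric column named col:

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'col': [5, 10, 15]})

# Element-wise choice: 'foo' where the condition holds, 'bar' elsewhere
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar')

print(df)
```

The comparison is strict, so the row with the value 10 gets 'bar'.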

Work with text data

There are many methods for working with text data (pd.Series.str). We can combine them with regular expressions.
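A short sketch of a few .str methods used with regular expressions. The sample strings and the 4-digit-year pattern are assumptions made for this example.

```python
import pandas as pd

# Hypothetical text data with one missing entry
s = pd.Series(['Ben Tre 2019', 'Paris 2020', None])

# Does the text contain a 4-digit number? Missing values count as False
has_year = s.str.contains(r'\d{4}', na=False)

# Extract the first 4-digit group as a Series of strings
years = s.str.extract(r'(\d{4})', expand=False)

# Remove a trailing year and the whitespace before it
cleaned = s.str.replace(r'\s*\d{4}$', '', regex=True)

print(has_year.tolist())
print(years.tolist())
print(cleaned.tolist())
```

Missing values stay missing in the results of extract and replace.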

Notice an error?

Everything on this site is published on GitHub. Just submit a suggested change or email me directly (don't forget to include the URL containing the bug) and I will fix it.