In this note, I use df for a DataFrame and s for a Series.

Libraries

import pandas as pd # import pandas package
import numpy as np

Other tasks

Things to check

  1. csv file:
    1. Values are separated by , or ;?
    2. Encoding.
    3. Timestamp type.
  2. Indexes are sorted?
  3. Indexes are continuous with step 1 (especially after using .dropna() or .drop_duplicates())?
  4. Are there NaN values? Drop them?
  5. Are there duplicates? Drop them?
  6. How many unique values?
  7. For 0/1 features, do they have only 2 unique values (0 and 1)?
  8. KDE plot to check the values distribution.
  9. The number of columns?
  10. Unique labels?
  11. Time series:
    1. Time range.
    2. Time step.
    3. Timestamp’s type.
    4. Timezone.
    5. Timestamps are monotonic?
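Most of the checks above are one-liners in pandas. A minimal sketch on a toy frame (column names and values are made up):

```python
import pandas as pd
import numpy as np

# toy frame (hypothetical data) to run the checks against
df = pd.DataFrame({
    'flag':  [0, 1, 1, 0],
    'value': [1.0, np.nan, 3.0, 1.0],
})

print(df.index.is_monotonic_increasing)    # indexes sorted?
print(df.isna().any().any())               # any NaN values?
print(df.duplicated().any())               # any duplicated rows?
print(df['flag'].nunique())                # number of unique values
print(set(df['flag'].unique()) <= {0, 1})  # 0/1 feature really binary?
print(df.shape[1])                         # number of columns
```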

Deal with columns

Remove or Keep some

# REMOVING COLUMNS
df.drop('New', axis=1, inplace=True) # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)
# ONLY KEEP SOME
kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]
# ALL EXCEPT SOME
df = df[df.columns.difference(['b'])] # all columns except 'b'

Rename columns

# IMPLICITLY
df.columns = ['Surname', 'Years', 'Grade', 'Location']

# EXPLICITLY
df.rename(columns={'Name': 'Surname', 'Ages': 'Years'}, inplace=True)
# A SPECIFIC COLUMN
df.rename(columns={'gdp': 'log(gdp)'}, inplace=True)
# RENAME INDEX COLUMN
df.index.name = 'new_name'

Make index

# COLUMN HAS UNIQUE VALUES?
df['col'].is_unique # True if yes
# INDEX -> NORMAL COLUMN
df.reset_index(inplace=True)

# NORMAL COLUMN -> INDEX
df = df.set_index('column') # set_index returns a new DataFrame by default
df = df.set_index(['col1', 'col2']) # multi-index
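A minimal sketch of turning a column into the index (toy data, hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2]})

# set_index is not in-place by default: assign the result back (or pass inplace=True)
df = df.set_index('col1')
print(df.index.tolist())  # ['a', 'b']
```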

Drop duplicates

👉 Overview duplicates.

# check duplicates
df['Student'].duplicated().any()

# remove duplicates in some columns
df.drop_duplicates(['col1', 'col2'])
# use "ignore_index=True" if you want to reset the index to 0,1,...,n-1
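A quick sketch of drop_duplicates with ignore_index on toy data (values are made up):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': ['x', 'x', 'y']})

# drop rows duplicated on (col1, col2); ignore_index renumbers the index 0..n-1
out = df.drop_duplicates(['col1', 'col2'], ignore_index=True)
print(out.index.tolist())  # [0, 1]
```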

Combine different columns

df = df0[['Date', 'Heure', 'tH (°C)']].copy()
df['timestamp'] = df['Date'] + ' ' + df['Heure']

# Without `.copy()`, you may get:
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
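A runnable toy version of the pattern above (the Date/Heure values are made up):

```python
import pandas as pd

# toy frame mimicking the Date/Heure columns above
df0 = pd.DataFrame({'Date': ['01/02/2020'], 'Heure': ['12:30'], 'tH (°C)': [21.5]})

df = df0[['Date', 'Heure', 'tH (°C)']].copy()  # .copy() avoids SettingWithCopyWarning
df['timestamp'] = df['Date'] + ' ' + df['Heure']
print(df['timestamp'].tolist())  # ['01/02/2020 12:30']
```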

Deal with missing values NaN

👉 Overview missing values.

Drop NaN values

Full reference of dropna is here.

# Drop any rows which have any nans
df.dropna()

# Drop if all values in that row/columns are NA
df.dropna(how='all') # default: how='any'

# Drop columns that have any nans
df.dropna(axis=1)

# Keep only columns with at least 90% non-NaN values (drop the rest)
df.dropna(thresh=int(df.shape[0] * .9), axis=1)
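To see the thresh rule in action on toy data: with 10 rows, thresh = int(10 * .9) = 9, so a column needs at least 9 non-NaN values to survive.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'mostly_full':  [1.0] * 9 + [np.nan],      # 9 non-NaN -> kept
    'mostly_empty': [1.0] * 5 + [np.nan] * 5,  # 5 non-NaN -> dropped
})

out = df.dropna(thresh=int(df.shape[0] * .9), axis=1)
print(list(out.columns))  # ['mostly_full']
```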

Fill NaN with others

Check other methods of fillna here.

# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')

# Fill NaN with 99
df['col'] = df['col'].fillna(99)

# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())

Do with conditions

np.where(condition, value_if_true, value_if_false)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar') # example
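A runnable sketch of the np.where pattern (toy column and values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': [5, 15, 20]})  # toy values

# 'foo' where col > 10, 'bar' otherwise
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar')
print(df['new_column'].tolist())  # ['bar', 'foo', 'foo']
```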

Work with text data

There are many methods for working with text data (pd.Series.str). They can be combined with regular expressions.
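For instance (toy data), a few common .str methods, including one taking a regex:

```python
import pandas as pd

s = pd.Series(['Alice Smith', 'bob JONES'])  # toy text data

print(s.str.lower().tolist())                         # ['alice smith', 'bob jones']
print(s.str.contains(r'smith', case=False).tolist())  # [True, False]
print(s.str.split().str[0].tolist())                  # ['Alice', 'bob']
```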