Data Processing & Cleaning

Anh-Thi Dinh
In this note, I use df for a DataFrame and s for a Series.

Libraries

import pandas as pd # import pandas package
import numpy as np

Things to check

  1. csv file:
    1. Values are separated by , or ;?
    2. Encoding.
    3. Timestamp type.
  2. Indexes are sorted?
  3. Indexes are continuous with step 1 (especially after using .dropna() or .drop_duplicates())?
  4. Are there NaN values? Drop them?
  5. Are there duplicates? Drop them?
  6. How many unique values are there?
  7. Do 0/1 features have only 2 unique values (0 and 1)?
  8. KDE plot to check the distribution of values.
  9. The number of columns?
  10. Unique labels?
  11. Time series:
    1. Time range.
    2. Time step.
    3. Timestamp's type.
    4. Timezone.
    5. Timestamps are monotonic?
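
A minimal sketch of these checks in pandas ('col' is a placeholder column name; the time-series checks assume df.index is a DatetimeIndex):

df.shape # number of rows and columns
df.isna().any() # any NaN per column?
df.duplicated().any() # any duplicated rows?
df['col'].nunique() # how many unique values?
df['col'].unique() # the unique values themselves (e.g. check a 0/1 feature)
df.index.is_monotonic_increasing # indexes / timestamps sorted?

# time series (assuming a DatetimeIndex)
df.index.min(), df.index.max() # time range
df.index.to_series().diff().value_counts() # time step(s)
df.index.dtype, df.index.tz # timestamp type and timezone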

Deal with columns

Remove or Keep some

# REMOVING COLUMNS
df.drop('New', axis=1, inplace=True) # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)

# ONLY KEEP SOME
kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]

# ALL EXCEPT SOME
df[df.columns.difference(['b'])]

Rename columns

# IMPLICITLY
df.columns = ['Surname', 'Years', 'Grade', 'Location']

# EXPLICITLY
df.rename(columns={'Name': 'Surname', 'Ages': 'Years'}, inplace=True)

# A SPECIFIC COLUMN
df.rename(columns={'gdp': 'log(gdp)'}, inplace=True)

# RENAME INDEX COLUMN
df.index.name = 'new_name'

Make index

# COLUMN HAS UNIQUE VALUES?
df['col'].is_unique # True if yes

# INDEX -> NORMAL COLUMN
df.reset_index(inplace=True)

# NORMAL COLUMN -> INDEX
df.set_index('column')
df.set_index(['col1', 'col2']) # MultiIndex
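
If the chosen column must be unique, set_index can check this for you via verify_integrity (it raises a ValueError on duplicates):

# raise an error instead of silently creating a non-unique index
df.set_index('column', verify_integrity=True)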

Drop duplicates

👉 Check section "Duplicates" in the note Data Overview.

# check duplicates
df['Student'].duplicated().any()

# remove duplicates in some columns
df.drop_duplicates(['col1', 'col2'])
# use "ignore_index=True" if you want to reset the indexes to 0,1,...,n-1
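
A sketch combining these options (keep='last' keeps the last occurrence instead of the first):

df.drop_duplicates(['col1', 'col2'], keep='last', ignore_index=True)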

Combine different columns

df = df0[['Date', 'Heure', 'tH (°C)']].copy()
df['timestamp'] = df['Date'] + ' ' + df['Heure']

# Without `.copy()`, pandas shows:
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
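
The combined column is still a string; a likely next step (a sketch, assuming day-first dates as in French data) is parsing it with pd.to_datetime:

# dayfirst=True is an assumption about the source format
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)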

Deal with missing values NaN

👉 Check section "Missing values" in the note Data Overview.

Drop NaN values

The full reference of dropna is in the pandas documentation.

# Drop rows which have any NaNs
df.dropna()

# Drop a column if all of its values are NaN
df.dropna(how='all', axis=1)

# Drop columns that have any NaNs
df.dropna(axis=1)

# Only keep columns having at least 90% non-NaN values
df.dropna(thresh=int(df.shape[0]*.9), axis=1)

# Only keep rows having >= 2 non-NaN values
df.dropna(thresh=2)

# Only consider some columns
df.dropna(subset=['col1', 'col2'])

# multi-index columns
df.dropna(subset=[(1,'a'), (1,'b'), (2,'a'), (2,'b')])

# consider all columns under the first-level labels 1 and 2
df.dropna(subset=df.loc[[], [1,2]].columns)
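
A tiny worked example of the thresh logic above (a sketch with made-up data): with 10 rows, thresh=int(10*.9)=9, so a column needs at least 9 non-NaN values to survive.

toy = pd.DataFrame({'a': [1]*10, 'b': [1]*8 + [np.nan]*2})
toy.dropna(thresh=int(toy.shape[0]*.9), axis=1) # keeps 'a' only ('b' has 8 < 9 non-NaNs)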

Fill NaN with others

Check other methods of fillna in the pandas documentation.

# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')

# Fill NaN with 99
df['col'] = df['col'].fillna(99)

# Fill NaN in each column with that column's mean / median
df.fillna(df.mean())

# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())

# Fill NaN with the mean of the row
m = df.mean(axis=1)
for col in df.columns:
    df.loc[:, col] = df.loc[:, col].fillna(m)
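
fillna cannot fill along rows with a Series directly, hence the loop above; an equivalent (if slower) one-liner applies the fill row by row, a sketch:

df = df.apply(lambda row: row.fillna(row.mean()), axis=1)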

Do with conditions

np.where(if_this_condition_is_true, do_this, else_this)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar') # example
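
For more than two cases, np.select generalizes np.where (a sketch with hypothetical thresholds):

conditions = [df['col'] > 10, df['col'] > 5]
choices = ['high', 'medium']
df['new_column'] = np.select(conditions, choices, default='low')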

Work with text data

There are a lot of methods for working with text data (pd.Series.str). We can use them together with regular expressions; a few are sketched below.
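
A few common ones (a sketch; the patterns are only illustrative):

s.str.lower() # lowercase
s.str.strip() # trim surrounding whitespace
s.str.contains(r'\d+') # regex: contains a digit?
s.str.replace(r'\s+', ' ', regex=True) # collapse whitespace
s.str.extract(r'(\d{4})') # extract a 4-digit group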