
Data Processing & Cleaning

Anh-Thi Dinh
Data Science
In this note, I use df for a DataFrame and s for a Series.

Libraries

import pandas as pd
import numpy as np

Things that need to be checked

  1. CSV file:
    1. Are values separated by , or ;?
    2. Encoding.
    3. Timestamp type.
  2. Are indexes sorted?
  3. Are indexes continuous with step 1 (especially after using .dropna() or .drop_duplicates())?
  4. Are there NaN values? Drop them?
  5. Are there duplicates? Drop them?
  6. How many unique values?
  7. For 0/1 features, do they have only 2 unique values (0 and 1)?
  8. KDE plot to check the distribution of values.
  9. The number of columns?
  10. Unique labels?
  11. Time series:
    1. Time range.
    2. Time step.
    3. Timestamp's type.
    4. Timezone.
    5. Are timestamps monotonic?
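
A minimal sketch of these checks with pandas. The file name data.csv, the separator, and the column names ts and col are placeholders; the KDE plot also assumes matplotlib (and scipy) are installed.

import pandas as pd

# read the CSV: set the separator (',' or ';') and the encoding explicitly
df = pd.read_csv('data.csv', sep=';', encoding='utf-8', parse_dates=['ts'])

# shape, columns and dtypes (including the timestamp type)
print(df.shape)
print(df.dtypes)

# NaN values and duplicates
print(df.isna().sum())        # number of NaNs per column
print(df.duplicated().sum())  # number of duplicated rows

# unique values (a 0/1 feature should have exactly 2)
print(df['col'].nunique())
print(df['col'].unique())

# indexes sorted?
print(df.index.is_monotonic_increasing)

# time series: range, step, timezone, monotonic timestamps
print(df['ts'].min(), df['ts'].max())         # time range
print(df['ts'].diff().value_counts().head())  # time step(s)
print(df['ts'].dt.tz)                         # timezone (None if naive)
print(df['ts'].is_monotonic_increasing)

# distribution of a numeric column
df['col'].plot(kind='kde')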

Deal with columns

Remove or Keep some

# REMOVING COLUMNS
df.drop('New', axis=1, inplace=True)  # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)

# ONLY KEEP SOME
kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]

# ALL EXCEPT SOME
df[df.columns.difference(['b'])]

Rename columns

# IMPLICITLY
df.columns = ['Surname', 'Years', 'Grade', 'Location']

# EXPLICITLY
df.rename(columns={'Name': 'Surname', 'Ages': 'Years'}, inplace=True)

# A SPECIFIC COLUMN
data.rename(columns={'gdp': 'log(gdp)'}, inplace=True)

# RENAME INDEX COLUMN
df.index.name = 'new_name'

Make index

# COLUMN HAS UNIQUE VALUES?
df['col'].is_unique  # True if yes

# INDEX -> NORMAL COLUMN
df.reset_index(inplace=True)

# NORMAL COLUMN -> INDEX
df.set_index('column')
df.set_index(['col1', 'col2'])

Drop duplicates

👉 Check section "Duplicates" in the note Data Overview.

# check duplicates
df['Student'].duplicated().any()

# remove duplicates in some columns
df.drop_duplicates(['col1', 'col2'])
# use "ignore_index=True" if you want to reset the indexes to 0, 1, ..., n-1

Couple different columns

df = df0[['Date', 'Heure', 'tH (°C)']].copy()
df['timestamp'] = df['Date'] + ' ' + df['Heure']

# if you use it without `.copy()`:
# WARNING: A value is trying to be set on a copy of a slice from a DataFrame.

Deal with missing values NaN

👉 Check section "Missing values" in the note Data Overview.

Drop NaN values

Full reference of dropna is here.

# Drop rows which have any NaNs
df.dropna()

# Drop a column if all of its values are NaN
df.dropna(how='all', axis=1)

# Drop columns that have any NaNs
df.dropna(axis=1)

# Only keep columns having at least 90% non-NaN values
df.dropna(thresh=int(df.shape[0]*.9), axis=1)

# Only keep rows having >= 2 non-NaN values
df.dropna(thresh=2)

# Only consider some columns
df.dropna(subset=['col1', 'col2'])

# multi-index columns
df.dropna(subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])

# consider all columns under '1' and '2'
df.dropna(subset=df.loc[[], [1, 2]].columns)

Fill NaN with others

Check other methods of fillna here.

# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')

# Fill NaN with 99
df['col'] = df['col'].fillna(99)

# Fill NaN in each column with that column's mean / median
df.fillna(df.mean())

# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())

# Fill NaN with the mean of each row
m = df.mean(axis=1)
for col in df.columns:
    df.loc[:, col] = df.loc[:, col].fillna(m)

Do with conditions

np.where(if_this_condition_is_true, do_this, else_this)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar')  # example

Work with text data

There are a lot of methods for working with text data (pd.Series.str). We can use them in combination with regular expressions.
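
A small sketch of some pd.Series.str methods combined with regular expressions; the column names Name, number, first and last are hypothetical.

# normalize: lowercase and strip surrounding whitespace
df['Name'] = df['Name'].str.lower().str.strip()

# does the string contain a pattern? (regex)
mask = df['Name'].str.contains(r'\d+', regex=True, na=False)

# extract the first group matched by a regex
df['number'] = df['Name'].str.extract(r'(\d+)', expand=False)

# replace using a regex
df['Name'] = df['Name'].str.replace(r'\s+', ' ', regex=True)

# split into several columns
df[['first', 'last']] = df['Name'].str.split(' ', n=1, expand=True)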