
Data Processing & Cleaning

Anh-Thi Dinh
Data Science
In this note, I use df for a DataFrame and s for a Series.

Libraries

import pandas as pd
import numpy as np

Things that need to be checked

  1. CSV file:
    1. Are values separated by , or ;?
    2. Encoding.
    3. Timestamp type.
  2. Are indexes sorted?
  3. Are indexes continuous with step 1 (especially after using .dropna() or .drop_duplicates())?
  4. Are there NaN values? Drop them?
  5. Are there duplicates? Drop them?
  6. How many unique values?
  7. For 0/1 features, do they have only 2 unique values (0 and 1)?
  8. KDE plot to check the distribution of values.
  9. The number of columns?
  10. Unique labels?
  11. Time series:
    1. Time range.
    2. Time step.
    3. Timestamp's type.
    4. Timezone.
    5. Are timestamps monotonic?
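
A minimal sketch of these checks with pandas. The file name data.csv, the separator, and the column names ts and col are placeholders; the KDE plot also assumes matplotlib (and scipy) are installed.

import pandas as pd

# read the CSV: set the separator (',' or ';') and the encoding explicitly
df = pd.read_csv('data.csv', sep=';', encoding='utf-8', parse_dates=['ts'])

# shape, columns and dtypes (including the timestamp type)
print(df.shape)
print(df.dtypes)

# NaN values and duplicates
print(df.isna().sum())        # number of NaNs per column
print(df.duplicated().sum())  # number of duplicated rows

# unique values (a 0/1 feature should have exactly 2)
print(df['col'].nunique())
print(df['col'].unique())

# indexes sorted?
print(df.index.is_monotonic_increasing)

# time series: range, step, timezone, monotonic timestamps
print(df['ts'].min(), df['ts'].max())         # time range
print(df['ts'].diff().value_counts().head())  # time step(s)
print(df['ts'].dt.tz)                         # timezone (None if naive)
print(df['ts'].is_monotonic_increasing)

# distribution of a numeric column
df['col'].plot(kind='kde')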

Deal with columns

Remove or Keep some

# REMOVING COLUMNS
df.drop('New', axis=1, inplace=True)  # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)

# ONLY KEEP SOME
kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]

# ALL EXCEPT SOME
df[df.columns.difference(['b'])]

Rename columns

# IMPLICITLY
df.columns = ['Surname', 'Years', 'Grade', 'Location']

# EXPLICITLY
df.rename(columns={'Name': 'Surname', 'Ages': 'Years'}, inplace=True)

# A SPECIFIC COLUMN
data.rename(columns={'gdp': 'log(gdp)'}, inplace=True)

# RENAME INDEX COLUMN
df.index.name = 'new_name'

Make index

# COLUMN HAS UNIQUE VALUES?
df['col'].is_unique  # True if yes

# INDEX -> NORMAL COLUMN
df.reset_index(inplace=True)

# NORMAL COLUMN -> INDEX
df.set_index('column')
df.set_index(['col1', 'col2'])

Drop duplicates

👉 Check section "Duplicates" in the note Data Overview.

# check duplicates
df['Student'].duplicated().any()

# remove duplicates in some columns
df.drop_duplicates(['col1', 'col2'])
# use "ignore_index=True" if you want to reset the indexes to 0, 1, ..., n-1

Couple different columns

df = df0[['Date', 'Heure', 'tH (°C)']].copy()
df['timestamp'] = df['Date'] + ' ' + df['Heure']

# if you use it without `.copy()`:
# WARNING: A value is trying to be set on a copy of a slice from a DataFrame.

Deal with missing values NaN

👉 Check section "Missing values" in the note Data Overview.

Drop NaN values

Full reference of dropna is here.

# Drop rows which have any NaNs
df.dropna()

# Drop a column if all of its values are NaN
df.dropna(how='all', axis=1)

# Drop columns that have any NaNs
df.dropna(axis=1)

# Only keep columns having at least 90% non-NaN values
df.dropna(thresh=int(df.shape[0]*.9), axis=1)

# Only keep rows having >= 2 non-NaN values
df.dropna(thresh=2)

# Only consider some columns
df.dropna(subset=['col1', 'col2'])

# multi-index columns
df.dropna(subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])

# consider all columns under '1' and '2'
df.dropna(subset=df.loc[[], [1, 2]].columns)

Fill NaN with others

Check other methods of fillna here.

# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')

# Fill NaN with 99
df['col'] = df['col'].fillna(99)

# Fill NaN in each column with that column's mean / median
df.fillna(df.mean())

# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())

# Fill NaN with the mean of each row
m = df.mean(axis=1)
for col in df.columns:
    df.loc[:, col] = df.loc[:, col].fillna(m)

Do with conditions

np.where(if_this_condition_is_true, do_this, else_this)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar')  # example

Work with text data

There are a lot of methods for working with text data (pd.Series.str). We can use them in combination with regular expressions.
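
A small sketch of some pd.Series.str methods combined with regular expressions; the column names Name, number, first and last are hypothetical.

# normalize: lowercase and strip surrounding whitespace
df['Name'] = df['Name'].str.lower().str.strip()

# does the string contain a pattern? (regex)
mask = df['Name'].str.contains(r'\d+', regex=True, na=False)

# extract the first group matched by a regex
df['number'] = df['Name'].str.extract(r'(\d+)', expand=False)

# replace using a regex
df['Name'] = df['Name'].str.replace(r'\s+', ' ', regex=True)

# split into several columns
df[['first', 'last']] = df['Name'].str.split(' ', n=1, expand=True)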