Data Aggregation

In this note, I use df for DataFrame, s for Series.

Libraries

1import pandas as pd # import pandas package
2import numpy as np

Dataframe

1dataquest_aio = '<https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/>'
2dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
3df = pd.read_csv(dataset_url) # read the data set
4df.head()

ㅤ	Country	Region	Happiness Rank	Happiness Score	Standard Error
0	Switzerland	Western Europe	1	7.587	0.03411
1	Iceland	Western Europe	2	7.561	0.04884
2	Denmark	Western Europe	3	7.527	0.03328
3	Norway	Western Europe	4	7.522	0.03880
4	Canada	North America	5	7.427	0.03553

Group dataset using `groupby()`

Group df by column Region and then selct the column Western Europe

1df.groupby('Region').get_group('Western Europe') # returns a df

ㅤ	Country	Region	Happiness Rank	Happiness Score	Standard Error
0	Switzerland	Western Europe	1	7.587	0.03411
1	Iceland	Western Europe	2	7.561	0.04884
2	Denmark	Western Europe	3	7.527	0.03328
3	Norway	Western Europe	4	7.522	0.03880
5	Finland	Western Europe	6	7.406	0.03140

Select just the Happiness Score column and then find the mean

1df.groupby('Region')['Happiness Score'].mean()
2# other methods: size, max, min, count

1Region
2Australia and New Zealand          7.285000
3Central and Eastern Europe         5.332931
4Eastern Asia                       5.626167
5Latin America and Caribbean        6.144682
6Middle East and Northern Africa    5.406900
7North America                      7.273000
8Southeastern Asia                  5.317444
9Southern Asia                      4.580857
10Sub-Saharan Africa                 4.202800
11Western Europe                     6.689619
12Name: Happiness Score, dtype: float64

Apply multiple/custom functions,

1def max_min(group):
2  return group.max() - group.min()
3
4df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()

ㅤ	ㅤ	Happiness Rank	ㅤ	ㅤ	Happiness Score	ㅤ	ㅤ
ㅤ	ㅤ	mean	amax	max_min	mean	amax	max_min
Country	Region	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ
Afghanistan	Southern Asia	153	153	0	3.575	3.575	0.0
Albania	Central Europe	95	95	0	4.959	4.959	0.0
Algeria	Middle Africa	68	68	0	5.605	5.605	0.0

If you wanna apply different functions on different columns,

1df.groupby(['Country', 'Region']).agg({
2    'Happiness Rank': max_min,
3    'Happiness Score': ['min', 'max'],
4    'Standard Error': 'count'
5}).head(3)

ㅤ	ㅤ	Happiness Rank	Happiness Score	ㅤ
ㅤ	ㅤ	max_min	min	max
Country	Region	ㅤ	ㅤ	ㅤ
Afghanistan	Southern Asia	0	3.575	3.575
Albania	Central Europe	0	4.959	4.959
Algeria	Middle Africa	0	5.605	5.605

Or using apply and lambda function,

1orders.groupby('shoes').price.apply(lambda x: np.min(x, 25)).reset_index()

Group using `pivot_table()`

An example of pivotting by a single column (ref)

Group by Region (as an index) and choosing GDP and City columns (ref)

1df.pivot_table(values=['GDP', 'City'], index='Region') # returns df

ㅤ	Happiness Rank	Standard Error
Region	ㅤ	ㅤ
Australia and New Zealand	9.5	0.037270
Central and Eastern Europe	79.0	0.045208
Eastern Asia	64.5	0.037225

Apply some functions,

1df.pivot_table(['GDP', 'City'], 'Region', aggfunc=[np.mean, np.max], margins=True)
2# margins shows the "All" row

ㅤ	mean	ㅤ	amax	ㅤ
ㅤ	Happiness Rank	Standard Error	Happiness Rank	Standard Error
Region	ㅤ	ㅤ	ㅤ	ㅤ
Australia and New Zealand	9.5	0.037270	10	0.04083
Central and Eastern Europe	79.0	0.045208	134	0.06913
Eastern Asia	64.5	0.037225	100	0.05051

Reorganizing df using `pivot()`

An example of multi-column pivoting (ref)

Make values in one columns be columns in a new "pivot" table (ref)

1df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
2                           'two'],
3                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
4                   'baz': [1, 2, 3, 4, 5, 6],
5                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
6
7pivot_1 = df.pivot(index='foo', columns='bar', values='baz')
8pivot_2 = df.pivot(index='foo', columns='bar')['baz']
9pivot_3 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
10
11display_side_by_side(df, pivot_1, pivot_2, pivot_3)

ㅤ	foo	bar	baz	zoo
0	one	A	1	x
1	one	B	2	y
2	one	C	3	z
3	two	A	4	q
4	two	B	5	w
5	two	C	6	t

bar	A	B	C
foo	ㅤ	ㅤ	ㅤ
one	1	2	3
two	4	5	6

bar	A	B	C
foo	ㅤ	ㅤ	ㅤ
one	1	2	3
two	4	5	6

ㅤ	baz	ㅤ	ㅤ	zoo	ㅤ	ㅤ
bar	A	B	C	A	B	C
foo	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ
one	1	2	3	x	y	z
two	4	5	6	q	w	t

For one who wanna know display_side_by_side, check Jupyter notebook.

Change shape of df with `melt()`

Contrary to pivot, we now want to transform several columns into values of a single column (ref).

1df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
2                   'B': {0: 1, 1: 3, 2: 5},
3                   'C': {0: 2, 1: 4, 2: 6}})
4
5df1 = pd.melt(df, id_vars=['A'], value_vars=['B'])
6df2 = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
7
8display_side_by_side(df, df1, df2)