
Data Aggregation

Anh-Thi Dinh
Data Science
In this note, I use df for a DataFrame and s for a Series.

Libraries

Dataframe

|   | Country | Region | Happiness Rank | Happiness Score | Standard Error |
|---|---------|--------|----------------|-----------------|----------------|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.03880 |
| 4 | Canada | North America | 5 | 7.427 | 0.03553 |

Group dataset using groupby()

Group df by the column Region and then select the group Western Europe,

|   | Country | Region | Happiness Rank | Happiness Score | Standard Error |
|---|---------|--------|----------------|-----------------|----------------|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.03880 |
| 5 | Finland | Western Europe | 6 | 7.406 | 0.03140 |
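Besides get_group(), a groupby object also answers quick questions like "how many rows per group?" directly via size() or count(). A minimal sketch on a toy frame (the toy data below is made up for illustration, not from the happiness dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'Region': ['Western Europe', 'Western Europe', 'North America'],
    'Country': ['Switzerland', 'Iceland', 'Canada'],
})

# number of rows per region: returns a Series indexed by Region
counts = toy.groupby('Region').size()
print(counts['Western Europe'])  # → 2
```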
Select just the Happiness Score column and then find the mean
Apply multiple/custom functions,

| Country | Region | Happiness Rank (mean) | Happiness Rank (amax) | Happiness Rank (max_min) | Happiness Score (mean) | Happiness Score (amax) | Happiness Score (max_min) |
|---------|--------|----------------------|----------------------|--------------------------|------------------------|------------------------|----------------------------|
| Afghanistan | Southern Asia | 153 | 153 | 0 | 3.575 | 3.575 | 0.0 |
| Albania | Central and Eastern Europe | 95 | 95 | 0 | 4.959 | 4.959 | 0.0 |
| Algeria | Middle East and Northern Africa | 68 | 68 | 0 | 5.605 | 5.605 | 0.0 |
If you want to apply different functions to different columns,

| Country | Region | Happiness Rank (max_min) | Happiness Score (min) | Happiness Score (max) |
|---------|--------|--------------------------|-----------------------|------------------------|
| Afghanistan | Southern Asia | 0 | 3.575 | 3.575 |
| Albania | Central and Eastern Europe | 0 | 4.959 | 4.959 |
| Algeria | Middle East and Northern Africa | 0 | 5.605 | 5.605 |
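Since pandas 0.25, "named aggregation" lets you choose the output column names directly, which avoids the multi-level column headers above. A sketch on toy data (the frame and column names here are illustrative, not from the note):

```python
import pandas as pd

toy = pd.DataFrame({
    'Region': ['A', 'A', 'B'],
    'Score': [1.0, 3.0, 2.0],
})

# new_column_name=(source_column, function)
out = toy.groupby('Region').agg(
    score_min=('Score', 'min'),
    score_max=('Score', 'max'),
)
# out has flat columns ['score_min', 'score_max'], indexed by Region
```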
Or use apply with a lambda function,

Group using pivot_table()

An example of pivoting by a single column (ref)
Group by Region (as an index) and choose the GDP and City columns (ref)

| Region | Happiness Rank | Standard Error |
|--------|----------------|----------------|
| Australia and New Zealand | 9.5 | 0.037270 |
| Central and Eastern Europe | 79.0 | 0.045208 |
| Eastern Asia | 64.5 | 0.037225 |
Apply some functions,
| Region | Happiness Rank (mean) | Standard Error (mean) | Happiness Rank (amax) | Standard Error (amax) |
|--------|----------------------|-----------------------|----------------------|------------------------|
| Australia and New Zealand | 9.5 | 0.037270 | 10 | 0.04083 |
| Central and Eastern Europe | 79.0 | 0.045208 | 134 | 0.06913 |
| Eastern Asia | 64.5 | 0.037225 | 100 | 0.05051 |
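The margins=True option used above adds an "All" row holding the aggregate over the whole frame. A minimal sketch on toy data (the frame below is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'Region': ['A', 'A', 'B'],
    'Rank': [1, 3, 2],
})

# mean Rank per Region, plus an extra 'All' row with the overall mean
pt = toy.pivot_table(values='Rank', index='Region',
                     aggfunc='mean', margins=True)
print(pt.loc['All', 'Rank'])  # → 2.0
```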

Reorganizing df using pivot()

An example of multi-column pivoting (ref)
Make the values of one column become the columns of a new "pivot" table (ref)
|   | foo | bar | baz | zoo |
|---|-----|-----|-----|-----|
| 0 | one | A | 1 | x |
| 1 | one | B | 2 | y |
| 2 | one | C | 3 | z |
| 3 | two | A | 4 | q |
| 4 | two | B | 5 | w |
| 5 | two | C | 6 | t |
pivot_1 and pivot_2 give the same result (the columns are the values of bar):

| foo | A | B | C |
|-----|---|---|---|
| one | 1 | 2 | 3 |
| two | 4 | 5 | 6 |

| foo | A | B | C |
|-----|---|---|---|
| one | 1 | 2 | 3 |
| two | 4 | 5 | 6 |

pivot_3 has a column MultiIndex: baz and zoo, each split by bar:

| foo | baz (A) | baz (B) | baz (C) | zoo (A) | zoo (B) | zoo (C) |
|-----|---------|---------|---------|---------|---------|---------|
| one | 1 | 2 | 3 | x | y | z |
| two | 4 | 5 | 6 | q | w | t |
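One practical difference worth knowing: pivot() raises an error when an (index, columns) pair appears more than once, while pivot_table() resolves duplicates with its aggfunc. A sketch on toy data (made up for illustration):

```python
import pandas as pd

dup = pd.DataFrame({
    'foo': ['one', 'one'],
    'bar': ['A', 'A'],   # duplicated (foo, bar) pair
    'baz': [1, 3],
})

# dup.pivot(index='foo', columns='bar', values='baz')  # raises ValueError
pt = dup.pivot_table(index='foo', columns='bar', values='baz',
                     aggfunc='mean')  # duplicates are averaged
print(pt.loc['one', 'A'])  # → 2.0
```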
If you want to know how display_side_by_side works, check the Jupyter notebook.

Change shape of df with melt()

In contrast to pivot, we now want to transform several columns into the values of a single column (ref).
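By default, melt names the new columns 'variable' and 'value'; they can be renamed with var_name and value_name. A sketch on toy data (the frame below is illustrative):

```python
import pandas as pd

wide = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 3], 'C': [2, 4]})

# columns B and C are stacked into 'col', their values into 'val'
long = pd.melt(wide, id_vars=['A'], value_vars=['B', 'C'],
               var_name='col', value_name='val')
# long has columns ['A', 'col', 'val'] and 4 rows
```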

References

  • Data Cleaning and Analysis on Dataquest.
  • Transforming data with pandas on Dataquest.
  • Group By: split-apply-combine in the official pandas documentation.
```python
import pandas as pd  # import pandas package
import numpy as np
```

```python
dataquest_aio = 'https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/'
dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
df = pd.read_csv(dataset_url)  # read the dataset
df.head()
```
```python
df.groupby('Region').get_group('Western Europe')  # returns a df
```
```python
df.groupby('Region')['Happiness Score'].mean()
# other methods: size, max, min, count
```

```
Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64
```
```python
def max_min(group):
    return group.max() - group.min()

df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()
```

```python
df.groupby(['Country', 'Region']).agg({
    'Happiness Rank': max_min,
    'Happiness Score': ['min', 'max'],
    'Standard Error': 'count'
}).head(3)
```
```python
# 25th percentile of price within each group of shoes
orders.groupby('shoes').price.apply(lambda x: np.percentile(x, 25)).reset_index()
```
```python
df.pivot_table(values=['GDP', 'City'], index='Region')  # returns df
```

```python
df.pivot_table(['GDP', 'City'], 'Region', aggfunc=[np.mean, np.max], margins=True)
# margins shows the "All" row
```
```python
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

pivot_1 = df.pivot(index='foo', columns='bar', values='baz')
pivot_2 = df.pivot(index='foo', columns='bar')['baz']
pivot_3 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

display_side_by_side(df, pivot_1, pivot_2, pivot_3)
```
```python
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})

df1 = pd.melt(df, id_vars=['A'], value_vars=['B'])
df2 = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

display_side_by_side(df, df1, df2)
```