Data Aggregation

15-12-2020 / Edit on Github

In this note, I use df for DataFrame, s for Series.

Libraries #

import pandas as pd # import pandas package
import numpy as np

Dataframe #

dataquest_aio = 'https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/'
dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
df = pd.read_csv(dataset_url) # read the data set
df.head()
CountryRegionHappiness RankHappiness ScoreStandard Error
0SwitzerlandWestern Europe17.5870.03411
1IcelandWestern Europe27.5610.04884
2DenmarkWestern Europe37.5270.03328
3NorwayWestern Europe47.5220.03880
4CanadaNorth America57.4270.03553

Group dataset using groupby() #

Group df by column Region and then selct the column Western Europe,

df.groupby('Region').get_group('Western Europe') # returns a df
CountryRegionHappiness RankHappiness ScoreStandard Error
0SwitzerlandWestern Europe17.5870.03411
1IcelandWestern Europe27.5610.04884
2DenmarkWestern Europe37.5270.03328
3NorwayWestern Europe47.5220.03880
5FinlandWestern Europe67.4060.03140

Select just the Happiness Score column and then find the mean,

df.groupby('Region')['Happiness Score'].mean()
# other methods: size, max, min, count
Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64

Apply multiple/custom functions,

def max_min(group):
return group.max() - group.min()

df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()
Happiness RankHappiness Score
meanamaxmax_minmeanamaxmax_min
CountryRegion
AfghanistanSouthern Asia15315303.5753.5750.0
AlbaniaCentral Europe959504.9594.9590.0
AlgeriaMiddle Africa686805.6055.6050.0

If you wanna apply different functions on different columns,

df.groupby(['Country', 'Region']).agg({
'Happiness Rank': max_min,
'Happiness Score': ['min', 'max'],
'Standard Error': 'count'
}).head(3)
Happiness RankHappiness Score
max_minminmax
CountryRegion
AfghanistanSouthern Asia03.5753.575
AlbaniaCentral Europe04.9594.959
AlgeriaMiddle Africa05.6055.605

Or using apply and lambda function,

orders.groupby('shoes').price.apply(lambda x: np.min(x, 25)).reset_index()

Group using pivot_table() #

An example of pivotting by a single column
An example of pivotting by a single column[ref]

Group by Region (as an index) and choosing GDP and City columns,[ref]

df.pivot_table(values=['GDP', 'City'], index='Region') # returns df
Happiness RankStandard Error
Region
Australia and New Zealand9.50.037270
Central and Eastern Europe79.00.045208
Eastern Asia64.50.037225

Apply some functions,

df.pivot_table(['GDP', 'City'], 'Region', aggfunc=[np.mean, np.max], margins=True)
# margins shows the "All" row
meanamax
Happiness RankStandard ErrorHappiness RankStandard Error
Region
Australia and New Zealand9.50.037270100.04083
Central and Eastern Europe79.00.0452081340.06913
Eastern Asia64.50.0372251000.05051

Reorganizing df using pivot() #

An example of multi-column pivoting
An example of multi-column pivoting (ref)

Make values in one columns be columns in a new "pivot" table,[ref]

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

pivot_1 = df.pivot(index='foo', columns='bar', values='baz')
pivot_2 = df.pivot(index='foo', columns='bar')['baz']
pivot_3 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

display_side_by_side(df, pivot_1, pivot_2, pivot_3)
foobarbazzoo
0oneA1x
1oneB2y
2oneC3z
3twoA4q
4twoB5w
5twoC6t
barABC
foo
one123
two456
barABC
foo
one123
two456
bazzoo
barABCABC
foo
one123xyz
two456qwt

For one who wanna know display_side_by_side, check this note.

Change shape of df with melt() #

Contrary to pivot, we now want to transform several columns into values of a single column,[ref]

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})

df1 = pd.melt(df, id_vars=['A'], value_vars=['B'])
df2 = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

display_side_by_side(df, df1, df2)
ABC
0a12
1b34
2c56
Avariablevalue
0aB1
1bB3
2cB5
Avariablevalue
0aB1
1bB3
2cB5
3aC2
4bC4
5cC6

References #