In this note, I use
df for DataFrame, s for Series.1import pandas as pd # import pandas package
2import numpy as np1dataquest_aio = '<https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/>'
2dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
3df = pd.read_csv(dataset_url) # read the data set
4df.head()Group
df by column Region and then selct the column Western Europe1df.groupby('Region').get_group('Western Europe') # returns a dfSelect just the
Happiness Score column and then find the mean1df.groupby('Region')['Happiness Score'].mean()
2# other methods: size, max, min, count1Region
2Australia and New Zealand 7.285000
3Central and Eastern Europe 5.332931
4Eastern Asia 5.626167
5Latin America and Caribbean 6.144682
6Middle East and Northern Africa 5.406900
7North America 7.273000
8Southeastern Asia 5.317444
9Southern Asia 4.580857
10Sub-Saharan Africa 4.202800
11Western Europe 6.689619
12Name: Happiness Score, dtype: float64Apply multiple/custom functions,
1def max_min(group):
2 return group.max() - group.min()
3
4df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()If you wanna apply different functions on different columns,
1df.groupby(['Country', 'Region']).agg({
2 'Happiness Rank': max_min,
3 'Happiness Score': ['min', 'max'],
4 'Standard Error': 'count'
5}).head(3)Or using
apply and lambda function,1orders.groupby('shoes').price.apply(lambda x: np.min(x, 25)).reset_index()1df.pivot_table(values=['GDP', 'City'], index='Region') # returns dfApply some functions,
1df.pivot_table(['GDP', 'City'], 'Region', aggfunc=[np.mean, np.max], margins=True)
2# margins shows the "All" rowMake values in one columns be columns in a new "pivot" table (ref)
1df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
2 'two'],
3 'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
4 'baz': [1, 2, 3, 4, 5, 6],
5 'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
6
7pivot_1 = df.pivot(index='foo', columns='bar', values='baz')
8pivot_2 = df.pivot(index='foo', columns='bar')['baz']
9pivot_3 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
10
11display_side_by_side(df, pivot_1, pivot_2, pivot_3)For one who wanna know
display_side_by_side, check Jupyter notebook.Contrary to
pivot, we now want to transform several columns into values of a single column (ref).1df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
2 'B': {0: 1, 1: 3, 2: 5},
3 'C': {0: 2, 1: 4, 2: 6}})
4
5df1 = pd.melt(df, id_vars=['A'], value_vars=['B'])
6df2 = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
7
8display_side_by_side(df, df1, df2)- Data Cleaning and Analysis on Dataquest.
- Transforming data with pandas on Dataquest.
- pandas official -- Group By: split-apply-combine