Introduction To GroupBy Function in pandas
The GroupBy function and aggregation are some of the most frequently used operations in pandas, especially during exploratory data analysis (EDA), where comparing summary statistics across groups of data is common.
For example, suppose you have data on cities and want to analyse the total or average population of each state, or of each city within a state. Group-by and aggregation functions let you calculate such values over the rows that share a common state or city.
Grouping analysis can be thought of as having three parts:
1. Splitting the data into groups (e.g. groups of customer segments, product categories, etc.)
2. Applying a function to each group (e.g. mean or total sales of each customer segment)
3. Combining the results into a data structure showing the summary statistics
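The three steps above can be sketched by hand and compared with a single groupby call. This is only an illustration: the small sales table and its column names (segment, amount) are made up and are not part of the data used later in this post.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the three steps
sales = pd.DataFrame({
    'segment': ['retail', 'retail', 'wholesale', 'wholesale', 'online'],
    'amount': [100, 300, 1000, 500, 50],
})

# 1. Split: one sub-frame per distinct segment
pieces = {key: sales[sales['segment'] == key] for key in sales['segment'].unique()}

# 2. Apply: compute the mean amount of each piece
means = {key: piece['amount'].mean() for key, piece in pieces.items()}

# 3. Combine: collect the per-group results into one Series
manual = pd.Series(means).sort_index()

# groupby performs the same split, apply, and combine steps in one call
grouped = sales.groupby('segment')['amount'].mean()

print(manual)
print(grouped)
```

Both printed Series should hold the same per-segment means, which is the point of the comparison: groupby is a compact way of writing the manual version.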
Applying GroupBy Function to groups in pandas
- Aggregation
- Transformation
- Filtration
- Applying our own function
Methods of GroupBy Function in pandas
The following data frame is used to demonstrate the groupby and aggregation methods.
Code:-
import numpy as np
import pandas as pd
population = pd.DataFrame({'State':['Maharashtra','Maharashtra','Maharashtra',
                                    'Uttar Pradesh','Uttar Pradesh',
                                    'Madhya Pradesh','Madhya Pradesh','Madhya Pradesh',
                                    'Tamil Nadu','Tamil Nadu'],
                           'Cities':['Nagpur','Nagpur','Mumbai',
                                     'Lucknow','Kanpur',
                                     'Bhopal','Indore','Indore',
                                     'Chennai','Chennai'],
                           'Female Population': np.random.randint(100000,500000,10),
                           'Male Population': np.random.randint(100000,500000,10),
                           'Total Population': np.random.randint(200000,700000,10),
                           'literacy_rate_total': np.abs(np.random.randn(10)*40)})
# np.random.randint(low, high, size) generates random integers in [low, high)
# np.random.randn(n) generates n random numbers from the standard normal distribution
population # display the data frame
Output:-
| State | Cities | Female Population | Male Population | Total Population | literacy_rate_total |
0 | Maharashtra | Nagpur | 330731 | 238582 | 645743 | 67.694631 |
1 | Maharashtra | Nagpur | 339232 | 418329 | 572761 | 2.930283 |
2 | Maharashtra | Mumbai | 296622 | 318827 | 290637 | 11.248097 |
3 | Uttar Pradesh | Lucknow | 260631 | 312123 | 347929 | 0.849086 |
4 | Uttar Pradesh | Kanpur | 415753 | 438568 | 530431 | 40.991017 |
5 | Madhya Pradesh | Bhopal | 192435 | 145430 | 201615 | 62.159132 |
6 | Madhya Pradesh | Indore | 439355 | 368932 | 347257 | 35.704109 |
7 | Madhya Pradesh | Indore | 220944 | 476333 | 201672 | 14.285272 |
8 | Tamil Nadu | Chennai | 297248 | 309440 | 548959 | 20.368355 |
9 | Tamil Nadu | Chennai | 261947 | 174803 | 217022 | 22.798189 |
Groupby on the basis of a single categorical column
Example:- Use the population data above and group it by state.
# This only shows that a GroupBy object has been created for State
population.groupby('State')
Output:-
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000211A6521CF8>
# To see each group and the row labels it contains, use the .groups attribute
population.groupby('State').groups
Output:-
{'Madhya Pradesh': Int64Index([5, 6, 7], dtype='int64'),
'Maharashtra': Int64Index([0, 1, 2], dtype='int64'),
'Tamil Nadu': Int64Index([8, 9], dtype='int64'),
'Uttar Pradesh': Int64Index([3, 4], dtype='int64')}
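# Apply an aggregation function to calculate the total population of each state
# There are a number of aggregation functions (sum, mean, count, max, min, etc.)
# numeric_only=True leaves out the text column 'Cities', which has no meaningful sum
population.groupby('State').sum(numeric_only=True)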
Output:-
Female Population Male Population Total Population literacy_rate_total
State
Madhya Pradesh 778700 884884 1255651 46.009551
Maharashtra 783701 1154190 1551671 105.044499
Tamil Nadu 776934 355226 998136 45.691927
Uttar Pradesh 666343 865744 876842 137.275400
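The output shows the sum of every numeric column for each state. Because the data is generated randomly, the values change on every run, which is why these totals do not match the table printed earlier.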
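The other aggregation functions mentioned in the comment (mean, count, max, min) are called the same way as sum. The sketch below uses a small fixed table with made-up numbers so that the results are reproducible; it is not the random population data from this post.

```python
import pandas as pd

# Small fixed data set (hypothetical values) so the results are reproducible
df = pd.DataFrame({
    'State': ['Maharashtra', 'Maharashtra', 'Tamil Nadu', 'Tamil Nadu', 'Tamil Nadu'],
    'Total Population': [500, 300, 200, 400, 600],
})

# Group by state and select the single column to aggregate
by_state = df.groupby('State')['Total Population']

print(by_state.sum())    # total per state
print(by_state.mean())   # average per state
print(by_state.count())  # number of rows per state
print(by_state.max())    # largest value per state
print(by_state.min())    # smallest value per state
```

Each call returns a Series indexed by State, with one value per group.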
Groupby on the basis of two categorical columns
Example:-
# This only shows that a GroupBy object has been created for State and Cities
population.groupby(['State','Cities'])
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000211A6521CF8>
# .groups maps each group key to the row labels in that group (shown here for the State grouping)
population.groupby('State').groups
Output:-
{'Madhya Pradesh': Int64Index([5, 6, 7], dtype='int64'),
'Maharashtra': Int64Index([0, 1, 2], dtype='int64'),
'Tamil Nadu': Int64Index([8, 9], dtype='int64'),
'Uttar Pradesh': Int64Index([3, 4], dtype='int64')}
# Apply an aggregation function to calculate the mean for each city within each state
mean_pop = population.groupby(['State','Cities']).mean()
mean_pop
# The output shows each city's average female population, male population, total population, and literacy_rate_total
Output:-
Female Population Male Population Total Population literacy_rate_total
State Cities
Madhya Pradesh Bhopal 190726.0 290946.0 321898.0 6.824701
Indore 293987.0 296969.0 466876.5 19.592425
Maharashtra Mumbai 186423.0 325910.0 680144.0 36.866375
Nagpur 298639.0 414140.0 435763.5 34.089062
Tamil Nadu Chennai 388467.0 177613.0 499068.0 22.845964
Uttar Pradesh Kanpur 290262.0 425102.0 214901.0 35.262222
Lucknow 376081.0 440642.0 661941.0 102.013178
Loop over GroupBy groups
This part iterates over the groups of a GroupBy object and prints each group's name and rows.
Example:-
# iterate over the groups and print each group's name and rows
# create groups according to State; passing the column name as a string
# (not a one-element list) keeps each group name a plain string
grp = population.groupby('State')
for name, group in grp:
    print(name)
    print(group)
    print()
Output:-
Madhya Pradesh
State Cities Female Population Male Population \
5 Madhya Pradesh Bhopal 190726 290946
6 Madhya Pradesh Indore 186505 381920
7 Madhya Pradesh Indore 401469 212018
Total Population literacy_rate_total
5 321898 6.824701
6 628995 26.194098
7 304758 12.990752
Maharashtra
State Cities Female Population Male Population Total Population \
0 Maharashtra Nagpur 334934 357959 508852
1 Maharashtra Nagpur 262344 470321 362675
2 Maharashtra Mumbai 186423 325910 680144
literacy_rate_total
0 33.795318
1 34.382807
2 36.866375
Tamil Nadu
State Cities Female Population Male Population Total Population \
8 Tamil Nadu Chennai 394035 109944 515960
9 Tamil Nadu Chennai 382899 245282 482176
literacy_rate_total
8 17.705373
9 27.986555
Uttar Pradesh
State Cities Female Population Male Population \
3 Uttar Pradesh Lucknow 376081 440642
4 Uttar Pradesh Kanpur 290262 425102
Total Population literacy_rate_total
3 661941 102.013178
4 214901 35.262222
Example:-
# iterate over the groups and print each group's name and rows
# create groups according to State and Cities; each name is a (State, Cities) tuple
grp = population.groupby(['State','Cities'])
for name, group in grp:
    print(name)
    print(group)
    print()
Output:-
('Madhya Pradesh', 'Bhopal')
State Cities Female Population Male Population \
5 Madhya Pradesh Bhopal 190726 290946
Total Population literacy_rate_total
5 321898 6.824701
('Madhya Pradesh', 'Indore')
State Cities Female Population Male Population \
6 Madhya Pradesh Indore 186505 381920
7 Madhya Pradesh Indore 401469 212018
Total Population literacy_rate_total
6 628995 26.194098
7 304758 12.990752
('Maharashtra', 'Mumbai')
State Cities Female Population Male Population Total Population \
2 Maharashtra Mumbai 186423 325910 680144
literacy_rate_total
2 36.866375
('Maharashtra', 'Nagpur')
State Cities Female Population Male Population Total Population \
0 Maharashtra Nagpur 334934 357959 508852
1 Maharashtra Nagpur 262344 470321 362675
literacy_rate_total
0 33.795318
1 34.382807
('Tamil Nadu', 'Chennai')
State Cities Female Population Male Population Total Population \
8 Tamil Nadu Chennai 394035 109944 515960
9 Tamil Nadu Chennai 382899 245282 482176
literacy_rate_total
8 17.705373
9 27.986555
('Uttar Pradesh', 'Kanpur')
State Cities Female Population Male Population \
4 Uttar Pradesh Kanpur 290262 425102
Total Population literacy_rate_total
4 214901 35.262222
('Uttar Pradesh', 'Lucknow')
State Cities Female Population Male Population \
3 Uttar Pradesh Lucknow 376081 440642
Total Population literacy_rate_total
3 661941 102.013178
Selecting groups
To select a particular group from a GroupBy object, use the get_group method.
Example:- Select the group for Maharashtra
Code:-
# selecting a single group
grp = population.groupby('State')
grp.get_group('Maharashtra')
Output:-
State Cities Female Population Male Population Total Population literacy_rate_total
0 Maharashtra Nagpur 334934 357959 508852 33.795318
1 Maharashtra Nagpur 262344 470321 362675 34.382807
2 Maharashtra Mumbai 186423 325910 680144 36.866375
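Example:- Select the group for Lucknow in Uttar Pradesh
Code:-
# selecting a single group from a two-column grouping; the key is a (State, Cities) tuple
grp = population.groupby(['State','Cities'])
grp.get_group(('Uttar Pradesh', 'Lucknow'))
Output:-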
State Cities Female Population Male Population Total Population literacy_rate_total
3 Uttar Pradesh Lucknow 376081 440642 661941 102.013178
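As a rough sanity check, get_group should return the same rows as a plain boolean filter on the grouping column. The sketch below assumes a small fixed table with made-up numbers rather than the random population data.

```python
import pandas as pd

# Hypothetical fixed data so the example is reproducible
df = pd.DataFrame({
    'State': ['Maharashtra', 'Maharashtra', 'Uttar Pradesh', 'Uttar Pradesh'],
    'Cities': ['Nagpur', 'Mumbai', 'Lucknow', 'Kanpur'],
    'Total Population': [500, 300, 200, 400],
})

grp = df.groupby('State')

# Rows belonging to one group, selected by its key
selected = grp.get_group('Maharashtra')

# The same rows picked with a boolean mask on the grouping column
masked = df[df['State'] == 'Maharashtra']

print(selected)
print(masked)
```

get_group is convenient when the GroupBy object already exists; for a one-off selection the boolean mask is usually enough.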
Apply Functions to Groups
- Aggregation: calculates a summary statistic for each group, for example the sum, average, or minimum value.
- Transformation: performs a group-specific computation and returns an object indexed like the original, for example filling null values in each group with a value calculated from that group.
- Filtration: keeps or discards whole groups according to a group-wise computation that evaluates to a Boolean, for example filtering the data by each group's sum or mean.
Aggregation
Example:- Calculate the sum, mean, and minimum value of the female population of each state
Code:-
grp = population.groupby('State')
grp['Female Population'].agg(['sum', 'mean', 'min']) # select a particular column and pass the aggregation names as strings
Output:-
sum mean min
State
Madhya Pradesh 778700 259566.666667 186505
Maharashtra 783701 261233.666667 186423
Tamil Nadu 776934 388467.000000 382899
Uttar Pradesh 666343 333171.500000 290262
Example:- Apply different aggregation functions to different columns of the data frame
Code:-
# apply a different function to each column by passing
# a dictionary that maps column names to functions
grp = population.groupby('State')
grp.agg({'Female Population':'sum','Male Population':'sum','literacy_rate_total':'min'})
# only the columns named in the dictionary appear in the result
Output:-
Female Population Male Population literacy_rate_total
State
Madhya Pradesh 778700 884884 6.824701
Maharashtra 783701 1154190 33.795318
Tamil Nadu 776934 355226 17.705373
Uttar Pradesh 666343 865744 35.262222
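The list of operations near the start of this post also mentions applying our own function. One way to do that is to pass a user-defined function to agg. In the sketch below, value_range is a hypothetical helper written for this illustration, and the table uses made-up numbers.

```python
import pandas as pd

# Hypothetical fixed data so the example is reproducible
df = pd.DataFrame({
    'State': ['Maharashtra', 'Maharashtra', 'Tamil Nadu', 'Tamil Nadu'],
    'Female Population': [300, 500, 250, 200],
})

def value_range(values):
    """Spread between the largest and smallest value in a group (hypothetical helper)."""
    return values.max() - values.min()

grp = df.groupby('State')

# agg calls value_range once per group, passing that group's column as a Series
spread = grp['Female Population'].agg(value_range)
print(spread)
```

Any function that takes a Series and returns a single value can be used this way.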
Transformation
The transform method returns an object that has the same index (and therefore the same size) as the data being grouped, rather than one row per group.
Example:- Perform a group-specific computation
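A minimal sketch of such a computation, following the null-filling use case mentioned earlier: each missing literacy rate is replaced by the mean of its own state. The table and the new column name literacy_filled are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data with two missing values
df = pd.DataFrame({
    'State': ['Maharashtra', 'Maharashtra', 'Maharashtra', 'Tamil Nadu', 'Tamil Nadu'],
    'literacy_rate_total': [60.0, np.nan, 80.0, 50.0, np.nan],
})

grp = df.groupby('State')['literacy_rate_total']

# transform returns one value per original row, aligned on the original index;
# the mean ignores missing values
group_mean = grp.transform('mean')

# Fill each missing value with the mean of its own group
df['literacy_filled'] = df['literacy_rate_total'].fillna(group_mean)
print(df)
```

Because group_mean has the same index as df, it can be passed straight to fillna without any merging.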
Filtration
Example:- Keep only the cities that occur two or more times
grp = population.groupby('Cities')
grp.filter(lambda x: len(x) >= 2)
Output:-
State Cities Female Population Male Population Total Population literacy_rate_total
0 Maharashtra Nagpur 334934 357959 508852 33.795318
1 Maharashtra Nagpur 262344 470321 362675 34.382807
6 Madhya Pradesh Indore 186505 381920 628995 26.194098
7 Madhya Pradesh Indore 401469 212018 304758 12.990752
8 Tamil Nadu Chennai 394035 109944 515960 17.705373
9 Tamil Nadu Chennai 382899 245282 482176 27.986555
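The same filter method can also keep groups based on a group-wise sum or mean instead of the group size. The sketch below assumes a small fixed table and an arbitrary threshold of 700, both chosen only for illustration.

```python
import pandas as pd

# Hypothetical fixed data so the example is reproducible
df = pd.DataFrame({
    'State': ['Maharashtra', 'Maharashtra', 'Tamil Nadu', 'Uttar Pradesh', 'Uttar Pradesh'],
    'Total Population': [500, 300, 200, 400, 450],
})

grp = df.groupby('State')

# Keep only the rows whose state has a combined population above the threshold
threshold = 700  # arbitrary cut-off for illustration
large_states = grp.filter(lambda g: g['Total Population'].sum() > threshold)
print(large_states)
```

The function passed to filter must return a single True or False per group; all rows of a group are kept or dropped together.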
Conclusion
In this blog you saw how to group data by one or more categorical columns, how to inspect, iterate over, and select the groups, and how to apply aggregation, transformation, and filtration to draw inferences from them.