# Some basic package imports
import os
import numpy as np
import pandas as pd
# Visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
import seaborn as sns
Intermediate Data Science
Data Aggregation and Group Operations
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours take place in Duke 209 – Office Hours Schedule
- Class Website
- Syllabus
Data Aggregation and Groups
Applying functions to separate groups of your data can be a critical component of data analysis. Often we have questions about how subgroups of the data differ, or we might want to compute pivot tables for reporting or visualization. We are going to dig deeper into the groupby()
function to understand what it is doing and what object it returns. We will also see how to compute pivot tables and cross-tabulations.
Grouping
When grouping we use split-apply-combine to describe group operations.
- SPLIT - we first split the data based on one or more keys. Think about categorical values in a single column.
- APPLY - now we apply a function to each of the data subsets that we split above.
- COMBINE - finally the results of these functions are combined into a single summary output or result object.
Here is some example data
= pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
df "key2" : pd.Series([1, 2, 1, 2, 1, None, 1],
="Int64"),
dtype"data1" : np.random.standard_normal(7),
"data2" : np.random.standard_normal(7)})
df
key1 | key2 | data1 | data2 | |
---|---|---|---|---|
0 | a | 1 | 0.667991 | 0.258252 |
1 | a | 2 | -0.143640 | 0.316452 |
2 | None | 1 | -0.502239 | -0.143485 |
3 | b | 2 | -0.785899 | 0.383478 |
4 | b | 1 | 0.693085 | -1.061101 |
5 | a | <NA> | -0.807092 | -0.394375 |
6 | None | 1 | -1.021366 | -1.936547 |
What if you wanted to calculate a statistic for the column data1, such as the sum or the mean, but on subsets or groups based on key1?
# First group the data
= df["data1"].groupby(df["key1"])
grouped
# What is this object
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fede00d34d0>
Notice that we can’t see our data here. This has created a Python object that contains all the information we need to apply a function, but it has not actually computed anything yet. We have prepared our split.
Now we will apply a function!
grouped.sum()
key1
a -0.282740
b -0.092813
Name: data1, dtype: float64
We have now applied a function and combined the results into a series.
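As an aside, the split and apply steps can also be written as one chained expression. A minimal sketch using the same df; it should give the same sums as above (the numbers depend on the random data):

# Equivalent one-liner: split data1 by key1 and sum each group
df.groupby("key1")["data1"].sum()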
More Examples
# Calculate the mean of the data in column data1
# Group by two keys
# Results in hierarchical index
"data1"].groupby([df["key1"], df["key2"]]).mean() df[
key1 key2
a 1 0.667991
2 -0.143640
b 1 0.693085
2 -0.785899
Name: data1, dtype: float64
# Calculate the means of both column data1 and data2
# Group by both key1 and key2
"key1"], df["key2"]]).mean() df.groupby([df[
data1 | data2 | ||
---|---|---|---|
key1 | key2 | ||
a | 1 | 0.667991 | 0.258252 |
2 | -0.143640 | 0.316452 | |
b | 1 | 0.693085 | -1.061101 |
2 | -0.785899 | 0.383478 |
# Group by key1 and then ask the size of each group
# This will automatically drop NaNs
df.groupby('key1').size()
key1
a 3
b 2
dtype: int64
df.groupby('key1', dropna=False).size()
key1
a 3
b 2
NaN 2
dtype: int64
# Count the number of nonnull values
df.groupby('key1').count()
key2 | data1 | data2 | |
---|---|---|---|
key1 | |||
a | 2 | 3 | 3 |
b | 2 | 2 | 2 |
You Try
Try applying some other functions that are available to the groupby object. See if you can figure out what each function is doing. To see all the possible functions, type grouped. and then hit Tab!
# Your code here
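If you want a nudge, here is one possibility (just a sketch; any of the aggregation methods from the Tab list would work):

# For example, describe() computes several summary statistics for each group
grouped.describe()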
Iterating over Groups
You can iterate over the group object returned from groupby().
= df["data1"].groupby(df["key1"])
grouped
for g in grouped:
print(g[0])
1]) display(g[
a
b
0 0.667991
1 -0.143640
5 -0.807092
Name: data1, dtype: float64
3 -0.785899
4 0.693085
Name: data1, dtype: float64
You can see that each thing in our grouped object is a tuple. The first entry in the tuple is the group name (or category); the second is a Series or a DataFrame, depending on the number of columns sent in.
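Because each item is a (name, group) pair, you can also unpack it directly in the for statement. A minimal sketch using the grouped object above:

# Unpack the group name and its data in the loop header
for name, group in grouped:
    print(name, group.mean())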
= df[["data1","data2"]].groupby(df["key1"])
grouped
for g in grouped:
print(g[0])
1]) display(g[
a
b
data1 | data2 | |
---|---|---|
0 | 0.667991 | 0.258252 |
1 | -0.143640 | 0.316452 |
5 | -0.807092 | -0.394375 |
data1 | data2 | |
---|---|---|
3 | -0.785899 | 0.383478 |
4 | 0.693085 | -1.061101 |
# If you group by two keys, your group names have two values.
= df[["data1","data2"]].groupby([df["key1"],df["key2"]])
grouped
for g in grouped:
print(g[0])
1]) display(g[
('a', np.int64(1))
('a', np.int64(2))
('b', np.int64(1))
('b', np.int64(2))
data1 | data2 | |
---|---|---|
0 | 0.667991 | 0.258252 |
data1 | data2 | |
---|---|---|
1 | -0.14364 | 0.316452 |
data1 | data2 | |
---|---|---|
4 | 0.693085 | -1.061101 |
data1 | data2 | |
---|---|---|
3 | -0.785899 | 0.383478 |
Column vs Row Grouping
By default, groups are created along axis='index', meaning the rows are split into groups based on labels in a column. But we could also break up the columns based on values in a row.
Here is an example where we group the data based on column names.
- We take the transpose of the dataframe.
- Here we send in a dictionary that maps the values found in the index to either key or data.
- The groupby now splits based on our old column names.
df.T
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
key1 | a | a | None | b | b | a | None |
key2 | 1 | 2 | 1 | 2 | 1 | <NA> | 1 |
data1 | 0.667991 | -0.14364 | -0.502239 | -0.785899 | 0.693085 | -0.807092 | -1.021366 |
data2 | 0.258252 | 0.316452 | -0.143485 | 0.383478 | -1.061101 | -0.394375 | -1.936547 |
# Here we map the names found in the index
# To new names - we will group on the new names.
= {"key1": "key", "key2": "key",
mapping "data1": "data", "data2": "data"}
= df.T.groupby(mapping)
grouped
for g in grouped:
print(g[0])
1]) display(g[
data
key
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
data1 | 0.667991 | -0.14364 | -0.502239 | -0.785899 | 0.693085 | -0.807092 | -1.021366 |
data2 | 0.258252 | 0.316452 | -0.143485 | 0.383478 | -1.061101 | -0.394375 | -1.936547 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
key1 | a | a | None | b | b | a | None |
key2 | 1 | 2 | 1 | 2 | 1 | <NA> | 1 |
Grouping with Functions
You can also use Python functions to specify groups.
For example, in the data below we have information about different people. What if you wanted to group the data set based on the length of each person's name? You can do that!
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])
people.iloc[2:3, [1, 2]] = np.nan  # Add a few NA values
people
a | b | c | d | e | |
---|---|---|---|---|---|
Joe | -0.022397 | -0.502856 | -0.004511 | -1.985540 | 0.186960 |
Steve | 0.437526 | 0.467422 | 0.195933 | -0.642973 | 1.054600 |
Wanda | -0.819018 | NaN | NaN | 1.719955 | -0.249791 |
Jill | 0.047468 | 0.051260 | 0.867313 | -0.664157 | -1.172044 |
Trey | 0.636158 | 2.495864 | -1.547932 | -0.117629 | -1.861339 |
grouped = people.groupby(len)

for g in grouped:
    print(g[0])
    display(g[1])
3
4
5
a | b | c | d | e | |
---|---|---|---|---|---|
Joe | -0.022397 | -0.502856 | -0.004511 | -1.98554 | 0.18696 |
a | b | c | d | e | |
---|---|---|---|---|---|
Jill | 0.047468 | 0.051260 | 0.867313 | -0.664157 | -1.172044 |
Trey | 0.636158 | 2.495864 | -1.547932 | -0.117629 | -1.861339 |
a | b | c | d | e | |
---|---|---|---|---|---|
Steve | 0.437526 | 0.467422 | 0.195933 | -0.642973 | 1.054600 |
Wanda | -0.819018 | NaN | NaN | 1.719955 | -0.249791 |
You Try
Read in the macro data - same as last class - just run the code below.
Then group the data by year and apply a function that makes sense with this data. Something like mean, max, min, etc…
Finally plot the resulting time-series data. For example, the average cpi over time.
= pd.read_csv("data/macrodata.csv")
macro = macro[["year","quarter","cpi", "m1", "tbilrate", "unemp"]] data
# Your code here
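One possible approach is sketched below; it assumes the data frame and column names loaded in the cell above, so adapt it to the function and column you choose.

# Sketch: average each column by year, then plot the average cpi over time
avg_by_year = data.groupby("year").mean()
avg_by_year["cpi"].plot()
plt.xlabel("Year")
plt.ylabel("Average CPI")
plt.show()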
Data Aggregation
Data aggregation is when you take an array (or list) of values and apply a transformation that produces a single scalar output. Think about a list of grades and then taking an average.
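For example, here is that grade idea as code; a whole list of values goes in and a single scalar comes out:

# Aggregation: many values in, one scalar out
grades = pd.Series([88, 92, 79, 95, 85])
grades.mean()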
Functions that are optimized for the groupby() method
Aggregation Function | Description | Example Usage
---|---|---
sum | Sum of values (ignores NaN) | df.groupby("col")["val"].sum()
mean | Mean (average) of values (ignores NaN) | df.groupby("col")["val"].mean()
median | Median (50th percentile) | df.groupby("col")["val"].median()
min | Minimum value | df.groupby("col")["val"].min()
max | Maximum value | df.groupby("col")["val"].max()
count | Count of non-NaN values | df.groupby("col")["val"].count()
size | Count of all rows (including NaN) | df.groupby("col").size()
std | Standard deviation (ddof=1 by default) | df.groupby("col")["val"].std()
var | Variance (ddof=1 by default) | df.groupby("col")["val"].var()
prod | Product of values | df.groupby("col")["val"].prod()
first | First non-NaN value in group | df.groupby("col")["val"].first()
last | Last non-NaN value in group | df.groupby("col")["val"].last()
nunique | Number of distinct values | df.groupby("col")["val"].nunique()
Functions that can be applied to groupby but are a bit slower
Aggregation Function | Description | Example Usage | Notes
---|---|---|---
mode | Most frequent value(s) in group | df.groupby("col")["val"].agg(pd.Series.mode) | Can return multiple values per group
skew | Sample skewness of distribution | df.groupby("col")["val"].skew() | Slower, especially on large groups
kurt / kurtosis | Kurtosis (tailedness of distribution) | df.groupby("col")["val"].kurt() | Not Cython-optimized
quantile | Value at given quantile (e.g. 0.25, 0.75) | df.groupby("col")["val"].quantile(0.25) | Flexible but slower
sem | Standard error of the mean | df.groupby("col")["val"].sem() | Based on std / sqrt(n)
mad | Mean absolute deviation | df.groupby("col")["val"].mad() | Not Cythonized
describe | Multiple summary stats at once | df.groupby("col")["val"].describe() | Returns count, mean, std, min, etc.
Custom functions | Any user-defined aggregation (via .apply) | df.groupby("col")["val"].apply(func) | Flexible but usually much slower
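As a quick illustration of one of the slower aggregations from the table, here is a sketch using the small df from earlier in these notes:

# Value at the 25th percentile of data1 within each key1 group
df.groupby("key1")["data1"].quantile(0.25)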
Sometimes you want to apply multiple functions to a single grouped object. One way to do this is with the .agg()
function. Let's read in the tips data we have seen before:
= pd.read_csv("data/tips.csv")
tips tips.head()
total_bill | tip | smoker | day | time | size | |
---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
Now let's say that we want to understand the tip percentage for different groups: smokers vs. nonsmokers, by day of the week. Maybe you want to know the average and standard deviation of the tip percentage, and you want to define a calculation of your own called max_to_min that calculates the difference between the max and min percentage.
# First we have to add the percentage to the data
"tip_pct"] = tips["tip"] / tips["total_bill"] tips[
# Now we can define our own custom function
def max_to_min(arr):
    return arr.max() - arr.min()

# Next let's group the data
grouped = tips.groupby(["day", "smoker"])

# And finally apply the functions using .agg()
functions = ["mean", "std", max_to_min]
agg_data = grouped['tip_pct'].agg(functions)
agg_data
mean | std | max_to_min | ||
---|---|---|---|---|
day | smoker | |||
Fri | No | 0.151650 | 0.028123 | 0.067349 |
Yes | 0.174783 | 0.051293 | 0.159925 | |
Sat | No | 0.158048 | 0.039767 | 0.235193 |
Yes | 0.147906 | 0.061375 | 0.290095 | |
Sun | No | 0.160113 | 0.042347 | 0.193226 |
Yes | 0.187250 | 0.154134 | 0.644685 | |
Thur | No | 0.160298 | 0.038774 | 0.193350 |
Yes | 0.163863 | 0.039389 | 0.151240 |
This is a case where I would use the pandas .plot() method to visualize the result.
# Graph the results
agg_data.plot.bar()
plt.grid()
plt.show()
# You can also return the data without the groups as an index
# just use the as_index=False option
grouped2 = tips.groupby(["day", "smoker"], as_index=False)
functions = ["mean", "std", max_to_min]
agg_data2 = grouped2['tip_pct'].agg(functions)
agg_data2
day | smoker | mean | std | max_to_min | |
---|---|---|---|---|---|
0 | Fri | No | 0.151650 | 0.028123 | 0.067349 |
1 | Fri | Yes | 0.174783 | 0.051293 | 0.159925 |
2 | Sat | No | 0.158048 | 0.039767 | 0.235193 |
3 | Sat | Yes | 0.147906 | 0.061375 | 0.290095 |
4 | Sun | No | 0.160113 | 0.042347 | 0.193226 |
5 | Sun | Yes | 0.187250 | 0.154134 | 0.644685 |
6 | Thur | No | 0.160298 | 0.038774 | 0.193350 |
7 | Thur | Yes | 0.163863 | 0.039389 | 0.151240 |
From this example hopefully you can see how powerful grouping and aggregating can be in quickly comparing parts of your dataset!
You Try
What happens when you run the following command using the tip data above? Can you predict what the output will be before you run the code? Then run the code and explain the results. Finally make a plot - your choice on the type.
grouped = tips.groupby(["day", "smoker","time"])
functions = ["count", "mean", "max"]
result = grouped[["tip_pct", "total_bill"]].agg(functions)
result
# Your code here
Quantile and Bucket Analysis
We have seen some functions that can help us to cut our data into buckets, pd.cut()
for example. Another similar function is pd.qcut()
which divides your data into sample quantiles. You can use these functions on grouped data.
Below we will use .qcut()
to get quartile categories and then group by the quartiles. Finally we can aggregate over that data.
quartiles = pd.qcut(tips['total_bill'], 4)
tips['quartiles'] = quartiles
tips.head()
total_bill | tip | smoker | day | time | size | tip_pct | quartiles | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 | (13.348, 17.795] |
1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 | (3.069, 13.348] |
2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 | (17.795, 24.127] |
3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 | (17.795, 24.127] |
4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 | (24.127, 50.81] |
grouped = tips['total_bill'].groupby(tips['quartiles'])
grouped.agg(["count", "max", "min", "mean"])
count | max | min | mean | |
---|---|---|---|---|
quartiles | ||||
(3.069, 13.348] | 61 | 13.28 | 3.07 | 10.691967 |
(13.348, 17.795] | 61 | 17.78 | 13.37 | 15.618689 |
(17.795, 24.127] | 61 | 24.08 | 17.81 | 20.498525 |
(24.127, 50.81] | 61 | 50.81 | 24.27 | 32.334590 |
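If you would rather work with quartile numbers than interval labels, pd.qcut can return integer codes instead; a short sketch:

# labels=False gives quartile numbers 0-3 instead of interval categories
quartile_codes = pd.qcut(tips['total_bill'], 4, labels=False)
tips.groupby(quartile_codes)['tip_pct'].mean()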
More Examples of using groupby()
Filling Missing Values
= ["Ohio", "New York", "Vermont", "Florida",
states "Oregon", "Nevada", "California", "Idaho"]
= ["East", "East", "East", "East",
location "West", "West", "West", "West"]
= pd.DataFrame({'Data':np.random.standard_normal(8),
data 'Location':location},
=states)
index
# Add some NaNs
for s in ["Vermont", "Nevada", "Idaho"]:
"Data"] = np.nan
data.loc[s, data
Data | Location | |
---|---|---|
Ohio | 0.849264 | East |
New York | 0.083113 | East |
Vermont | NaN | East |
Florida | 1.207980 | East |
Oregon | -0.085424 | West |
Nevada | NaN | West |
California | -0.739081 | West |
Idaho | NaN | West |
What if we wanted to fill these NaNs, but with different values based on whether they are in the East or the West? Maybe we want to average the East values and the West values separately and use those group averages to fill the NaNs.
# Define a function that will return the fill value
def fill_mean(group):
    return group.fillna(group.mean())

# We group by location
grouped = data.groupby('Location')
# Then fill in the NaNs with the averages of the group
grouped.apply(fill_mean)
Data | ||
---|---|---|
Location | ||
East | Ohio | 0.849264 |
New York | 0.083113 | |
Vermont | 0.713452 | |
Florida | 1.207980 | |
West | Oregon | -0.085424 |
Nevada | -0.412252 | |
California | -0.739081 | |
Idaho | -0.412252 |
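A related trick: apply passes each group with its .name attribute set to the group key, so you can fill each group with a predefined value instead of its mean. A sketch with made-up fill values:

# Hypothetical fill values, one per group name
fill_values = {"East": 0.5, "West": -1}

def fill_func(group):
    # group.name is the group key ("East" or "West")
    return group.fillna(fill_values[group.name])

data.groupby('Location').apply(fill_func)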
Groupwise Linear Regression
We talked about linear regression in DATA101 and DATA100. It is a way to fit a straight line to some given data.
We will use sklearn to do this. If you don’t have it installed, you should install it.
# !conda install -y scikit-learn
# Get some data
= pd.read_csv("data/stock_px.csv", parse_dates=True,
close_px =0)
index_col
# The .info() command gives you information about the data
# Including data types and number of nonnulls
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AAPL 2214 non-null float64
1 MSFT 2214 non-null float64
2 XOM 2214 non-null float64
3 SPX 2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px.head()
AAPL | MSFT | XOM | SPX | |
---|---|---|---|---|
2003-01-02 | 7.40 | 21.11 | 29.22 | 909.03 |
2003-01-03 | 7.45 | 21.14 | 29.24 | 908.59 |
2003-01-06 | 7.45 | 21.52 | 29.96 | 929.01 |
2003-01-07 | 7.43 | 21.93 | 28.95 | 922.93 |
2003-01-08 | 7.28 | 21.31 | 28.83 | 909.93 |
# Let's group the data by year
# First confirm that the index values are dates
first_index = close_px.index[0]
print(first_index)
print(type(first_index))
2003-01-02 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
# This is a timestamp object
# You can do first_index. and press tab to see all the options
print(first_index.year)
print(first_index.month)
print(first_index.month_name())
print(first_index.day)
print(first_index.day_name())
2003
1
January
2
Thursday
# So first write a function that will return the year
def get_year(x):
    return x.year

# Then group by the year
by_year = close_px.groupby(get_year)
from sklearn.linear_model import LinearRegression
# Define a linear regression and return intercept and slope
def regress(data, xvars, yvar):
    X = data[xvars]
    y = data[yvar]
    LM = LinearRegression()
    LM.fit(X, y)
    slope = LM.coef_[0]
    intercept = LM.intercept_
    return intercept, slope
# Apply the linear regression to the groups
by_year.apply(regress, yvar='AAPL', xvars=['SPX'])
2003 (-8.58984966958583, 0.018505966710274123)
2004 (-132.12116289682774, 0.1325654494610963)
2005 (-299.5951029082234, 0.28683118762773924)
2006 (-133.18691255199056, 0.15566846429728873)
2007 (-432.9175504440354, 0.37990617598041215)
2008 (-36.33814322777991, 0.1461565642803874)
2009 (-164.38487241406185, 0.3282529241664158)
2010 (-210.84981430613925, 0.41290045011855553)
2011 (603.8769113704016, -0.1938338932016932)
dtype: object
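Because regress returns a tuple, the result above is a Series of tuples. If you want a tidier table, one sketch is to expand it into a DataFrame:

# Expand the (intercept, slope) tuples into labeled columns
results = by_year.apply(regress, yvar='AAPL', xvars=['SPX'])
pd.DataFrame(results.tolist(), index=results.index, columns=['intercept', 'slope'])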
Pivot Tables and Cross-Tabulation
Pivot Table
Pivot tables are often found in spreadsheet programs and are a way to summarize data. We have seen the .pivot
operation as a way to wrangle the data. Here we will look at the pivot_table()
method. The results in many cases can be produced using the groupby function, but this acts as a shortcut and can add partial totals or margins to the data.
# Let's use the same tips data that we loaded in above
tips = pd.read_csv("data/tips.csv")
tips["tip_pct"] = tips["tip"] / tips["total_bill"]
tips.head()
total_bill | tip | smoker | day | time | size | tip_pct | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 |
3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 |
4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |
=["day", "smoker"],
tips.pivot_table(index=["size", "tip", "tip_pct", "total_bill"]) values
size | tip | tip_pct | total_bill | ||
---|---|---|---|---|---|
day | smoker | ||||
Fri | No | 2.250000 | 2.812500 | 0.151650 | 18.420000 |
Yes | 2.066667 | 2.714000 | 0.174783 | 16.813333 | |
Sat | No | 2.555556 | 3.102889 | 0.158048 | 19.661778 |
Yes | 2.476190 | 2.875476 | 0.147906 | 21.276667 | |
Sun | No | 2.929825 | 3.167895 | 0.160113 | 20.506667 |
Yes | 2.578947 | 3.516842 | 0.187250 | 24.120000 | |
Thur | No | 2.488889 | 2.673778 | 0.160298 | 17.113111 |
Yes | 2.352941 | 3.030000 | 0.163863 | 19.190588 |
NOTE: by default the pivot table returns the mean().
We could have done the same operation with groupby!
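For comparison, here is a sketch of the same table built with groupby; it should match the pivot table output above:

# groupby equivalent of the pivot table: mean of each column by day and smoker
tips.groupby(["day", "smoker"])[["size", "tip", "tip_pct", "total_bill"]].mean()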
Arguments you might want to pass into the pivot_table() function:
- index – the values that you are grouping by
- values – the numbers you are aggregating
- columns – categories to subset the columns - adding extra columns to the output
- margins=True – include the margins, i.e. partial totals for the whole
- aggfunc – aggregation function if you want something other than mean.
- fill_value – what you want to fill in if the computation runs into a NaN
Here are some examples:
=["time", "day"], columns="smoker",
tips.pivot_table(index=["tip_pct", "size"], margins=True) values
size | tip_pct | ||||||
---|---|---|---|---|---|---|---|
smoker | No | Yes | All | No | Yes | All | |
time | day | ||||||
Dinner | Fri | 2.000000 | 2.222222 | 2.166667 | 0.139622 | 0.165347 | 0.158916 |
Sat | 2.555556 | 2.476190 | 2.517241 | 0.158048 | 0.147906 | 0.153152 | |
Sun | 2.929825 | 2.578947 | 2.842105 | 0.160113 | 0.187250 | 0.166897 | |
Thur | 2.000000 | NaN | 2.000000 | 0.159744 | NaN | 0.159744 | |
Lunch | Fri | 3.000000 | 1.833333 | 2.000000 | 0.187735 | 0.188937 | 0.188765 |
Thur | 2.500000 | 2.352941 | 2.459016 | 0.160311 | 0.163863 | 0.161301 | |
All | 2.668874 | 2.408602 | 2.569672 | 0.159328 | 0.163196 | 0.160803 |
=["time", "day"], columns="smoker",
tips.pivot_table(index=["tip_pct", "size"], aggfunc="max") values
size | tip_pct | ||||
---|---|---|---|---|---|
smoker | No | Yes | No | Yes | |
time | day | ||||
Dinner | Fri | 2.0 | 4.0 | 0.155625 | 0.263480 |
Sat | 4.0 | 5.0 | 0.291990 | 0.325733 | |
Sun | 6.0 | 5.0 | 0.252672 | 0.710345 | |
Thur | 2.0 | NaN | 0.159744 | NaN | |
Lunch | Fri | 3.0 | 2.0 | 0.187735 | 0.259314 |
Thur | 6.0 | 4.0 | 0.266312 | 0.241255 |
=["time", "size", "smoker"], columns="day",
tips.pivot_table(index="tip_pct") values
day | Fri | Sat | Sun | Thur | ||
---|---|---|---|---|---|---|
time | size | smoker | ||||
Dinner | 1 | No | NaN | 0.137931 | NaN | NaN |
Yes | NaN | 0.325733 | NaN | NaN | ||
2 | No | 0.139622 | 0.162705 | 0.168859 | 0.159744 | |
Yes | 0.171297 | 0.148668 | 0.207893 | NaN | ||
3 | No | NaN | 0.154661 | 0.152663 | NaN | |
Yes | NaN | 0.144995 | 0.152660 | NaN | ||
4 | No | NaN | 0.150096 | 0.148143 | NaN | |
Yes | 0.117750 | 0.124515 | 0.193370 | NaN | ||
5 | No | NaN | NaN | 0.206928 | NaN | |
Yes | NaN | 0.106572 | 0.065660 | NaN | ||
6 | No | NaN | NaN | 0.103799 | NaN | |
Lunch | 1 | No | NaN | NaN | NaN | 0.181728 |
Yes | 0.223776 | NaN | NaN | NaN | ||
2 | No | NaN | NaN | NaN | 0.166005 | |
Yes | 0.181969 | NaN | NaN | 0.158843 | ||
3 | No | 0.187735 | NaN | NaN | 0.084246 | |
Yes | NaN | NaN | NaN | 0.204952 | ||
4 | No | NaN | NaN | NaN | 0.138919 | |
Yes | NaN | NaN | NaN | 0.155410 | ||
5 | No | NaN | NaN | NaN | 0.121389 | |
6 | No | NaN | NaN | NaN | 0.173706 |
=["time", "size", "smoker"], columns="day",
tips.pivot_table(index="tip_pct",fill_value=0) values
day | Fri | Sat | Sun | Thur | ||
---|---|---|---|---|---|---|
time | size | smoker | ||||
Dinner | 1 | No | 0.000000 | 0.137931 | 0.000000 | 0.000000 |
Yes | 0.000000 | 0.325733 | 0.000000 | 0.000000 | ||
2 | No | 0.139622 | 0.162705 | 0.168859 | 0.159744 | |
Yes | 0.171297 | 0.148668 | 0.207893 | 0.000000 | ||
3 | No | 0.000000 | 0.154661 | 0.152663 | 0.000000 | |
Yes | 0.000000 | 0.144995 | 0.152660 | 0.000000 | ||
4 | No | 0.000000 | 0.150096 | 0.148143 | 0.000000 | |
Yes | 0.117750 | 0.124515 | 0.193370 | 0.000000 | ||
5 | No | 0.000000 | 0.000000 | 0.206928 | 0.000000 | |
Yes | 0.000000 | 0.106572 | 0.065660 | 0.000000 | ||
6 | No | 0.000000 | 0.000000 | 0.103799 | 0.000000 | |
Lunch | 1 | No | 0.000000 | 0.000000 | 0.000000 | 0.181728 |
Yes | 0.223776 | 0.000000 | 0.000000 | 0.000000 | ||
2 | No | 0.000000 | 0.000000 | 0.000000 | 0.166005 | |
Yes | 0.181969 | 0.000000 | 0.000000 | 0.158843 | ||
3 | No | 0.187735 | 0.000000 | 0.000000 | 0.084246 | |
Yes | 0.000000 | 0.000000 | 0.000000 | 0.204952 | ||
4 | No | 0.000000 | 0.000000 | 0.000000 | 0.138919 | |
Yes | 0.000000 | 0.000000 | 0.000000 | 0.155410 | ||
5 | No | 0.000000 | 0.000000 | 0.000000 | 0.121389 | |
6 | No | 0.000000 | 0.000000 | 0.000000 | 0.173706 |
Crosstab
Cross-tabulation is a special case of a pivot table that computes group frequencies (counts of observations). You can very quickly reach into your data and get counts of the number of observations that fall into each subset.
= pd.crosstab([tips["time"], tips["day"]], tips["smoker"])
cdata cdata
smoker | No | Yes | |
---|---|---|---|
time | day | ||
Dinner | Fri | 3 | 9 |
Sat | 45 | 42 | |
Sun | 57 | 19 | |
Thur | 1 | 0 | |
Lunch | Fri | 1 | 6 |
Thur | 44 | 17 |
cdata.plot.bar()
plt.grid()
plt.ylabel('Customer Count')
plt.show()
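crosstab can also report proportions instead of raw counts via its normalize argument; a quick sketch:

# normalize='index' turns each row of counts into proportions that sum to 1
pd.crosstab([tips["time"], tips["day"]], tips["smoker"], normalize='index')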