graph LR subgraph Data Science Initial Exploration A((Subject Mater Understanding <br> Define the Question))-->B((Data Gathering <br> Mining <br> Ethical Qusetions )) B-->C((Data Cleaning <br> Wrangling)) C-->D((Data Exploration <br> Visualization <br> Exploratory Data Analysis)) end subgraph Data Science Modeling D-->E((Feature Engineering <br> Data Preparation)) E-->F((Predictive Modeling)) F-->G((Data Visualization <br> Communication <br> Sharing)) end G-->A style C fill:#f9f,stroke:#333,stroke-width:4px
Introduction to Data Science
Data Wrangling Continued
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
Please come to office hours to get help!
Day 6 Assignment - same drill.
- Make sure Pull any new content from the class repo - then Copy it over into your working diretory.
- Open the file Day3-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
—————————–
The Data Science Lifecycle
Today we will continue talking about Data Wrangling.
Last Class - Definitions
Make sure you can define and come up with examples for
- Raw Data
- Tidy Data
- Displayed Data
- Summarized Data
Last Class - Code
We learned how to select subsets of the data and sort the data. You should be able to:
- Find all the column names and print them.
- Slice the data by selecting a subset of the columns
- Sort the data either ascending or descending
- Mask the data using logical operators
This class
Our goal is to continue learning commands that helps us retrieve pieces of the data. Today we will:
- Learn how to do more advanced masking
- Find count unique values with value_counts() and drop_duplicates()
- Add new columns
- Use group_by() to do statistics on groups
This lab follows the Data Science in a Box lab “Working with a single data frame” by Mine Çetinkaya-Rundel. The content has been updated for our class and translated to Python by Joanna Bieri.
Load the Data and look at the basic info
Same data as last class!
- Data from two hotels: one resort and one city hotel
- Observations: Each row represents a hotel booking
- Goal for original data collection: Development of prediction models to classify a hotel booking’s likelihood to be canceled Antonia et al., 2019
Data Information: Tidy Tuesday
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
= 'colab'
pio.renderers.defaule
from itables import show
= 'https://joannabieri.com/introdatascience/data/hotels.csv'
file_location = pd.read_csv(file_location) DF_raw_hotels
show(DF_raw_hotels)
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.1.4 from the internet...
(need help?) |
# I always look at the column names - they have to be spelled exactly!
= list(DF_raw_hotels.keys())
columns_list print(columns_list)
['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']
More advanced boolean masks - combining masks
Last time we learned about boolean masks. We were able to use a mask to only show certain rows or columns in the data.
Basic Operators
Operator | Definition |
---|---|
< | less than |
> | greater than |
<= | less than or equal to |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
Advanced Operators
Operator | Definition |
---|---|
and | check if two things are both true |
or | check if one of two things is true |
in | checks if something is in another thing |
! | not checks if something is false |
In an effort to keep things simple, we learned how to apply these masks one at a time.
Let’s answer the question:
- Show the results for Resort Hotel where the reservation was canceled (is_canceled=1)
= (DF_raw_hotels['hotel']=='Resort Hotel')
mask1 =DF_raw_hotels[mask1]
DF_hotels_mask= (DF_hotels_mask['is_canceled']==1)
mask2 = DF_hotels_mask[mask2]
DF_final DF_final
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 82.00 | 0 | 1 | Canceled | 2015-05-06 |
9 | Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | ... | No Deposit | 15.0 | NaN | 0 | Transient | 105.50 | 0 | 0 | Canceled | 2015-04-22 |
10 | Resort Hotel | 1 | 23 | 2015 | July | 27 | 1 | 0 | 4 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 123.00 | 0 | 0 | Canceled | 2015-06-23 |
27 | Resort Hotel | 1 | 60 | 2015 | July | 27 | 1 | 2 | 5 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 107.00 | 0 | 2 | Canceled | 2015-05-11 |
32 | Resort Hotel | 1 | 96 | 2015 | July | 27 | 1 | 2 | 8 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 108.30 | 0 | 2 | Canceled | 2015-05-29 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32682 | Resort Hotel | 1 | 0 | 2017 | March | 12 | 20 | 1 | 0 | 1 | ... | No Deposit | NaN | 88.0 | 0 | Transient | 35.00 | 0 | 0 | Canceled | 2017-03-20 |
32684 | Resort Hotel | 1 | 0 | 2017 | March | 12 | 20 | 1 | 0 | 1 | ... | No Deposit | NaN | 88.0 | 0 | Transient | 35.00 | 0 | 0 | Canceled | 2017-03-20 |
32835 | Resort Hotel | 1 | 20 | 2017 | February | 6 | 7 | 0 | 1 | 2 | ... | No Deposit | 250.0 | NaN | 0 | Transient | 45.00 | 0 | 1 | Canceled | 2017-02-07 |
34707 | Resort Hotel | 1 | 0 | 2017 | August | 34 | 24 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 190.00 | 0 | 0 | Canceled | 2017-08-24 |
36837 | Resort Hotel | 1 | 258 | 2017 | May | 22 | 31 | 2 | 4 | 2 | ... | No Deposit | 143.0 | NaN | 0 | Transient | 68.95 | 0 | 1 | No-Show | 2017-05-31 |
11122 rows × 32 columns
Example of Combining Masks
Once you understand the above code…
You can merge this command into one line using:
Operator | Definition |
---|---|
& | check if two things are both true (bitwise = works on multiple values) |
| | check if one of two things is true (bitwise = works on multiple values) |
The code below does the exact same operation - combining the two conditions into just one mask
# Mask hotel='Resort Hotel' and 'is_canceled'=1
= (DF_raw_hotels['hotel']=='Resort Hotel') & (DF_raw_hotels['is_canceled']==1)
mask DF_raw_hotels[mask]
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 82.00 | 0 | 1 | Canceled | 2015-05-06 |
9 | Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | ... | No Deposit | 15.0 | NaN | 0 | Transient | 105.50 | 0 | 0 | Canceled | 2015-04-22 |
10 | Resort Hotel | 1 | 23 | 2015 | July | 27 | 1 | 0 | 4 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 123.00 | 0 | 0 | Canceled | 2015-06-23 |
27 | Resort Hotel | 1 | 60 | 2015 | July | 27 | 1 | 2 | 5 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 107.00 | 0 | 2 | Canceled | 2015-05-11 |
32 | Resort Hotel | 1 | 96 | 2015 | July | 27 | 1 | 2 | 8 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 108.30 | 0 | 2 | Canceled | 2015-05-29 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32682 | Resort Hotel | 1 | 0 | 2017 | March | 12 | 20 | 1 | 0 | 1 | ... | No Deposit | NaN | 88.0 | 0 | Transient | 35.00 | 0 | 0 | Canceled | 2017-03-20 |
32684 | Resort Hotel | 1 | 0 | 2017 | March | 12 | 20 | 1 | 0 | 1 | ... | No Deposit | NaN | 88.0 | 0 | Transient | 35.00 | 0 | 0 | Canceled | 2017-03-20 |
32835 | Resort Hotel | 1 | 20 | 2017 | February | 6 | 7 | 0 | 1 | 2 | ... | No Deposit | 250.0 | NaN | 0 | Transient | 45.00 | 0 | 1 | Canceled | 2017-02-07 |
34707 | Resort Hotel | 1 | 0 | 2017 | August | 34 | 24 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 190.00 | 0 | 0 | Canceled | 2017-08-24 |
36837 | Resort Hotel | 1 | 258 | 2017 | May | 22 | 31 | 2 | 4 | 2 | ... | No Deposit | 143.0 | NaN | 0 | Transient | 68.95 | 0 | 1 | No-Show | 2017-05-31 |
11122 rows × 32 columns
Another Example of Combining Masks
This time we will ask:
- Show only results for visitors from USA or GBR.
= (DF_raw_hotels['country']=='USA') | (DF_raw_hotels['country']=='GBR')
mask DF_raw_hotels[mask]
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 |
5 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 |
12 | Resort Hotel | 0 | 68 | 2015 | July | 27 | 1 | 0 | 4 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 97.00 | 0 | 3 | Check-Out | 2015-07-05 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119358 | City Hotel | 0 | 135 | 2017 | August | 35 | 30 | 1 | 4 | 2 | ... | No Deposit | 14.0 | NaN | 0 | Transient | 143.10 | 0 | 1 | Check-Out | 2017-09-04 |
119369 | City Hotel | 0 | 207 | 2017 | August | 34 | 25 | 3 | 7 | 2 | ... | No Deposit | 14.0 | NaN | 0 | Transient | 96.25 | 0 | 2 | Check-Out | 2017-09-04 |
119378 | City Hotel | 0 | 247 | 2017 | August | 35 | 31 | 1 | 3 | 2 | ... | No Deposit | 42.0 | NaN | 0 | Transient | 86.85 | 0 | 0 | Check-Out | 2017-09-04 |
119379 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 1 | 3 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 155.00 | 0 | 1 | Check-Out | 2017-09-04 |
119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 89.0 | NaN | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 |
14226 rows × 32 columns
Finding unique values
Often we want to see how many different categories are contained in a categorical column. This lets us see what the choices are for that entry.
Unique in one column
For example, in the hotel data, maybe we want to see what possible values the column ‘market_segment’ can take. Sometimes this helps us to understand a column and it always lets us see what values are in the data set. We will use .drop_duplicates() to see the unique values.
NOTICE - I am building from things we learned in our past classes. Here are I selecting just one column
my_columns = ['market_segment']
then grabbing that column of the data frame
DF_raw_hotels[my_columns]
then applying the drop_duplicates() command
DF_raw_hotels[my_columns].drop_duplicates()
= ['market_segment']
my_columns DF_raw_hotels[my_columns].drop_duplicates()
market_segment | |
---|---|
0 | Direct |
3 | Corporate |
4 | Online TA |
9 | Offline TA/TO |
125 | Complementary |
413 | Groups |
40600 | Undefined |
49013 | Aviation |
Unique across two columns
We can expand the number of columns we care about and the command .drop_duplicates() will look at BOTH columns for duplicates. In the code below you see that we see duplicates if we read down just one column, but there are no duplicates if we consider both columns.
= ['market_segment','customer_type']
my_columns DF_raw_hotels[my_columns].drop_duplicates()
market_segment | customer_type | |
---|---|---|
0 | Direct | Transient |
3 | Corporate | Transient |
4 | Online TA | Transient |
9 | Offline TA/TO | Transient |
16 | Offline TA/TO | Contract |
47 | Offline TA/TO | Transient-Party |
125 | Complementary | Transient |
127 | Online TA | Transient-Party |
260 | Direct | Transient-Party |
413 | Groups | Transient-Party |
417 | Groups | Transient |
539 | Online TA | Group |
1530 | Corporate | Transient-Party |
1539 | Direct | Group |
1587 | Offline TA/TO | Group |
2262 | Corporate | Contract |
2263 | Direct | Contract |
2698 | Groups | Contract |
9511 | Complementary | Transient-Party |
14051 | Complementary | Group |
14654 | Groups | Group |
15141 | Corporate | Group |
15261 | Complementary | Contract |
15506 | Online TA | Contract |
40600 | Undefined | Transient-Party |
49013 | Aviation | Transient-Party |
49372 | Aviation | Transient |
112102 | Aviation | Group |
Counting unique values in one column
Sometimes instead of a list of unique values we want to see counts of how often those labels were seen in the data. This is called: Frequency, Value Counts, Frequency Counts, etc.
In Python we can use the .value_counts() command. We have seen this before.
= ['market_segment']
my_columns DF_raw_hotels[my_columns].value_counts()
market_segment
Online TA 56477
Offline TA/TO 24219
Groups 19811
Direct 12606
Corporate 5295
Complementary 743
Aviation 237
Undefined 2
Name: count, dtype: int64
Counting unique values in two columns
BUT did you know that you can do these kinds of counts in more than one column!?!? All you have to do is select multiple columns and the value count will happen across those columns.
Notice below that now we separate Online TA into four subgroups:
Online TA Transient 51299
Online TA Transient-Party 3513
Online TA Contract 1486
Online TA Group 179
= ['market_segment','customer_type']
my_columns DF_raw_hotels[my_columns].value_counts()
market_segment customer_type
Online TA Transient 51299
Offline TA/TO Transient 14054
Direct Transient 11336
Groups Transient-Party 10633
Transient 8427
Offline TA/TO Transient-Party 8137
Corporate Transient 3576
Online TA Transient-Party 3513
Offline TA/TO Contract 1817
Corporate Transient-Party 1668
Online TA Contract 1486
Direct Transient-Party 1122
Groups Contract 735
Complementary Transient 703
Aviation Transient 218
Offline TA/TO Group 211
Online TA Group 179
Direct Group 134
Complementary Transient-Party 32
Corporate Group 29
Contract 22
Aviation Transient-Party 17
Groups Group 16
Direct Contract 14
Complementary Group 6
Aviation Group 2
Complementary Contract 2
Undefined Transient-Party 2
Name: count, dtype: int64
Counting unique values - Unsorted
Sometimes the way that .value_counts() sorts the data in descending order, actually makes the data harder to read! You can use the flag sort=False to stop this sorting.
= ['market_segment','customer_type']
my_columns =False) DF_raw_hotels[my_columns].value_counts(sort
market_segment customer_type
Aviation Group 2
Transient 218
Transient-Party 17
Complementary Contract 2
Group 6
Transient 703
Transient-Party 32
Corporate Contract 22
Group 29
Transient 3576
Transient-Party 1668
Direct Contract 14
Group 134
Transient 11336
Transient-Party 1122
Groups Contract 735
Group 16
Transient 8427
Transient-Party 10633
Offline TA/TO Contract 1817
Group 211
Transient 14054
Transient-Party 8137
Online TA Contract 1486
Group 179
Transient 51299
Transient-Party 3513
Undefined Transient-Party 2
Name: count, dtype: int64
Counting unique values - Ascending
You can also ask to sort the data ascending with ascending=True
= ['market_segment','customer_type']
my_columns =True) DF_raw_hotels[my_columns].value_counts(ascending
market_segment customer_type
Aviation Group 2
Complementary Contract 2
Undefined Transient-Party 2
Complementary Group 6
Direct Contract 14
Groups Group 16
Aviation Transient-Party 17
Corporate Contract 22
Group 29
Complementary Transient-Party 32
Direct Group 134
Online TA Group 179
Offline TA/TO Group 211
Aviation Transient 218
Complementary Transient 703
Groups Contract 735
Direct Transient-Party 1122
Online TA Contract 1486
Corporate Transient-Party 1668
Offline TA/TO Contract 1817
Online TA Transient-Party 3513
Corporate Transient 3576
Offline TA/TO Transient-Party 8137
Groups Transient 8427
Transient-Party 10633
Direct Transient 11336
Offline TA/TO Transient 14054
Online TA Transient 51299
Name: count, dtype: int64
Saving Unique Values to a Data Frame
You see that the code above returns something that is not so pretty to look at, and for me not as easy to work with. So most of the time you will want to convert the results of a value count back into a data frame. Here is a break down of what this code does:
First we do the value counts like normal and then save the information in the variable my_counts
my_columns = ['market_segment','customer_type'] my_counts = DF_raw_hotels[my_columns].value_counts()
Then we do something call a rest and rename. To do this we add the command .reset_index().rename(columns={“index”: “value”, 0: “count”}). In this class you can just copy and past this after the variable name.
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
Now you will see that when we show the value counts they look much nicer AND we will be able to do calculations with these columns.
= ['market_segment','customer_type']
my_columns = DF_raw_hotels[my_columns].value_counts()
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
my_counts show(my_counts)
market_segment | customer_type | count |
---|---|---|
Loading ITables v2.1.4 from the internet...
(need help?) |
Adding a column to a Data Frame
In many cases your data will have information across multiple columns that you will want to combine. You can do this by adding a new column to your data frame.
In the raw data there are two columns ‘children’ and ‘babies’, but what if I actually wanted to know the total number of little ones that families were traveling with. In other words, I wanted to combine the information about children and babies (add it up). We do this my adding a new column!
First you call your data frame and put in a name that is not already in the columns:
DF_raw_hotels['little_ones']
Then you set that equal to the calculation you want to do
= DF_raw_hotels['children'] + DF_raw_hotels['babies']
This will save the new calculation in a column with the name ‘little_ones’
Scroll all the way to the right and you will see the new variable (column)
'little_ones'] = DF_raw_hotels['children'] + DF_raw_hotels['babies'] DF_raw_hotels[
DF_raw_hotels
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | little_ones | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | NaN | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 | 0.0 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | NaN | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 | 0.0 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | NaN | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 | 0.0 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 304.0 | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 | 0.0 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | 240.0 | NaN | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | 394.0 | NaN | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 2017-09-06 | 0.0 |
119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | 9.0 | NaN | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 2017-09-07 | 0.0 |
119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | 9.0 | NaN | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 2017-09-07 | 0.0 |
119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | 89.0 | NaN | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 | 0.0 |
119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | 9.0 | NaN | 0 | Transient | 151.20 | 0 | 2 | Check-Out | 2017-09-07 | 0.0 |
119390 rows × 33 columns
Q Count up the number of little_ones in different hotel types (Hint - use value_counts() with hotel and little_ones as your columns.)
Q Add a new column that calculates the total number of humans (all_humans) in the room (adults+children+babies) (Hint - you can copy and paste the code above where we calculated little_ones and add on another thing)
Q Can you figure out what the code below is doing - comment on each line
remember you can add a comment using the hashtag #
# PLEASE ADD YOUR COMMENTS!
'little_ones'] = DF_raw_hotels['children'] + DF_raw_hotels['babies']
DF_raw_hotels[= ['hotel','little_ones']
my_columns = DF_raw_hotels[my_columns].value_counts()
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
my_counts 'proportion']=my_counts['count']/sum(my_counts['count']) my_counts[
LOOK AT THE RESULTS - Is it what you expected? Do most people travel with little ones? What were the largest and smallest values?
show(my_counts)
hotel | little_ones | count | proportion |
---|---|---|---|
Loading ITables v2.1.4 from the internet...
(need help?) |
Grouping the data frame into chunks
Grouping allows us to grab data in categorical groups and then apply a function to all of the variables in those groups.
In the code below we first have Python make groups using the .groupby command. In this case we are grouping by the ‘hotel’ column which has two different options (City Hotel and Resort Hotel).
DF_raw_hotels.groupby(by=['hotel'])
Once the data is grouped we can do different operations. Here we do a sum() over all of the variables by adding .sum() to the end.
DF_raw_hotels.groupby(by=['hotel']).sum()
=['hotel']).sum() DF_raw_hotels.groupby(by
is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | ... | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | little_ones | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hotel | |||||||||||||||||||||
City Hotel | 33102 | 8705335 | 159943106 | JulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJu... | 2155987 | 1252353 | 63082 | 173174 | 146838 | 7248.0 | ... | 2003876.0 | 535907.0 | 255980 | TransientTransientTransientTransientTransientT... | 8353803.24 | 1933 | 43387 | Check-OutCanceledCanceledCanceledCanceledCance... | 2015-07-032015-07-012015-04-302015-06-232015-0... | 7640.0 |
Resort Hotel | 11122 | 3712588 | 80765825 | JulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJu... | 1087263 | 633799 | 47664 | 125337 | 74798 | 5155.0 | ... | 6929877.0 | 750539.0 | 21142 | TransientTransientTransientTransientTransientT... | 3803814.36 | 5531 | 24828 | Check-OutCheck-OutCheck-OutCheck-OutCheck-OutC... | 2015-07-012015-07-012015-07-022015-07-022015-0... | 5712.0 |
2 rows × 32 columns
Notice some of the categorical columns are really messy! That is because summing means “adding” and for categorical values this just makes one big long combined word.
Let’s select just columns we care about, columns that are numerical:
= ['adults','children','little_ones','babies']
my_columns =['hotel'])[my_columns].sum() DF_raw_hotels.groupby(by
adults | children | little_ones | babies | |
---|---|---|---|---|
hotel | ||||
City Hotel | 146838 | 7248.0 | 7640.0 | 392 |
Resort Hotel | 74798 | 5155.0 | 5712.0 | 557 |
There are lots of different operaitons you could use:
- .min()
- .max()
- .mean()
- .median()
- .sum()
- .prod()
- .count()
- .describe()
Lets see what happens if you try something else:
=['hotel'])[my_columns].describe()) show(DF_raw_hotels.groupby(by
adults | children | little_ones | babies | |||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
hotel | ||||||||||||||||||||||||||||||||
Loading ITables v2.1.4 from the internet...
(need help?) |
Here we see that .describe() does all the descriptive statistics for each of columns, broken into the two groups.
Grouping to quickly generate data
The .groupby() function lets you quickly generate data about things you care about. In just one command below we are able to look at the total number of occupants for each of the columns (adults, children, little_ones, and babies) for each month in the data set.
=['arrival_date_month'])[my_columns].sum() DF_raw_hotels.groupby(by
adults | children | little_ones | babies | |
---|---|---|---|---|
arrival_date_month | ||||
April | 20806 | 1141.0 | 1194.0 | 53 |
August | 27795 | 2780.0 | 2976.0 | 196 |
December | 12382 | 736.0 | 814.0 | 78 |
February | 14450 | 790.0 | 849.0 | 59 |
January | 10024 | 452.0 | 500.0 | 48 |
July | 25164 | 2322.0 | 2443.0 | 121 |
June | 20353 | 1057.0 | 1128.0 | 71 |
March | 17675 | 700.0 | 757.0 | 57 |
May | 21539 | 845.0 | 917.0 | 72 |
November | 11488 | 279.0 | 324.0 | 45 |
October | 20279 | 703.0 | 765.0 | 62 |
September | 19681 | 598.0 | 685.0 | 87 |
Q Try calculating the mean, max, and min for the above data.
# Here are some exercises for you to try
Our first big-ish problem:
Now you will try to use our hotels data set to answer some questions! You will need to use the tools we have learned so far to look into the data and try to answer the questions. Really try to answer the questions with as few hints as possible!
IF YOU USE MY ANSWER - avoid plagiarism
To avoid plagiarism, if you end up copying my answer and using it, YOU MUST explain clearly what each part of the code is doing. It’s okay to need to use my code at this point, but it is not helpful to your learning to blindly copy and paste without trying to understand.
Exercise 1
Are people traveling on a whim?
How could you find out?
Try creating a mask filtering for hotel bookings where the guest is not from the US (country
code "USA"
) and the lead_time
is less than 1 day.
Here is some code where you can fill in the blanks:
mask = (DF_raw_hotels[__________]!=__________) & (DF_raw_hotels[__________]<__________) DF_final = DF_raw_hotels[mask]
What proportion of people traveled last minute?
To get the total number of people who traveled last minute, count up the number of results you get (after masking) using the .len() command.
The proportion is the length of the masked data frame divide by the length of the original data frame.
mask = (DF_raw_hotels['country']!='USA') & (DF_raw_hotels['lead_time']<1)
DF_final = DF_raw_hotels[mask]
len(DF_final)
# Get the proportion
len(DF_final)/len(DF_raw_hotels)
Exercise 2
How many bookings involve at least 1 child or baby?
You will need to use more advanced masking here. Try using the | operator.
Try to fill in the blanks to create the mask
mask = (DF_raw_hotels[______________]>__________) | (DF_raw_hotels[__________]>_________)
mask = (DF_raw_hotels['children']>0) | (DF_raw_hotels['babies']>0)
DF_final = DF_raw_hotels[mask]
len(DF_final)
Exercise 3
Do you think it’s more likely to find bookings with children or babies in city hotels or resort hotels? Test your intuition.
This problem is asking you to count up how many bookings had either children or babies, but to group the data into two chunks based on the hotel type.
Use the groupby() command to group by the ‘hotel’ column and focus on just the [‘children’,‘babies’] before taking the sum
DF_final.groupby(___________________)[_______________________].______________
kids = DF_final.groupby('hotel')[['children','babies']].sum()
kids = kids.reset_index().rename(columns={"index": "value", 0: "count"})
kids
Exercise 4
Create a frequency table of the number of adults
in a booking. Display the results in descending order so the most common observation is on top. What is the most common number of adults in bookings in this dataset? Are there any surprising results?
A frequency table is the same as doing .value_counts() on a column that you care about.
Here is some code where you can fill in the blanks
DF_raw_hotels[______________]._______________
adults = DF_raw_hotels['adults'].value_counts()
adults = adults.reset_index().rename(columns={"index": "value", 0: "count"})
adults
Exercise 5
Repeat Exercise 5, once for canceled bookings (is_canceled
coded as 1) and once for not canceled bookings (is_canceled
coded as 0). What does this reveal about the surprising results you spotted in the previous exercise?
A frequency table is the same as doing .value_counts() on a column that you care about. This time you first want to group by the column ‘is_canceled’
Here is some code where you can fill in the blanks
DF_raw_hotels.groupby(_____________)[___________].______________
adults_canceled = DF_raw_hotels.groupby('is_canceled')['adults'].value_counts()
adults_canceled = adults_canceled.reset_index().rename(columns={"index": "value", 0: "count"})
adults_canceled
Exercise 6
Calculate minimum, mean, median, and maximum average daily rate (column labeled ‘adr’) grouped by hotel
type so that you can get these statistics separately for resort and city hotels. (Hint - try using .describe() after doing group by)
Which type of hotel is higher, on average?
The easiest way to do this is to first group by the hotel type then select the column and then do the statistics.
You can either use .describe() to get the info all at once OR do separate calculations .min(), .mean(), .median(), .max()
DF_raw_hotels.groupby('hotel')['adr'].describe()
Exercise 7
We observe two unusual values in the summary statistics above – a negative minimum, and a very high maximum). What types of hotels are these?
Locate these observations in the data set and find out the arrival date (year and month) as well as how many people (adults, children, and babies) stayed in the room.
You can investigate the data in the viewer to locate these values, but preferably you should identify them in a reproducible way with some code.
Try doing this in two separate cells:
- Mask the data from for where ‘adr’ = the max value, then select the columns you care about showing
- Mask the data from for where ‘adr’ = the min value, then select the columns you care about showing
Here is the list of columns we care about:
my_columns =[‘hotel’,‘arrival_date_year’,‘arrival_date_month’,‘adults’,‘children’,‘babies’,‘is_canceled’,‘adr’]
Beware of typos.
# Find the max
lookup = 5400
# My columns
my_columns =['hotel','arrival_date_year','arrival_date_month','adults','children','babies','is_canceled','adr']
mask = DF_raw_hotels['adr']==lookup
show(DF_raw_hotels[mask][my_columns])
# Find the min
lookup = -6.38
mask = DF_raw_hotels['adr']==lookup
show(DF_raw_hotels[mask][my_columns])