Introduction to Data Science

Important Information

Email: joanna_bieri@redlands.edu
Office Hours: Duke 209 Click Here for Joanna’s Schedule

Announcements

Please come to office hours to get help!

Day 6 Assignment - same drill.

Make sure Pull any new content from the class repo - then Copy it over into your working diretory.
Open the file Day3-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
Get as far as you can before class.
Submit what you have so far Commit and Push to Git.
Take the daily check in quiz on Canvas.
Come to class with lots of questions!

—————————–

The Data Science Lifecycle

Today we will continue talking about Data Wrangling.

    graph LR
    subgraph Data Science Initial Exploration
    A((Subject Mater Understanding <br> Define the Question))-->B((Data Gathering <br> Mining <br> Ethical Qusetions )) 
    B-->C((Data Cleaning <br> Wrangling))
    C-->D((Data Exploration <br> Visualization <br> Exploratory Data Analysis))
    end
    
    subgraph Data Science Modeling
    D-->E((Feature Engineering <br> Data Preparation))
    E-->F((Predictive Modeling))
    F-->G((Data Visualization <br> Communication <br> Sharing))
    end
    G-->A

    style C fill:#f9f,stroke:#333,stroke-width:4px

Figure 1: Data Science Lifecycle

Last Class - Definitions

Make sure you can define and come up with examples for

Raw Data
Tidy Data
Displayed Data
Summarized Data

Last Class - Code

We learned how to select subsets of the data and sort the data. You should be able to:

Find all the column names and print them.
Slice the data by selecting a subset of the columns
Sort the data either ascending or descending
Mask the data using logical operators

This class

Our goal is to continue learning commands that helps us retrieve pieces of the data. Today we will:

Learn how to do more advanced masking
Find count unique values with value_counts() and drop_duplicates()
Add new columns
Use group_by() to do statistics on groups

This lab follows the Data Science in a Box lab “Working with a single data frame” by Mine Çetinkaya-Rundel. The content has been updated for our class and translated to Python by Joanna Bieri.

Load the Data and look at the basic info

Same data as last class!

Data from two hotels: one resort and one city hotel
Observations: Each row represents a hotel booking
Goal for original data collection: Development of prediction models to classify a hotel booking’s likelihood to be canceled Antonia et al., 2019

Data Information: Tidy Tuesday

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.defaule = 'colab'

from itables import show

file_location = 'https://joannabieri.com/introdatascience/data/hotels.csv'
DF_raw_hotels = pd.read_csv(file_location)

show(DF_raw_hotels)

hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	children	babies	meal	country	market_segment	distribution_channel	is_repeated_guest	previous_cancellations	previous_bookings_not_canceled	reserved_room_type	assigned_room_type	booking_changes	deposit_type	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date
Loading ITables v2.1.4 from the internet... (need help?)

# I always look at the column names - they have to be spelled exactly!
columns_list = list(DF_raw_hotels.keys())
print(columns_list)

['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

More advanced boolean masks - combining masks

Last time we learned about boolean masks. We were able to use a mask to only show certain rows or columns in the data.

Basic Operators

Operator	Definition
<	less than
>	greater than
<=	less than or equal to
>=	greater than or equal to
==	exactly equal to
!=	not equal to

Advanced Operators

Operator	Definition
and	check if two things are both true
or	check if one of two things is true
in	checks if something is in another thing
!	not checks if something is false

In an effort to keep things simple, we learned how to apply these masks one at a time.

Let’s answer the question:

Show the results for Resort Hotel where the reservation was canceled (is_canceled=1)

mask1 = (DF_raw_hotels['hotel']=='Resort Hotel')
DF_hotels_mask=DF_raw_hotels[mask1]
mask2 = (DF_hotels_mask['is_canceled']==1)
DF_final = DF_hotels_mask[mask2]
DF_final

	hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	...	deposit_type	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date
8	Resort Hotel	1	85	2015	July	27	1	0	3	2	...	No Deposit	240.0	NaN	0	Transient	82.00	0	1	Canceled	2015-05-06
9	Resort Hotel	1	75	2015	July	27	1	0	3	2	...	No Deposit	15.0	NaN	0	Transient	105.50	0	0	Canceled	2015-04-22
10	Resort Hotel	1	23	2015	July	27	1	0	4	2	...	No Deposit	240.0	NaN	0	Transient	123.00	0	0	Canceled	2015-06-23
27	Resort Hotel	1	60	2015	July	27	1	2	5	2	...	No Deposit	240.0	NaN	0	Transient	107.00	0	2	Canceled	2015-05-11
32	Resort Hotel	1	96	2015	July	27	1	2	8	2	...	No Deposit	NaN	NaN	0	Transient	108.30	0	2	Canceled	2015-05-29
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32682	Resort Hotel	1	0	2017	March	12	20	1	0	1	...	No Deposit	NaN	88.0	0	Transient	35.00	0	0	Canceled	2017-03-20
32684	Resort Hotel	1	0	2017	March	12	20	1	0	1	...	No Deposit	NaN	88.0	0	Transient	35.00	0	0	Canceled	2017-03-20
32835	Resort Hotel	1	20	2017	February	6	7	0	1	2	...	No Deposit	250.0	NaN	0	Transient	45.00	0	1	Canceled	2017-02-07
34707	Resort Hotel	1	0	2017	August	34	24	0	1	1	...	No Deposit	NaN	NaN	0	Transient	190.00	0	0	Canceled	2017-08-24
36837	Resort Hotel	1	258	2017	May	22	31	2	4	2	...	No Deposit	143.0	NaN	0	Transient	68.95	0	1	No-Show	2017-05-31

11122 rows × 32 columns

Example of Combining Masks

Once you understand the above code…

You can merge this command into one line using:

Operator	Definition
&	check if two things are both true (bitwise = works on multiple values)
\|	check if one of two things is true (bitwise = works on multiple values)

The code below does the exact same operation - combining the two conditions into just one mask

# Mask hotel='Resort Hotel' and 'is_canceled'=1
mask = (DF_raw_hotels['hotel']=='Resort Hotel') & (DF_raw_hotels['is_canceled']==1)
DF_raw_hotels[mask]

	hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	...	deposit_type	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date
8	Resort Hotel	1	85	2015	July	27	1	0	3	2	...	No Deposit	240.0	NaN	0	Transient	82.00	0	1	Canceled	2015-05-06
9	Resort Hotel	1	75	2015	July	27	1	0	3	2	...	No Deposit	15.0	NaN	0	Transient	105.50	0	0	Canceled	2015-04-22
10	Resort Hotel	1	23	2015	July	27	1	0	4	2	...	No Deposit	240.0	NaN	0	Transient	123.00	0	0	Canceled	2015-06-23
27	Resort Hotel	1	60	2015	July	27	1	2	5	2	...	No Deposit	240.0	NaN	0	Transient	107.00	0	2	Canceled	2015-05-11
32	Resort Hotel	1	96	2015	July	27	1	2	8	2	...	No Deposit	NaN	NaN	0	Transient	108.30	0	2	Canceled	2015-05-29
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32682	Resort Hotel	1	0	2017	March	12	20	1	0	1	...	No Deposit	NaN	88.0	0	Transient	35.00	0	0	Canceled	2017-03-20
32684	Resort Hotel	1	0	2017	March	12	20	1	0	1	...	No Deposit	NaN	88.0	0	Transient	35.00	0	0	Canceled	2017-03-20
32835	Resort Hotel	1	20	2017	February	6	7	0	1	2	...	No Deposit	250.0	NaN	0	Transient	45.00	0	1	Canceled	2017-02-07
34707	Resort Hotel	1	0	2017	August	34	24	0	1	1	...	No Deposit	NaN	NaN	0	Transient	190.00	0	0	Canceled	2017-08-24
36837	Resort Hotel	1	258	2017	May	22	31	2	4	2	...	No Deposit	143.0	NaN	0	Transient	68.95	0	1	No-Show	2017-05-31

11122 rows × 32 columns

Another Example of Combining Masks

This time we will ask:

Show only results for visitors from USA or GBR.

mask = (DF_raw_hotels['country']=='USA') | (DF_raw_hotels['country']=='GBR')
DF_raw_hotels[mask]

	hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	...	deposit_type	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date
2	Resort Hotel	0	7	2015	July	27	1	0	1	1	...	No Deposit	NaN	NaN	0	Transient	75.00	0	0	Check-Out	2015-07-02
3	Resort Hotel	0	13	2015	July	27	1	0	1	1	...	No Deposit	304.0	NaN	0	Transient	75.00	0	0	Check-Out	2015-07-02
4	Resort Hotel	0	14	2015	July	27	1	0	2	2	...	No Deposit	240.0	NaN	0	Transient	98.00	0	1	Check-Out	2015-07-03
5	Resort Hotel	0	14	2015	July	27	1	0	2	2	...	No Deposit	240.0	NaN	0	Transient	98.00	0	1	Check-Out	2015-07-03
12	Resort Hotel	0	68	2015	July	27	1	0	4	2	...	No Deposit	240.0	NaN	0	Transient	97.00	0	3	Check-Out	2015-07-05
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
119358	City Hotel	0	135	2017	August	35	30	1	4	2	...	No Deposit	14.0	NaN	0	Transient	143.10	0	1	Check-Out	2017-09-04
119369	City Hotel	0	207	2017	August	34	25	3	7	2	...	No Deposit	14.0	NaN	0	Transient	96.25	0	2	Check-Out	2017-09-04
119378	City Hotel	0	247	2017	August	35	31	1	3	2	...	No Deposit	42.0	NaN	0	Transient	86.85	0	0	Check-Out	2017-09-04
119379	City Hotel	0	109	2017	August	35	31	1	3	2	...	No Deposit	9.0	NaN	0	Transient	155.00	0	1	Check-Out	2017-09-04
119388	City Hotel	0	109	2017	August	35	31	2	5	2	...	No Deposit	89.0	NaN	0	Transient	104.40	0	0	Check-Out	2017-09-07

14226 rows × 32 columns

Finding unique values

Often we want to see how many different categories are contained in a categorical column. This lets us see what the choices are for that entry.

Unique in one column

For example, in the hotel data, maybe we want to see what possible values the column ‘market_segment’ can take. Sometimes this helps us to understand a column and it always lets us see what values are in the data set. We will use .drop_duplicates() to see the unique values.

NOTICE - I am building from things we learned in our past classes. Here are I selecting just one column

my_columns = ['market_segment']

then grabbing that column of the data frame

DF_raw_hotels[my_columns]

then applying the drop_duplicates() command

DF_raw_hotels[my_columns].drop_duplicates()

my_columns = ['market_segment']
DF_raw_hotels[my_columns].drop_duplicates()

	market_segment
0	Direct
3	Corporate
4	Online TA
9	Offline TA/TO
125	Complementary
413	Groups
40600	Undefined
49013	Aviation

Unique across two columns

We can expand the number of columns we care about and the command .drop_duplicates() will look at BOTH columns for duplicates. In the code below you see that we see duplicates if we read down just one column, but there are no duplicates if we consider both columns.

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].drop_duplicates()

	market_segment	customer_type
0	Direct	Transient
3	Corporate	Transient
4	Online TA	Transient
9	Offline TA/TO	Transient
16	Offline TA/TO	Contract
47	Offline TA/TO	Transient-Party
125	Complementary	Transient
127	Online TA	Transient-Party
260	Direct	Transient-Party
413	Groups	Transient-Party
417	Groups	Transient
539	Online TA	Group
1530	Corporate	Transient-Party
1539	Direct	Group
1587	Offline TA/TO	Group
2262	Corporate	Contract
2263	Direct	Contract
2698	Groups	Contract
9511	Complementary	Transient-Party
14051	Complementary	Group
14654	Groups	Group
15141	Corporate	Group
15261	Complementary	Contract
15506	Online TA	Contract
40600	Undefined	Transient-Party
49013	Aviation	Transient-Party
49372	Aviation	Transient
112102	Aviation	Group

Counting unique values in one column

Sometimes instead of a list of unique values we want to see counts of how often those labels were seen in the data. This is called: Frequency, Value Counts, Frequency Counts, etc.

In Python we can use the .value_counts() command. We have seen this before.

my_columns = ['market_segment']
DF_raw_hotels[my_columns].value_counts()

market_segment
Online TA         56477
Offline TA/TO     24219
Groups            19811
Direct            12606
Corporate          5295
Complementary       743
Aviation            237
Undefined             2
Name: count, dtype: int64

Counting unique values in two columns

BUT did you know that you can do these kinds of counts in more than one column!?!? All you have to do is select multiple columns and the value count will happen across those columns.

Notice below that now we separate Online TA into four subgroups:

Online TA       Transient          51299
Online TA       Transient-Party     3513
Online TA       Contract            1486
Online TA       Group                179

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].value_counts()

market_segment  customer_type  
Online TA       Transient          51299
Offline TA/TO   Transient          14054
Direct          Transient          11336
Groups          Transient-Party    10633
                Transient           8427
Offline TA/TO   Transient-Party     8137
Corporate       Transient           3576
Online TA       Transient-Party     3513
Offline TA/TO   Contract            1817
Corporate       Transient-Party     1668
Online TA       Contract            1486
Direct          Transient-Party     1122
Groups          Contract             735
Complementary   Transient            703
Aviation        Transient            218
Offline TA/TO   Group                211
Online TA       Group                179
Direct          Group                134
Complementary   Transient-Party       32
Corporate       Group                 29
                Contract              22
Aviation        Transient-Party       17
Groups          Group                 16
Direct          Contract              14
Complementary   Group                  6
Aviation        Group                  2
Complementary   Contract               2
Undefined       Transient-Party        2
Name: count, dtype: int64

Counting unique values - Unsorted

Sometimes the way that .value_counts() sorts the data in descending order, actually makes the data harder to read! You can use the flag sort=False to stop this sorting.

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].value_counts(sort=False)

market_segment  customer_type  
Aviation        Group                  2
                Transient            218
                Transient-Party       17
Complementary   Contract               2
                Group                  6
                Transient            703
                Transient-Party       32
Corporate       Contract              22
                Group                 29
                Transient           3576
                Transient-Party     1668
Direct          Contract              14
                Group                134
                Transient          11336
                Transient-Party     1122
Groups          Contract             735
                Group                 16
                Transient           8427
                Transient-Party    10633
Offline TA/TO   Contract            1817
                Group                211
                Transient          14054
                Transient-Party     8137
Online TA       Contract            1486
                Group                179
                Transient          51299
                Transient-Party     3513
Undefined       Transient-Party        2
Name: count, dtype: int64

Counting unique values - Ascending

You can also ask to sort the data ascending with ascending=True

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].value_counts(ascending=True)

market_segment  customer_type  
Aviation        Group                  2
Complementary   Contract               2
Undefined       Transient-Party        2
Complementary   Group                  6
Direct          Contract              14
Groups          Group                 16
Aviation        Transient-Party       17
Corporate       Contract              22
                Group                 29
Complementary   Transient-Party       32
Direct          Group                134
Online TA       Group                179
Offline TA/TO   Group                211
Aviation        Transient            218
Complementary   Transient            703
Groups          Contract             735
Direct          Transient-Party     1122
Online TA       Contract            1486
Corporate       Transient-Party     1668
Offline TA/TO   Contract            1817
Online TA       Transient-Party     3513
Corporate       Transient           3576
Offline TA/TO   Transient-Party     8137
Groups          Transient           8427
                Transient-Party    10633
Direct          Transient          11336
Offline TA/TO   Transient          14054
Online TA       Transient          51299
Name: count, dtype: int64

Saving Unique Values to a Data Frame

You see that the code above returns something that is not so pretty to look at, and for me not as easy to work with. So most of the time you will want to convert the results of a value count back into a data frame. Here is a break down of what this code does:

First we do the value counts like normal and then save the information in the variable my_counts

  my_columns = ['market_segment','customer_type']
  my_counts = DF_raw_hotels[my_columns].value_counts()

Then we do something call a rest and rename. To do this we add the command .reset_index().rename(columns={“index”: “value”, 0: “count”}). In this class you can just copy and past this after the variable name.
```
  my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
```

Now you will see that when we show the value counts they look much nicer AND we will be able to do calculations with these columns.

my_columns = ['market_segment','customer_type']
my_counts = DF_raw_hotels[my_columns].value_counts()
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
show(my_counts)

market_segment	customer_type	count
Loading ITables v2.1.4 from the internet... (need help?)

Adding a column to a Data Frame

In many cases your data will have information across multiple columns that you will want to combine. You can do this by adding a new column to your data frame.

In the raw data there are two columns ‘children’ and ‘babies’, but what if I actually wanted to know the total number of little ones that families were traveling with. In other words, I wanted to combine the information about children and babies (add it up). We do this my adding a new column!

First you call your data frame and put in a name that is not already in the columns:
```
  DF_raw_hotels['little_ones']
```

Then you set that equal to the calculation you want to do

  = DF_raw_hotels['children'] + DF_raw_hotels['babies']

This will save the new calculation in a column with the name ‘little_ones’

Scroll all the way to the right and you will see the new variable (column)

DF_raw_hotels['little_ones'] = DF_raw_hotels['children'] + DF_raw_hotels['babies']

DF_raw_hotels

	hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	...	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date	little_ones
0	Resort Hotel	0	342	2015	July	27	1	0	0	2	...	NaN	NaN	0	Transient	0.00	0	0	Check-Out	2015-07-01	0.0
1	Resort Hotel	0	737	2015	July	27	1	0	0	2	...	NaN	NaN	0	Transient	0.00	0	0	Check-Out	2015-07-01	0.0
2	Resort Hotel	0	7	2015	July	27	1	0	1	1	...	NaN	NaN	0	Transient	75.00	0	0	Check-Out	2015-07-02	0.0
3	Resort Hotel	0	13	2015	July	27	1	0	1	1	...	304.0	NaN	0	Transient	75.00	0	0	Check-Out	2015-07-02	0.0
4	Resort Hotel	0	14	2015	July	27	1	0	2	2	...	240.0	NaN	0	Transient	98.00	0	1	Check-Out	2015-07-03	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
119385	City Hotel	0	23	2017	August	35	30	2	5	2	...	394.0	NaN	0	Transient	96.14	0	0	Check-Out	2017-09-06	0.0
119386	City Hotel	0	102	2017	August	35	31	2	5	3	...	9.0	NaN	0	Transient	225.43	0	2	Check-Out	2017-09-07	0.0
119387	City Hotel	0	34	2017	August	35	31	2	5	2	...	9.0	NaN	0	Transient	157.71	0	4	Check-Out	2017-09-07	0.0
119388	City Hotel	0	109	2017	August	35	31	2	5	2	...	89.0	NaN	0	Transient	104.40	0	0	Check-Out	2017-09-07	0.0
119389	City Hotel	0	205	2017	August	35	29	2	7	2	...	9.0	NaN	0	Transient	151.20	0	2	Check-Out	2017-09-07	0.0

119390 rows × 33 columns

Q Count up the number of little_ones in different hotel types (Hint - use value_counts() with hotel and little_ones as your columns.)

Q Add a new column that calculates the total number of humans (all_humans) in the room (adults+children+babies) (Hint - you can copy and paste the code above where we calculated little_ones and add on another thing)

Q Can you figure out what the code below is doing - comment on each line

remember you can add a comment using the hashtag #

# PLEASE ADD YOUR COMMENTS!
DF_raw_hotels['little_ones'] = DF_raw_hotels['children'] + DF_raw_hotels['babies']
my_columns = ['hotel','little_ones']
my_counts = DF_raw_hotels[my_columns].value_counts()
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
my_counts['proportion']=my_counts['count']/sum(my_counts['count'])

LOOK AT THE RESULTS - Is it what you expected? Do most people travel with little ones? What were the largest and smallest values?

show(my_counts)

hotel	little_ones	count	proportion
Loading ITables v2.1.4 from the internet... (need help?)

Grouping the data frame into chunks

Grouping allows us to grab data in categorical groups and then apply a function to all of the variables in those groups.

In the code below we first have Python make groups using the .groupby command. In this case we are grouping by the ‘hotel’ column which has two different options (City Hotel and Resort Hotel).

DF_raw_hotels.groupby(by=['hotel'])

Once the data is grouped we can do different operations. Here we do a sum() over all of the variables by adding .sum() to the end.

DF_raw_hotels.groupby(by=['hotel']).sum()

DF_raw_hotels.groupby(by=['hotel']).sum()

	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	children	...	agent	company	days_in_waiting_list	customer_type	adr	required_car_parking_spaces	total_of_special_requests	reservation_status	reservation_status_date	little_ones
hotel
City Hotel	33102	8705335	159943106	JulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJu...	2155987	1252353	63082	173174	146838	7248.0	...	2003876.0	535907.0	255980	TransientTransientTransientTransientTransientT...	8353803.24	1933	43387	Check-OutCanceledCanceledCanceledCanceledCance...	2015-07-032015-07-012015-04-302015-06-232015-0...	7640.0
Resort Hotel	11122	3712588	80765825	JulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJu...	1087263	633799	47664	125337	74798	5155.0	...	6929877.0	750539.0	21142	TransientTransientTransientTransientTransientT...	3803814.36	5531	24828	Check-OutCheck-OutCheck-OutCheck-OutCheck-OutC...	2015-07-012015-07-012015-07-022015-07-022015-0...	5712.0

2 rows × 32 columns

Notice some of the categorical columns are really messy! That is because summing means “adding” and for categorical values this just makes one big long combined word.

Let’s select just columns we care about, columns that are numerical:

my_columns = ['adults','children','little_ones','babies']
DF_raw_hotels.groupby(by=['hotel'])[my_columns].sum()

	adults	children	little_ones	babies
hotel
City Hotel	146838	7248.0	7640.0	392
Resort Hotel	74798	5155.0	5712.0	557

There are lots of different operaitons you could use:

.min()
.max()
.mean()
.median()
.sum()
.prod()
.count()
.describe()

Lets see what happens if you try something else:

show(DF_raw_hotels.groupby(by=['hotel'])[my_columns].describe())

	adults								children								little_ones								babies
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
hotel
Loading ITables v2.1.4 from the internet... (need help?)

Here we see that .describe() does all the descriptive statistics for each of columns, broken into the two groups.

Grouping to quickly generate data

The .groupby() function lets you quickly generate data about things you care about. In just one command below we are able to look at the total number of occupants for each of the columns (adults, children, little_ones, and babies) for each month in the data set.

DF_raw_hotels.groupby(by=['arrival_date_month'])[my_columns].sum()

	adults	children	little_ones	babies
arrival_date_month
April	20806	1141.0	1194.0	53
August	27795	2780.0	2976.0	196
December	12382	736.0	814.0	78
February	14450	790.0	849.0	59
January	10024	452.0	500.0	48
July	25164	2322.0	2443.0	121
June	20353	1057.0	1128.0	71
March	17675	700.0	757.0	57
May	21539	845.0	917.0	72
November	11488	279.0	324.0	45
October	20279	703.0	765.0	62
September	19681	598.0	685.0	87

Q Try calculating the mean, max, and min for the above data.

# Here are some exercises for you to try

Our first big-ish problem:

Now you will try to use our hotels data set to answer some questions! You will need to use the tools we have learned so far to look into the data and try to answer the questions. Really try to answer the questions with as few hints as possible!

IF YOU USE MY ANSWER - avoid plagiarism

To avoid plagiarism, if you end up copying my answer and using it, YOU MUST explain clearly what each part of the code is doing. It’s okay to need to use my code at this point, but it is not helpful to your learning to blindly copy and paste without trying to understand.

Exercise 1

Are people traveling on a whim?

How could you find out?

Try creating a mask filtering for hotel bookings where the guest is not from the US (country code "USA") and the lead_time is less than 1 day.

Hint 1

Here is some code where you can fill in the blanks:

mask = (DF_raw_hotels[__________]!=__________) & (DF_raw_hotels[__________]<__________) DF_final = DF_raw_hotels[mask]

What proportion of people traveled last minute?

Hint 2

To get the total number of people who traveled last minute, count up the number of results you get (after masking) using the .len() command.

Hint 3

The proportion is the length of the masked data frame divide by the length of the original data frame.

My Answer

mask = (DF_raw_hotels['country']!='USA') & (DF_raw_hotels['lead_time']<1)
DF_final = DF_raw_hotels[mask]
len(DF_final)
# Get the proportion
len(DF_final)/len(DF_raw_hotels)

Exercise 2

How many bookings involve at least 1 child or baby?

Hint 1

You will need to use more advanced masking here. Try using the | operator.

Hint 2

Try to fill in the blanks to create the mask

mask = (DF_raw_hotels[______________]>__________) | (DF_raw_hotels[__________]>_________)

My Answer

mask = (DF_raw_hotels['children']>0) | (DF_raw_hotels['babies']>0)
DF_final = DF_raw_hotels[mask]
len(DF_final)

Exercise 3

Do you think it’s more likely to find bookings with children or babies in city hotels or resort hotels? Test your intuition.

Hint 1

This problem is asking you to count up how many bookings had either children or babies, but to group the data into two chunks based on the hotel type.

Hint 2

Use the groupby() command to group by the ‘hotel’ column and focus on just the [‘children’,‘babies’] before taking the sum

DF_final.groupby(___________________)[_______________________].______________

My Answer

kids = DF_final.groupby('hotel')[['children','babies']].sum()
kids = kids.reset_index().rename(columns={"index": "value", 0: "count"})
kids

Exercise 4

Create a frequency table of the number of adults in a booking. Display the results in descending order so the most common observation is on top. What is the most common number of adults in bookings in this dataset? Are there any surprising results?

Hint 1

A frequency table is the same as doing .value_counts() on a column that you care about.

Hint 2

Here is some code where you can fill in the blanks

DF_raw_hotels[______________]._______________

My Answer

adults = DF_raw_hotels['adults'].value_counts()
adults = adults.reset_index().rename(columns={"index": "value", 0: "count"})
adults

Exercise 5

Repeat Exercise 5, once for canceled bookings (is_canceled coded as 1) and once for not canceled bookings (is_canceled coded as 0). What does this reveal about the surprising results you spotted in the previous exercise?

Hint 1

A frequency table is the same as doing .value_counts() on a column that you care about. This time you first want to group by the column ‘is_canceled’

Hint 2

Here is some code where you can fill in the blanks

DF_raw_hotels.groupby(_____________)[___________].______________

My Answer

adults_canceled = DF_raw_hotels.groupby('is_canceled')['adults'].value_counts()
adults_canceled = adults_canceled.reset_index().rename(columns={"index": "value", 0: "count"})
adults_canceled

Exercise 6

Calculate minimum, mean, median, and maximum average daily rate (column labeled ‘adr’) grouped by hotel type so that you can get these statistics separately for resort and city hotels. (Hint - try using .describe() after doing group by)

Which type of hotel is higher, on average?

Hint 1

The easiest way to do this is to first group by the hotel type then select the column and then do the statistics.

Hint 2

You can either use .describe() to get the info all at once OR do separate calculations .min(), .mean(), .median(), .max()

My Answer

DF_raw_hotels.groupby('hotel')['adr'].describe()

Exercise 7

We observe two unusual values in the summary statistics above – a negative minimum, and a very high maximum). What types of hotels are these?

Locate these observations in the data set and find out the arrival date (year and month) as well as how many people (adults, children, and babies) stayed in the room.

You can investigate the data in the viewer to locate these values, but preferably you should identify them in a reproducible way with some code.

Hint 1

Try doing this in two separate cells:

Mask the data from for where ‘adr’ = the max value, then select the columns you care about showing
Mask the data from for where ‘adr’ = the min value, then select the columns you care about showing

Hint 2

Here is the list of columns we care about:

my_columns =[‘hotel’,‘arrival_date_year’,‘arrival_date_month’,‘adults’,‘children’,‘babies’,‘is_canceled’,‘adr’]

Beware of typos.

My Answer - max

# Find the max
lookup = 5400
# My columns
my_columns  =['hotel','arrival_date_year','arrival_date_month','adults','children','babies','is_canceled','adr']


mask = DF_raw_hotels['adr']==lookup
show(DF_raw_hotels[mask][my_columns])

My Answer - min

# Find the min
lookup = -6.38
mask = DF_raw_hotels['adr']==lookup
show(DF_raw_hotels[mask][my_columns])