Introduction to Data Science

Data Wrangling Continued

Author

Joanna Bieri
DATA101

Important Information

Announcements

Please come to office hours to get help!

Day 6 Assignment - same drill.

  1. Make sure Pull any new content from the class repo - then Copy it over into your working diretory.
  2. Open the file Day3-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  3. Get as far as you can before class.
  4. Submit what you have so far Commit and Push to Git.
  5. Take the daily check in quiz on Canvas.
  6. Come to class with lots of questions!

—————————–

The Data Science Lifecycle

Today we will continue talking about Data Wrangling.

    graph LR
    subgraph Data Science Initial Exploration
    A((Subject Mater Understanding <br> Define the Question))-->B((Data Gathering <br> Mining <br> Ethical Qusetions )) 
    B-->C((Data Cleaning <br> Wrangling))
    C-->D((Data Exploration <br> Visualization <br> Exploratory Data Analysis))
    end
    
    subgraph Data Science Modeling
    D-->E((Feature Engineering <br> Data Preparation))
    E-->F((Predictive Modeling))
    F-->G((Data Visualization <br> Communication <br> Sharing))
    end
    G-->A

    style C fill:#f9f,stroke:#333,stroke-width:4px

Figure 1: Data Science Lifecycle

Last Class - Definitions

Make sure you can define and come up with examples for

  • Raw Data
  • Tidy Data
  • Displayed Data
  • Summarized Data

Last Class - Code

We learned how to select subsets of the data and sort the data. You should be able to:

  • Find all the column names and print them.
  • Slice the data by selecting a subset of the columns
  • Sort the data either ascending or descending
  • Mask the data using logical operators

This class

Our goal is to continue learning commands that helps us retrieve pieces of the data. Today we will:

  • Learn how to do more advanced masking
  • Find count unique values with value_counts() and drop_duplicates()
  • Add new columns
  • Use group_by() to do statistics on groups

This lab follows the Data Science in a Box lab “Working with a single data frame” by Mine Çetinkaya-Rundel. The content has been updated for our class and translated to Python by Joanna Bieri.

Load the Data and look at the basic info

Same data as last class!

  • Data from two hotels: one resort and one city hotel
  • Observations: Each row represents a hotel booking
  • Goal for original data collection: Development of prediction models to classify a hotel booking’s likelihood to be canceled Antonia et al., 2019

Data Information: Tidy Tuesday

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.defaule = 'colab'

from itables import show
file_location = 'https://joannabieri.com/introdatascience/data/hotels.csv'
DF_raw_hotels = pd.read_csv(file_location)
show(DF_raw_hotels)
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
Loading ITables v2.1.4 from the internet... (need help?)
# I always look at the column names - they have to be spelled exactly!
columns_list = list(DF_raw_hotels.keys())
print(columns_list)
['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

More advanced boolean masks - combining masks

Last time we learned about boolean masks. We were able to use a mask to only show certain rows or columns in the data.

Basic Operators

Operator Definition
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== exactly equal to
!= not equal to

Advanced Operators

Operator Definition
and check if two things are both true
or check if one of two things is true
in checks if something is in another thing
! not checks if something is false

In an effort to keep things simple, we learned how to apply these masks one at a time.

Let’s answer the question:

  • Show the results for Resort Hotel where the reservation was canceled (is_canceled=1)
mask1 = (DF_raw_hotels['hotel']=='Resort Hotel')
DF_hotels_mask=DF_raw_hotels[mask1]
mask2 = (DF_hotels_mask['is_canceled']==1)
DF_final = DF_hotels_mask[mask2]
DF_final
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
8 Resort Hotel 1 85 2015 July 27 1 0 3 2 ... No Deposit 240.0 NaN 0 Transient 82.00 0 1 Canceled 2015-05-06
9 Resort Hotel 1 75 2015 July 27 1 0 3 2 ... No Deposit 15.0 NaN 0 Transient 105.50 0 0 Canceled 2015-04-22
10 Resort Hotel 1 23 2015 July 27 1 0 4 2 ... No Deposit 240.0 NaN 0 Transient 123.00 0 0 Canceled 2015-06-23
27 Resort Hotel 1 60 2015 July 27 1 2 5 2 ... No Deposit 240.0 NaN 0 Transient 107.00 0 2 Canceled 2015-05-11
32 Resort Hotel 1 96 2015 July 27 1 2 8 2 ... No Deposit NaN NaN 0 Transient 108.30 0 2 Canceled 2015-05-29
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32682 Resort Hotel 1 0 2017 March 12 20 1 0 1 ... No Deposit NaN 88.0 0 Transient 35.00 0 0 Canceled 2017-03-20
32684 Resort Hotel 1 0 2017 March 12 20 1 0 1 ... No Deposit NaN 88.0 0 Transient 35.00 0 0 Canceled 2017-03-20
32835 Resort Hotel 1 20 2017 February 6 7 0 1 2 ... No Deposit 250.0 NaN 0 Transient 45.00 0 1 Canceled 2017-02-07
34707 Resort Hotel 1 0 2017 August 34 24 0 1 1 ... No Deposit NaN NaN 0 Transient 190.00 0 0 Canceled 2017-08-24
36837 Resort Hotel 1 258 2017 May 22 31 2 4 2 ... No Deposit 143.0 NaN 0 Transient 68.95 0 1 No-Show 2017-05-31

11122 rows × 32 columns

Example of Combining Masks

Once you understand the above code…

You can merge this command into one line using:

Operator Definition
& check if two things are both true (bitwise = works on multiple values)
| check if one of two things is true (bitwise = works on multiple values)

The code below does the exact same operation - combining the two conditions into just one mask

# Mask hotel='Resort Hotel' and 'is_canceled'=1
mask = (DF_raw_hotels['hotel']=='Resort Hotel') & (DF_raw_hotels['is_canceled']==1)
DF_raw_hotels[mask]
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
8 Resort Hotel 1 85 2015 July 27 1 0 3 2 ... No Deposit 240.0 NaN 0 Transient 82.00 0 1 Canceled 2015-05-06
9 Resort Hotel 1 75 2015 July 27 1 0 3 2 ... No Deposit 15.0 NaN 0 Transient 105.50 0 0 Canceled 2015-04-22
10 Resort Hotel 1 23 2015 July 27 1 0 4 2 ... No Deposit 240.0 NaN 0 Transient 123.00 0 0 Canceled 2015-06-23
27 Resort Hotel 1 60 2015 July 27 1 2 5 2 ... No Deposit 240.0 NaN 0 Transient 107.00 0 2 Canceled 2015-05-11
32 Resort Hotel 1 96 2015 July 27 1 2 8 2 ... No Deposit NaN NaN 0 Transient 108.30 0 2 Canceled 2015-05-29
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32682 Resort Hotel 1 0 2017 March 12 20 1 0 1 ... No Deposit NaN 88.0 0 Transient 35.00 0 0 Canceled 2017-03-20
32684 Resort Hotel 1 0 2017 March 12 20 1 0 1 ... No Deposit NaN 88.0 0 Transient 35.00 0 0 Canceled 2017-03-20
32835 Resort Hotel 1 20 2017 February 6 7 0 1 2 ... No Deposit 250.0 NaN 0 Transient 45.00 0 1 Canceled 2017-02-07
34707 Resort Hotel 1 0 2017 August 34 24 0 1 1 ... No Deposit NaN NaN 0 Transient 190.00 0 0 Canceled 2017-08-24
36837 Resort Hotel 1 258 2017 May 22 31 2 4 2 ... No Deposit 143.0 NaN 0 Transient 68.95 0 1 No-Show 2017-05-31

11122 rows × 32 columns

Another Example of Combining Masks

This time we will ask:

  • Show only results for visitors from USA or GBR.
mask = (DF_raw_hotels['country']=='USA') | (DF_raw_hotels['country']=='GBR')
DF_raw_hotels[mask]
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.00 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.00 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.00 0 1 Check-Out 2015-07-03
5 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.00 0 1 Check-Out 2015-07-03
12 Resort Hotel 0 68 2015 July 27 1 0 4 2 ... No Deposit 240.0 NaN 0 Transient 97.00 0 3 Check-Out 2015-07-05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119358 City Hotel 0 135 2017 August 35 30 1 4 2 ... No Deposit 14.0 NaN 0 Transient 143.10 0 1 Check-Out 2017-09-04
119369 City Hotel 0 207 2017 August 34 25 3 7 2 ... No Deposit 14.0 NaN 0 Transient 96.25 0 2 Check-Out 2017-09-04
119378 City Hotel 0 247 2017 August 35 31 1 3 2 ... No Deposit 42.0 NaN 0 Transient 86.85 0 0 Check-Out 2017-09-04
119379 City Hotel 0 109 2017 August 35 31 1 3 2 ... No Deposit 9.0 NaN 0 Transient 155.00 0 1 Check-Out 2017-09-04
119388 City Hotel 0 109 2017 August 35 31 2 5 2 ... No Deposit 89.0 NaN 0 Transient 104.40 0 0 Check-Out 2017-09-07

14226 rows × 32 columns

Finding unique values

Often we want to see how many different categories are contained in a categorical column. This lets us see what the choices are for that entry.

Unique in one column

For example, in the hotel data, maybe we want to see what possible values the column ‘market_segment’ can take. Sometimes this helps us to understand a column and it always lets us see what values are in the data set. We will use .drop_duplicates() to see the unique values.

NOTICE - I am building from things we learned in our past classes. Here are I selecting just one column

my_columns = ['market_segment']

then grabbing that column of the data frame

DF_raw_hotels[my_columns]

then applying the drop_duplicates() command

DF_raw_hotels[my_columns].drop_duplicates()   
my_columns = ['market_segment']
DF_raw_hotels[my_columns].drop_duplicates()
market_segment
0 Direct
3 Corporate
4 Online TA
9 Offline TA/TO
125 Complementary
413 Groups
40600 Undefined
49013 Aviation

Unique across two columns

We can expand the number of columns we care about and the command .drop_duplicates() will look at BOTH columns for duplicates. In the code below you see that we see duplicates if we read down just one column, but there are no duplicates if we consider both columns.

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].drop_duplicates()
market_segment customer_type
0 Direct Transient
3 Corporate Transient
4 Online TA Transient
9 Offline TA/TO Transient
16 Offline TA/TO Contract
47 Offline TA/TO Transient-Party
125 Complementary Transient
127 Online TA Transient-Party
260 Direct Transient-Party
413 Groups Transient-Party
417 Groups Transient
539 Online TA Group
1530 Corporate Transient-Party
1539 Direct Group
1587 Offline TA/TO Group
2262 Corporate Contract
2263 Direct Contract
2698 Groups Contract
9511 Complementary Transient-Party
14051 Complementary Group
14654 Groups Group
15141 Corporate Group
15261 Complementary Contract
15506 Online TA Contract
40600 Undefined Transient-Party
49013 Aviation Transient-Party
49372 Aviation Transient
112102 Aviation Group

Counting unique values in one column

Sometimes instead of a list of unique values we want to see counts of how often those labels were seen in the data. This is called: Frequency, Value Counts, Frequency Counts, etc.

In Python we can use the .value_counts() command. We have seen this before.

my_columns = ['market_segment']
DF_raw_hotels[my_columns].value_counts()
market_segment
Online TA         56477
Offline TA/TO     24219
Groups            19811
Direct            12606
Corporate          5295
Complementary       743
Aviation            237
Undefined             2
Name: count, dtype: int64

Counting unique values in two columns

BUT did you know that you can do these kinds of counts in more than one column!?!? All you have to do is select multiple columns and the value count will happen across those columns.

Notice below that now we separate Online TA into four subgroups:

Online TA       Transient          51299
Online TA       Transient-Party     3513
Online TA       Contract            1486
Online TA       Group                179
my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].value_counts()
market_segment  customer_type  
Online TA       Transient          51299
Offline TA/TO   Transient          14054
Direct          Transient          11336
Groups          Transient-Party    10633
                Transient           8427
Offline TA/TO   Transient-Party     8137
Corporate       Transient           3576
Online TA       Transient-Party     3513
Offline TA/TO   Contract            1817
Corporate       Transient-Party     1668
Online TA       Contract            1486
Direct          Transient-Party     1122
Groups          Contract             735
Complementary   Transient            703
Aviation        Transient            218
Offline TA/TO   Group                211
Online TA       Group                179
Direct          Group                134
Complementary   Transient-Party       32
Corporate       Group                 29
                Contract              22
Aviation        Transient-Party       17
Groups          Group                 16
Direct          Contract              14
Complementary   Group                  6
Aviation        Group                  2
Complementary   Contract               2
Undefined       Transient-Party        2
Name: count, dtype: int64

Counting unique values - Unsorted

Sometimes the way that .value_counts() sorts the data in descending order, actually makes the data harder to read! You can use the flag sort=False to stop this sorting.

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].value_counts(sort=False)
market_segment  customer_type  
Aviation        Group                  2
                Transient            218
                Transient-Party       17
Complementary   Contract               2
                Group                  6
                Transient            703
                Transient-Party       32
Corporate       Contract              22
                Group                 29
                Transient           3576
                Transient-Party     1668
Direct          Contract              14
                Group                134
                Transient          11336
                Transient-Party     1122
Groups          Contract             735
                Group                 16
                Transient           8427
                Transient-Party    10633
Offline TA/TO   Contract            1817
                Group                211
                Transient          14054
                Transient-Party     8137
Online TA       Contract            1486
                Group                179
                Transient          51299
                Transient-Party     3513
Undefined       Transient-Party        2
Name: count, dtype: int64

Counting unique values - Ascending

You can also ask to sort the data ascending with ascending=True

my_columns = ['market_segment','customer_type']
DF_raw_hotels[my_columns].value_counts(ascending=True)
market_segment  customer_type  
Aviation        Group                  2
Complementary   Contract               2
Undefined       Transient-Party        2
Complementary   Group                  6
Direct          Contract              14
Groups          Group                 16
Aviation        Transient-Party       17
Corporate       Contract              22
                Group                 29
Complementary   Transient-Party       32
Direct          Group                134
Online TA       Group                179
Offline TA/TO   Group                211
Aviation        Transient            218
Complementary   Transient            703
Groups          Contract             735
Direct          Transient-Party     1122
Online TA       Contract            1486
Corporate       Transient-Party     1668
Offline TA/TO   Contract            1817
Online TA       Transient-Party     3513
Corporate       Transient           3576
Offline TA/TO   Transient-Party     8137
Groups          Transient           8427
                Transient-Party    10633
Direct          Transient          11336
Offline TA/TO   Transient          14054
Online TA       Transient          51299
Name: count, dtype: int64

Saving Unique Values to a Data Frame

You see that the code above returns something that is not so pretty to look at, and for me not as easy to work with. So most of the time you will want to convert the results of a value count back into a data frame. Here is a break down of what this code does:

  • First we do the value counts like normal and then save the information in the variable my_counts

      my_columns = ['market_segment','customer_type']
      my_counts = DF_raw_hotels[my_columns].value_counts()
  • Then we do something call a rest and rename. To do this we add the command .reset_index().rename(columns={“index”: “value”, 0: “count”}). In this class you can just copy and past this after the variable name.

      my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})

Now you will see that when we show the value counts they look much nicer AND we will be able to do calculations with these columns.

my_columns = ['market_segment','customer_type']
my_counts = DF_raw_hotels[my_columns].value_counts()
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
show(my_counts)
market_segment customer_type count
Loading ITables v2.1.4 from the internet... (need help?)

Adding a column to a Data Frame

In many cases your data will have information across multiple columns that you will want to combine. You can do this by adding a new column to your data frame.

In the raw data there are two columns ‘children’ and ‘babies’, but what if I actually wanted to know the total number of little ones that families were traveling with. In other words, I wanted to combine the information about children and babies (add it up). We do this my adding a new column!

  • First you call your data frame and put in a name that is not already in the columns:

      DF_raw_hotels['little_ones']
  • Then you set that equal to the calculation you want to do

      = DF_raw_hotels['children'] + DF_raw_hotels['babies']
  • This will save the new calculation in a column with the name ‘little_ones’

Scroll all the way to the right and you will see the new variable (column)

DF_raw_hotels['little_ones'] = DF_raw_hotels['children'] + DF_raw_hotels['babies']
DF_raw_hotels
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date little_ones
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... NaN NaN 0 Transient 0.00 0 0 Check-Out 2015-07-01 0.0
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... NaN NaN 0 Transient 0.00 0 0 Check-Out 2015-07-01 0.0
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... NaN NaN 0 Transient 75.00 0 0 Check-Out 2015-07-02 0.0
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... 304.0 NaN 0 Transient 75.00 0 0 Check-Out 2015-07-02 0.0
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... 240.0 NaN 0 Transient 98.00 0 1 Check-Out 2015-07-03 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119385 City Hotel 0 23 2017 August 35 30 2 5 2 ... 394.0 NaN 0 Transient 96.14 0 0 Check-Out 2017-09-06 0.0
119386 City Hotel 0 102 2017 August 35 31 2 5 3 ... 9.0 NaN 0 Transient 225.43 0 2 Check-Out 2017-09-07 0.0
119387 City Hotel 0 34 2017 August 35 31 2 5 2 ... 9.0 NaN 0 Transient 157.71 0 4 Check-Out 2017-09-07 0.0
119388 City Hotel 0 109 2017 August 35 31 2 5 2 ... 89.0 NaN 0 Transient 104.40 0 0 Check-Out 2017-09-07 0.0
119389 City Hotel 0 205 2017 August 35 29 2 7 2 ... 9.0 NaN 0 Transient 151.20 0 2 Check-Out 2017-09-07 0.0

119390 rows × 33 columns

Q Count up the number of little_ones in different hotel types (Hint - use value_counts() with hotel and little_ones as your columns.)

Q Add a new column that calculates the total number of humans (all_humans) in the room (adults+children+babies) (Hint - you can copy and paste the code above where we calculated little_ones and add on another thing)

Q Can you figure out what the code below is doing - comment on each line

remember you can add a comment using the hashtag #

# PLEASE ADD YOUR COMMENTS!
DF_raw_hotels['little_ones'] = DF_raw_hotels['children'] + DF_raw_hotels['babies']
my_columns = ['hotel','little_ones']
my_counts = DF_raw_hotels[my_columns].value_counts()
my_counts = my_counts.reset_index().rename(columns={"index": "value", 0: "count"})
my_counts['proportion']=my_counts['count']/sum(my_counts['count'])

LOOK AT THE RESULTS - Is it what you expected? Do most people travel with little ones? What were the largest and smallest values?

show(my_counts)
hotel little_ones count proportion
Loading ITables v2.1.4 from the internet... (need help?)

Grouping the data frame into chunks

Grouping allows us to grab data in categorical groups and then apply a function to all of the variables in those groups.

In the code below we first have Python make groups using the .groupby command. In this case we are grouping by the ‘hotel’ column which has two different options (City Hotel and Resort Hotel).

DF_raw_hotels.groupby(by=['hotel'])

Once the data is grouped we can do different operations. Here we do a sum() over all of the variables by adding .sum() to the end.

DF_raw_hotels.groupby(by=['hotel']).sum()
DF_raw_hotels.groupby(by=['hotel']).sum()
is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children ... agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date little_ones
hotel
City Hotel 33102 8705335 159943106 JulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJu... 2155987 1252353 63082 173174 146838 7248.0 ... 2003876.0 535907.0 255980 TransientTransientTransientTransientTransientT... 8353803.24 1933 43387 Check-OutCanceledCanceledCanceledCanceledCance... 2015-07-032015-07-012015-04-302015-06-232015-0... 7640.0
Resort Hotel 11122 3712588 80765825 JulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJulyJu... 1087263 633799 47664 125337 74798 5155.0 ... 6929877.0 750539.0 21142 TransientTransientTransientTransientTransientT... 3803814.36 5531 24828 Check-OutCheck-OutCheck-OutCheck-OutCheck-OutC... 2015-07-012015-07-012015-07-022015-07-022015-0... 5712.0

2 rows × 32 columns

Notice some of the categorical columns are really messy! That is because summing means “adding” and for categorical values this just makes one big long combined word.

Let’s select just columns we care about, columns that are numerical:

my_columns = ['adults','children','little_ones','babies']
DF_raw_hotels.groupby(by=['hotel'])[my_columns].sum()
adults children little_ones babies
hotel
City Hotel 146838 7248.0 7640.0 392
Resort Hotel 74798 5155.0 5712.0 557

There are lots of different operaitons you could use:

  • .min()
  • .max()
  • .mean()
  • .median()
  • .sum()
  • .prod()
  • .count()
  • .describe()

Lets see what happens if you try something else:

show(DF_raw_hotels.groupby(by=['hotel'])[my_columns].describe())
adults children little_ones babies
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
hotel
Loading ITables v2.1.4 from the internet... (need help?)

Here we see that .describe() does all the descriptive statistics for each of columns, broken into the two groups.

Grouping to quickly generate data

The .groupby() function lets you quickly generate data about things you care about. In just one command below we are able to look at the total number of occupants for each of the columns (adults, children, little_ones, and babies) for each month in the data set.

DF_raw_hotels.groupby(by=['arrival_date_month'])[my_columns].sum()
adults children little_ones babies
arrival_date_month
April 20806 1141.0 1194.0 53
August 27795 2780.0 2976.0 196
December 12382 736.0 814.0 78
February 14450 790.0 849.0 59
January 10024 452.0 500.0 48
July 25164 2322.0 2443.0 121
June 20353 1057.0 1128.0 71
March 17675 700.0 757.0 57
May 21539 845.0 917.0 72
November 11488 279.0 324.0 45
October 20279 703.0 765.0 62
September 19681 598.0 685.0 87

Q Try calculating the mean, max, and min for the above data.

# Here are some exercises for you to try

Our first big-ish problem:

Now you will try to use our hotels data set to answer some questions! You will need to use the tools we have learned so far to look into the data and try to answer the questions. Really try to answer the questions with as few hints as possible!

IF YOU USE MY ANSWER - avoid plagiarism

To avoid plagiarism, if you end up copying my answer and using it, YOU MUST explain clearly what each part of the code is doing. It’s okay to need to use my code at this point, but it is not helpful to your learning to blindly copy and paste without trying to understand.

Exercise 1

Are people traveling on a whim?

How could you find out?

Try creating a mask filtering for hotel bookings where the guest is not from the US (country code "USA") and the lead_time is less than 1 day.

Here is some code where you can fill in the blanks:

mask = (DF_raw_hotels[__________]!=__________) & (DF_raw_hotels[__________]<__________) DF_final = DF_raw_hotels[mask]

What proportion of people traveled last minute?

To get the total number of people who traveled last minute, count up the number of results you get (after masking) using the .len() command.

The proportion is the length of the masked data frame divide by the length of the original data frame.

mask = (DF_raw_hotels['country']!='USA') & (DF_raw_hotels['lead_time']<1)
DF_final = DF_raw_hotels[mask]
len(DF_final)
# Get the proportion
len(DF_final)/len(DF_raw_hotels)

Exercise 2

How many bookings involve at least 1 child or baby?

You will need to use more advanced masking here. Try using the | operator.

Try to fill in the blanks to create the mask

mask = (DF_raw_hotels[______________]>__________) | (DF_raw_hotels[__________]>_________)

mask = (DF_raw_hotels['children']>0) | (DF_raw_hotels['babies']>0)
DF_final = DF_raw_hotels[mask]
len(DF_final)

Exercise 3

Do you think it’s more likely to find bookings with children or babies in city hotels or resort hotels? Test your intuition.

This problem is asking you to count up how many bookings had either children or babies, but to group the data into two chunks based on the hotel type.

Use the groupby() command to group by the ‘hotel’ column and focus on just the [‘children’,‘babies’] before taking the sum

DF_final.groupby(___________________)[_______________________].______________

kids = DF_final.groupby('hotel')[['children','babies']].sum()
kids = kids.reset_index().rename(columns={"index": "value", 0: "count"})
kids

Exercise 4

Create a frequency table of the number of adults in a booking. Display the results in descending order so the most common observation is on top. What is the most common number of adults in bookings in this dataset? Are there any surprising results?

A frequency table is the same as doing .value_counts() on a column that you care about.

Here is some code where you can fill in the blanks

DF_raw_hotels[______________]._______________

adults = DF_raw_hotels['adults'].value_counts()
adults = adults.reset_index().rename(columns={"index": "value", 0: "count"})
adults

Exercise 5

Repeat Exercise 5, once for canceled bookings (is_canceled coded as 1) and once for not canceled bookings (is_canceled coded as 0). What does this reveal about the surprising results you spotted in the previous exercise?

A frequency table is the same as doing .value_counts() on a column that you care about. This time you first want to group by the column ‘is_canceled’

Here is some code where you can fill in the blanks

DF_raw_hotels.groupby(_____________)[___________].______________

adults_canceled = DF_raw_hotels.groupby('is_canceled')['adults'].value_counts()
adults_canceled = adults_canceled.reset_index().rename(columns={"index": "value", 0: "count"})
adults_canceled

Exercise 6

Calculate minimum, mean, median, and maximum average daily rate (column labeled ‘adr’) grouped by hotel type so that you can get these statistics separately for resort and city hotels. (Hint - try using .describe() after doing group by)

Which type of hotel is higher, on average?

The easiest way to do this is to first group by the hotel type then select the column and then do the statistics.

You can either use .describe() to get the info all at once OR do separate calculations .min(), .mean(), .median(), .max()

DF_raw_hotels.groupby('hotel')['adr'].describe()

Exercise 7

We observe two unusual values in the summary statistics above – a negative minimum, and a very high maximum). What types of hotels are these?

Locate these observations in the data set and find out the arrival date (year and month) as well as how many people (adults, children, and babies) stayed in the room.

You can investigate the data in the viewer to locate these values, but preferably you should identify them in a reproducible way with some code.

Try doing this in two separate cells:

  1. Mask the data from for where ‘adr’ = the max value, then select the columns you care about showing
  2. Mask the data from for where ‘adr’ = the min value, then select the columns you care about showing

Here is the list of columns we care about:

my_columns =[‘hotel’,‘arrival_date_year’,‘arrival_date_month’,‘adults’,‘children’,‘babies’,‘is_canceled’,‘adr’]

Beware of typos.

# Find the max
lookup = 5400
# My columns
my_columns  =['hotel','arrival_date_year','arrival_date_month','adults','children','babies','is_canceled','adr']


mask = DF_raw_hotels['adr']==lookup
show(DF_raw_hotels[mask][my_columns])
# Find the min
lookup = -6.38
mask = DF_raw_hotels['adr']==lookup
show(DF_raw_hotels[mask][my_columns])