import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
Introduction to Data Science
Numerical vs. Categorical Data
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
Please come to office hours to get help!
Remember to come to lab! Wednesday 6-8pm Duke 206.
Day 4 Assignment
- Make sure to Pull any new content from the class repo, then Copy it over into your working directory.
- Open the file Day4-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far: Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
—————————–
Data Types:
Already we have defined:
- Structured Data
- Unstructured Data
- Semi-structured Data
Q Write down a definition of these in your own words and give an example.
These definitions have to do with how the data is presented to you (how it is organized or accessed) but there are even more ways to think about data types.
Number of Variables and Fancy Words:
The number of variables involved impacts the analysis
- Univariate - single variable
- Salaries of employees at a company
- Number of patients treated each day
- Bivariate - two variables
- Temperature vs ice cream sales
- Age vs blood pressure
- Study time vs test score
- Multivariate - many variables
- Health study about diet, exercise, blood pressure, and disease.
- Temperature, humidity, rainfall, and wind speed over time for given locations.
- Market research, age, income, purchasing habits, and education.
Types of Variables:
- Numerical variables - values that are numbers: scores, percents, counts, time, price.
- Categorical - values that are not numbers: yes/no, colors, brands, words, preferences, letter grades, birthdates, race, sex, gender.
Numerical Data
Numerical variables can be classified as
- continuous - infinite number of possible values, subdivisions (real numbers)
- temperature
- weight
- price
- discrete - only specific values (integers)
- birth year
- hours in the day
Categorical Data
Categorical variables can be classified as
- ordinal - has some natural ordering.
- names can be ordered alphabetically
- Olympic medals can be ordered gold, silver, bronze
- nominal - lacks natural ordering
- blood type
- hair color
- movie genre
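To make the ordinal versus nominal distinction concrete, here is a small sketch using the pandas `Categorical` type. The medal and blood type values are made up for illustration. Marking the medals as `ordered=True` is what lets pandas sort and compare them by rank instead of alphabetically.

```python
import pandas as pd

# Ordinal: categories are listed from lowest to highest rank
medals = pd.Series(pd.Categorical(
    ['silver', 'gold', 'bronze', 'gold'],
    categories=['bronze', 'silver', 'gold'],
    ordered=True))

# Nominal: no ordering is declared
blood_type = pd.Series(pd.Categorical(['A', 'O', 'AB', 'O']))

# Sorting respects the declared order, not the alphabet
print(medals.sort_values().tolist())
# A "lowest" and "highest" medal make sense because the data is ordinal
print(medals.min(), medals.max())
# The nominal column reports that it has no ordering
print(blood_type.cat.ordered)
```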
Q Give your own example of each of the data types: Numerical (continuous and discrete) and Categorical (ordinal and nominal).
Explore some of these data types
Load the Data
file_location = 'https://joannabieri.com/introdatascience/data/loans_full_schema.csv'
DF = pd.read_csv(file_location)
DF
index | emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | ... | sub_grade | issue_month | loan_status | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | global config engineer | 3.0 | NJ | MORTGAGE | 90000.0 | Verified | 18.01 | NaN | NaN | NaN | ... | C3 | Mar-2018 | Current | whole | Cash | 27015.86 | 1999.33 | 984.14 | 1015.19 | 0.0 |
1 | warehouse office clerk | 10.0 | HI | RENT | 40000.0 | Not Verified | 5.04 | NaN | NaN | NaN | ... | C1 | Feb-2018 | Current | whole | Cash | 4651.37 | 499.12 | 348.63 | 150.49 | 0.0 |
2 | assembly | 3.0 | WI | RENT | 40000.0 | Source Verified | 21.15 | NaN | NaN | NaN | ... | D1 | Feb-2018 | Current | fractional | Cash | 1824.63 | 281.80 | 175.37 | 106.43 | 0.0 |
3 | customer service | 1.0 | PA | RENT | 30000.0 | Not Verified | 10.16 | NaN | NaN | NaN | ... | A3 | Jan-2018 | Current | whole | Cash | 18853.26 | 3312.89 | 2746.74 | 566.15 | 0.0 |
4 | security supervisor | 10.0 | CA | RENT | 35000.0 | Verified | 57.96 | 57000.0 | Verified | 37.66 | ... | C3 | Mar-2018 | Current | whole | Cash | 21430.15 | 2324.65 | 1569.85 | 754.80 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | owner | 10.0 | TX | RENT | 108000.0 | Source Verified | 22.28 | NaN | NaN | NaN | ... | A4 | Jan-2018 | Current | whole | Cash | 21586.34 | 2969.80 | 2413.66 | 556.14 | 0.0 |
9996 | director | 8.0 | PA | MORTGAGE | 121000.0 | Verified | 32.38 | NaN | NaN | NaN | ... | D3 | Feb-2018 | Current | whole | Cash | 9147.44 | 1456.31 | 852.56 | 603.75 | 0.0 |
9997 | toolmaker | 10.0 | CT | MORTGAGE | 67000.0 | Verified | 45.26 | 107000.0 | Source Verified | 29.57 | ... | E2 | Feb-2018 | Current | fractional | Cash | 27617.65 | 4620.80 | 2382.35 | 2238.45 | 0.0 |
9998 | manager | 1.0 | WI | MORTGAGE | 80000.0 | Source Verified | 11.99 | NaN | NaN | NaN | ... | A1 | Feb-2018 | Current | whole | Cash | 21518.12 | 2873.31 | 2481.88 | 391.43 | 0.0 |
9999 | operations analyst | 3.0 | CT | RENT | 66000.0 | Not Verified | 20.82 | NaN | NaN | NaN | ... | B4 | Feb-2018 | Current | whole | Cash | 11574.83 | 1658.56 | 1225.17 | 433.39 | 0.0 |
10000 rows × 55 columns
Q How many observations are there?
Q How many variables are there?
HINT: You can use the code from last class where we looked at the output from DF.shape
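As a reminder of how `.shape` reads, here is a tiny made-up DataFrame (not the loans data). Rows are observations and columns are variables, and `.shape` reports them in that order.

```python
import pandas as pd

# A tiny example table: 3 observations (rows), 2 variables (columns)
toy = pd.DataFrame({'state': ['NJ', 'HI', 'WI'],
                    'loan_amount': [28000, 5000, 2000]})

# .shape is a pair: (number of rows, number of columns)
n_observations, n_variables = toy.shape
print(n_observations, n_variables)
```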
Reduce the number of variables
Let’s look at a subset of the variables since there are so many!
Here I choose a few of the column names to focus on and store them in “my_variables”. Then the next command reduces the dataframe to just have those columns.
my_variables = ['loan_amount',
                'interest_rate',
                'term','grade',
                'state',
                'annual_income',
                'homeownership',
                'debt_to_income']
DF = DF[my_variables]
DF
index | loan_amount | interest_rate | term | grade | state | annual_income | homeownership | debt_to_income |
---|---|---|---|---|---|---|---|---|
0 | 28000 | 14.07 | 60 | C | NJ | 90000.0 | MORTGAGE | 18.01 |
1 | 5000 | 12.61 | 36 | C | HI | 40000.0 | RENT | 5.04 |
2 | 2000 | 17.09 | 36 | D | WI | 40000.0 | RENT | 21.15 |
3 | 21600 | 6.72 | 36 | A | PA | 30000.0 | RENT | 10.16 |
4 | 23000 | 14.07 | 36 | C | CA | 35000.0 | RENT | 57.96 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 24000 | 7.35 | 36 | A | TX | 108000.0 | RENT | 22.28 |
9996 | 10000 | 19.03 | 36 | D | PA | 121000.0 | MORTGAGE | 32.38 |
9997 | 30000 | 23.88 | 36 | E | CT | 67000.0 | MORTGAGE | 45.26 |
9998 | 24000 | 5.32 | 36 | A | WI | 80000.0 | MORTGAGE | 11.99 |
9999 | 12800 | 10.91 | 36 | B | CT | 66000.0 | RENT | 20.82 |
10000 rows × 8 columns
Q Check out each of the variables (columns):
- What does each column tell you? What are the units?
- Is the data numerical? If so is it continuous or discrete?
- Is the data categorical? If so is it ordinal or nominal?
Here is a link to the full data description if you need to look up some of the column names.
variable | type |
---|---|
loan_amount | numerical, continuous |
interest_rate | numerical, continuous |
term | numerical, discrete |
grade | categorical, ordinal |
state | categorical, nominal |
annual_income | numerical, continuous |
homeownership | categorical, nominal |
debt_to_income | numerical, continuous |
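pandas can give a first guess at numerical versus categorical columns based on how the values are stored. The sketch below uses three rows copied from the table above. The stored type is only a hint (for example, `term` is stored as a number even though it takes very few distinct values), so you still need to think about what each column means.

```python
import pandas as pd

# Three rows copied from the reduced loans table
sample = pd.DataFrame({
    'loan_amount': [28000, 5000, 2000],
    'interest_rate': [14.07, 12.61, 17.09],
    'term': [60, 36, 36],
    'grade': ['C', 'C', 'D'],
    'homeownership': ['MORTGAGE', 'RENT', 'RENT']})

# Columns stored as numbers
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
# Everything else (here: text columns)
other_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```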
—————————–
Visualizing Numerical Data
Summary Statistics
With numerical data it is possible to quickly do summary statistics. We might want to know things like:
- Measures of Center - mean, median
- Measures of Spread - range (max-min), standard deviation, inter-quartile range (IQR=Q3-Q1)
DF.describe()
statistic | loan_amount | interest_rate | term | annual_income | debt_to_income |
---|---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 1.000000e+04 | 9976.000000 |
mean | 16361.922500 | 12.427524 | 43.272000 | 7.922215e+04 | 19.308192 |
std | 10301.956759 | 5.001105 | 11.029877 | 6.473429e+04 | 15.004851 |
min | 1000.000000 | 5.310000 | 36.000000 | 0.000000e+00 | 0.000000 |
25% | 8000.000000 | 9.430000 | 36.000000 | 4.500000e+04 | 11.057500 |
50% | 14500.000000 | 11.980000 | 36.000000 | 6.500000e+04 | 17.570000 |
75% | 24000.000000 | 15.050000 | 60.000000 | 9.500000e+04 | 25.002500 |
max | 40000.000000 | 30.940000 | 60.000000 | 2.300000e+06 | 469.090000 |
Notice that the .describe() operation automatically drops the categorical data since it does not make sense to do these statistics on that data.
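To see where the numbers in the `.describe()` table come from, here is a sketch that computes the same measures one at a time. The values are the ten `loan_amount` entries visible in the table preview above, so the results will not match the full-data summary.

```python
import pandas as pd

# The ten loan amounts shown in the table preview
loans = pd.Series([28000, 5000, 2000, 21600, 23000,
                   24000, 10000, 30000, 24000, 12800])

# Measures of center
mean = loans.mean()
median = loans.median()

# Measures of spread
data_range = loans.max() - loans.min()
std = loans.std()
q1 = loans.quantile(0.25)
q3 = loans.quantile(0.75)
iqr = q3 - q1

print(mean, median, data_range, std, iqr)
```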
We are also interested in:
- The Shape of the Data - right-skewed, left-skewed, symmetric, unimodal, bimodal, etc.
- Observations of Outliers - use the IQR or look visually
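One common convention for the IQR approach (an assumption here, since these notes do not fix a specific rule) is to flag values more than 1.5 times the IQR below Q1 or above Q3. A minimal sketch on made-up debt-to-income values:

```python
import pandas as pd

# Made-up values with one suspiciously large entry
dti = pd.Series([5.0, 10.2, 11.9, 18.0, 20.8, 21.2, 22.3, 32.4, 45.3, 469.1])

q1 = dti.quantile(0.25)
q3 = dti.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] gets flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = dti[(dti < lower) | (dti > upper)]

print(lower, upper)
print(outliers.tolist())
```

The 1.5 multiplier is a rule of thumb, not a law. A flagged value deserves a closer look; it is not automatically a mistake.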
Histograms
A histogram is a plot of the count of the number of data points within a range (bin). Imagine taking each data point and assigning it to a range.
We will start simple and just plot a histogram of loan amounts. Here is what this does:
- Look at the data range (minimum loan amount = 1000 and maximum loan amount = 40000) then break this range up into bins (in this case 1000 dollar bins)
- Now look at each observation. If the loan amount was between 1000-1990 it goes in bin 1, between 2000-2990 it goes in bin 2, and so on.
- Count up the number of observations in each bin - this is the bar height.
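The three steps above can be sketched with NumPy's `np.histogram`, which does the binning and counting without drawing anything. The loan amounts and the bin edges below are made up to keep the output small. NumPy treats each bin as including its left edge and excluding its right edge, except the last bin, which includes both.

```python
import numpy as np

# Made-up loan amounts
loan_amounts = np.array([1000, 1500, 2000, 2500, 2900, 3000, 3999])

# Three bins of width 1000: [1000, 2000), [2000, 3000), [3000, 4000]
bin_edges = [1000, 2000, 3000, 4000]

# counts[i] is the number of observations in bin i (the bar height)
counts, edges = np.histogram(loan_amounts, bins=bin_edges)
print(counts)
```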
Here is a simple histogram
fig = px.histogram(DF,
                   x='loan_amount')
fig.show()
Now to make this nicer let's do a few things:
- Change the color just for fun (color_discrete_sequence)
- Add a gap between the bars (bargap)
- Give it a title that we want centered (title_x=0.5)
- Add some gaps on the left and right hand side (xaxis={'range':[0, 42000]})
fig = px.histogram(DF,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[0, 42000]})
fig.show()
In this case we let plotly decide what the number of bins (or bars) should be. Most of the time you will want to be in charge of this!
- I added the nbins=10 argument
fig = px.histogram(DF,
                   nbins=10,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()
Q Try changing the number of bins (nbins). What do you notice? Are there good choices? Bad choices?
Sometimes we want to specify the number of bins and sometimes we want to specify the bin width. These two things are related! We start by calculating the range:
range = max - min
then
bin width = (range) / (number of bins)
or
number of bins = (range) / (bin width)
Here is how to make Python do this work for you:
# First get the range!
max_val = max(DF['loan_amount'])
min_val = min(DF['loan_amount'])
data_range = max_val - min_val

# Say we know the width we want
bin_width = 1000
# Calculate the number of bins
nbins = data_range/bin_width
print(nbins)
39.0
Then I can use nbins=39 to make the plot with the width I want. If this value is a decimal I would round up or down (my choice)
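If the division does not come out even, one option is to round up so the bins still cover the whole range. The sketch below uses the minimum and maximum loan amounts from the summary table and an assumed bin width of 2500 (chosen only for illustration). `math.ceil` rounds up and returns a whole number that can be passed as `nbins`.

```python
import math

# Minimum and maximum loan amounts from the summary table
max_val = 40000
min_val = 1000
data_range = max_val - min_val

# Assumed bin width for this example
bin_width = 2500

nbins_exact = data_range / bin_width   # not a whole number
nbins = math.ceil(nbins_exact)         # round up to the next whole number

print(nbins_exact, nbins)
```

Note that plotly documents the underlying bin-count setting as a maximum, so the plot may end up with fewer, rounder bins than requested. It is worth checking the figure after you set `nbins`.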
fig = px.histogram(DF,
                   nbins=39,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()
Customizing Histograms
There are lots of things we might want to do to make our histograms look more fancy. This might include
- Better labels on the axes
- Breaking up the columns to show a categorical variable
- Separating the data by a categorical variable
Better labels
I can rename the axes just by taking their default names and changing them.
- I relabel the x-axis with Loan Amount ($)
- I relabel the y-axis with Frequency
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'],
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency")
fig.show()
Adding a categorical fill
Here we can overlay a categorical value on top of our numerical histogram. For example maybe we have the question: What is the break down of loan amounts among people who own, rent or have a mortgage? We can add this information to our histogram.
- Now I add color="homeownership" - this colors the areas of the bar based on the categorical data found in the homeownership column.
- I also add a title for the legend: legend_title="Homeownership"
- And just for fun I adjusted opacity=0.5 to make the bars a bit see-through
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership")
fig.show()
If I want to specify my own colors I can add a discrete color map, where I tell it what color to make each of the categories.
color_discrete_map={'MORTGAGE': 'darkviolet',
'RENT': 'deeppink',
'OWN': 'darkturquoise'}
You have to spell things EXACTLY right otherwise Python will yell at you!
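A quick way to catch a misspelled category before plotting is to compare the keys of the color map with the values that actually occur in the column. This is a sketch on a few made-up rows; `color_map` and `missing` are names chosen just for this example.

```python
import pandas as pd

# A few made-up rows standing in for the loans data
toy = pd.DataFrame({'homeownership': ['MORTGAGE', 'RENT', 'OWN', 'RENT']})

# Note the typo: 'Rent' instead of 'RENT'
color_map = {'MORTGAGE': 'darkviolet',
             'Rent': 'deeppink',
             'OWN': 'darkturquoise'}

# Categories in the data that have no color assigned to them
missing = set(toy['homeownership'].unique()) - set(color_map)
print(missing)
```

An empty set means every category in the column has a matching key.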
# With different colors
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet',
                                       'RENT': 'deeppink',
                                       'OWN': 'darkturquoise'}
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership")
fig.show()
Breaking histogram into Facets
Facets let you break apart the categorical data so that you can look at the histograms for each individual category.
Now I have to add quite a few things to the code!
facet_col='homeownership' gives the categorical data used in each separate histogram
facet_col_wrap=1 means I want one graph on each line
Now there are a bunch of weird lines that just make the labels look good - this starts to get overwhelming but here goes…
Take out extra text from the title of each plot
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
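By default each facet title looks like `homeownership=RENT`. The lambda keeps only the part after the `=` sign. Here is the same string operation on its own, outside of plotly:

```python
# Facet titles in the default "column=value" form
titles = ['homeownership=MORTGAGE', 'homeownership=RENT', 'homeownership=OWN']

# split("=") breaks the text at the equals sign; [-1] takes the last piece
short_titles = [t.split("=")[-1] for t in titles]
print(short_titles)
```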
Add the word frequency on each plot
fig.update_yaxes(title_text='Frequency', col=1)
fig.update_yaxes(title_text='Frequency', col=2)
fig.update_yaxes(title_text='Frequency', col=3)
Then I want to make the graph bigger so we can actually see all the information so I select a size using the commands:
autosize=False, width=800, height=500
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   facet_col='homeownership',
                   facet_col_wrap=1,
                   color='homeownership',
                   color_discrete_map={'MORTGAGE': 'darkviolet',
                                       'RENT': 'deeppink',
                                       'OWN': 'darkturquoise'}
                   )
# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.update_yaxes(title_text='Frequency', col=1)
fig.update_yaxes(title_text='Frequency', col=2)
fig.update_yaxes(title_text='Frequency', col=3)

fig.update_layout(bargap=0.02,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
You can do this!
You really can.
Take a deep breath and look back at all the crazy code above. In most of the cells I have just been adding a few new things. I like to SLOWLY build up my graphs. Here is how I start:
fig = px.histogram(DF,
x='loan_amount')
fig.show()
Then I notice that I would like some labels
fig = px.histogram(DF,
x='loan_amount')
fig.update_layout(title='Histogram of Loan Amounts.',
title_x=0.5,
xaxis_title="Loan Amount ($)",
yaxis_title='Frequency')
fig.show()
and then maybe I change the colors….
You can do this!
You really can.
Q Create a histogram of your own! Try making a histogram of one of the other pieces of numerical data. Make it as fancy as you want. Include some categorical information. Do you learn anything from your graph? If so what?
Adding Distribution Information to Histogram
A distribution (probability distribution) tells you what the most common (likely) values of a variable are.
The histogram gets at this information. The taller the bar the more of the data that fell into that bin, so if you were randomly going to grab one of the data points, it would more likely come from a “tall” bin.
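To turn bar heights into chances, divide each bin count by the total number of data points. The sketch below uses made-up loan amounts and bins; the resulting fractions add up to 1.

```python
import numpy as np

# Made-up loan amounts
loan_amounts = np.array([2000, 5000, 8000, 9000, 12000,
                         14000, 15000, 24000, 30000, 38000])

# Four bins of width 10000 covering 0 to 40000
counts, edges = np.histogram(loan_amounts,
                             bins=[0, 10000, 20000, 30000, 40000])

# Fraction of the data in each bin:
# the chance that a randomly chosen data point comes from that bin
fractions = counts / counts.sum()
print(counts)
print(fractions)
```

The tallest bar has the largest fraction, which is the sense in which a random data point is most likely to come from a "tall" bin.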
The marginal argument adds information to the plot that can help you understand the underlying data (distribution).
- box: Adds a box plot along the margin, summarizing the distribution. Notice if you hover over this plot it will give you the summary statistics.
- violin: Adds a violin plot along the margin, allowing you to visualize the distribution as a curve.
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet',
                                       'RENT': 'deeppink',
                                       'OWN': 'darkturquoise'},
                   marginal="box"
                   )

fig.update_layout(bargap=0.0,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[-1000, 46000]},
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=600)
fig.show()
Q Change the above plot to marginal="violin" and see what changes.
Distribution Information
If you just want to consider the distribution of the data, you can plot the violin or box plots separately.
fig = px.violin(DF,
                x='loan_amount',
                facet_col='homeownership',
                facet_col_wrap=1,
                color='homeownership',
                color_discrete_map={'MORTGAGE': 'darkviolet',
                                    'RENT': 'deeppink',
                                    'OWN': 'darkturquoise'},
                width=800,
                height=600)

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()
fig = px.box(DF,
             x='loan_amount',
             facet_col='homeownership',
             facet_col_wrap=1,
             color='homeownership',
             color_discrete_map={'MORTGAGE': 'darkviolet',
                                 'RENT': 'deeppink',
                                 'OWN': 'darkturquoise'},
             width=800,
             height=600)

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()
————————————
Scatter Plots
We already saw scatter plots, but here is a refresher.
Scatter plots are a great way to compare relationships between numerical values. This helps us get at the idea of how one variable might depend on the other - how they change in relation to one another.
Remember that you can change the colors and markers as much as you want.
fig = px.scatter(DF,
                 x='debt_to_income',
                 y='interest_rate',
                 color_discrete_sequence=['lightseagreen'])

fig.update_layout(title='Debt to Income Ratio vs. Interest Rate',
                  title_x=0.5,
                  xaxis_title="Debt to Income Ratio",
                  yaxis_title="Interest Rate",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
Heat Maps
Another way to compare two variables is through a heat map. A heat map combines the idea of a scatter plot (you have x and y locations) and a histogram (you count how many points are within a rectangle).
You don’t have to specify the number of bins in the x and y location. Just leave this out if you want plotly to pick for you.
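The counting behind a heat map can be sketched with NumPy's `np.histogram2d`, which returns how many points fall in each rectangle. The points and the 2 by 2 grid below are made up for illustration.

```python
import numpy as np

# Made-up (debt_to_income, interest_rate) pairs
x = np.array([5, 8, 12, 30, 35, 38])
y = np.array([6, 7, 15, 8, 18, 19])

# Two bins in each direction:
# x is split at 20 (range 0 to 40), y is split at 12.5 (range 5 to 20)
counts, x_edges, y_edges = np.histogram2d(x, y, bins=[2, 2],
                                          range=[[0, 40], [5, 20]])

# counts[i, j] is the number of points in x-bin i and y-bin j
print(counts)
```

A heat map then colors each rectangle by its count, so darker or brighter cells mark where the points pile up.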
fig = px.density_heatmap(DF,
                         x='debt_to_income',
                         y='interest_rate',
                         nbinsx=100,
                         nbinsy=50
                         )

fig.update_layout(title='Debt to Income Ratio vs. Interest Rate',
                  title_x=0.5,
                  xaxis_title="Debt to Income Ratio",
                  yaxis_title="Interest Rate",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
—————————-
Visualizing Categorical Data
For categorical data we have to take a different approach. First of all we can’t do simple summary statistics (unless we first convert our categorical data meaningfully into numerical data).
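One meaningful conversion (an illustration, not something these notes rely on later) is to replace ordered categories with ranks. Here loan grades are mapped to numbers, assuming A is the best grade and each step down counts as 1. The grades are the ten values visible in the table preview, and the mapping `grade_to_number` is hypothetical.

```python
import pandas as pd

# The ten grades shown in the table preview
grades = pd.Series(['C', 'C', 'D', 'A', 'C', 'A', 'D', 'E', 'A', 'B'])

# Hypothetical scoring: A=1 (best), B=2, and so on
grade_to_number = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
grade_numbers = grades.map(grade_to_number)

# Once the grades are ranks, a median makes sense
print(grade_numbers.tolist())
print(grade_numbers.median())
```

This only works because grades are ordinal. Doing the same to a nominal column such as state would produce numbers with no real meaning.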
Bar Plot
A bar plot puts counts of the categorical data into a graph. It is similar to a histogram, except we use the categories as bins. You can also make bar plots to compare categorical and numerical data.
Let’s make a bar plot of the number of people in our data set who rent, own, or have a mortgage.
First let’s ask Python to count up the number of items in each category. Luckily Pandas can do this automatically. We just have to ask our data to look at the column “homeownership” and then count up the different values. This is just good data to have.
counts = DF['homeownership'].value_counts()
counts
homeownership
MORTGAGE 4789
RENT 3858
OWN 1353
Name: count, dtype: int64
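Counts are often easier to compare as proportions. `value_counts(normalize=True)` returns the fraction of rows in each category. The sketch below uses ten made-up rows rather than the full data, and the names `toy_counts` and `toy_proportions` are chosen just for this example.

```python
import pandas as pd

# Ten made-up rows
toy = pd.Series(['MORTGAGE', 'RENT', 'MORTGAGE', 'OWN', 'RENT',
                 'MORTGAGE', 'RENT', 'MORTGAGE', 'MORTGAGE', 'OWN'],
                name='homeownership')

# Number of rows in each category
toy_counts = toy.value_counts()
# Fraction of rows in each category (adds up to 1)
toy_proportions = toy.value_counts(normalize=True)

print(toy_counts)
print(toy_proportions)
```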
Now we can make a bar plot where we put the homeownership information on the x-axis.
I had to add a command:
fig.update_traces(dict(marker_line_width=0))
to make the bars a nice solid color.
fig = px.bar(DF,
             x='homeownership',
             color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()
Look what happens if we change our data to the y-axis.
fig = px.bar(DF,
             y='homeownership',
             color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()
You can choose what direction you want your bar plots to go!
Customizing Bar Plots
Similar to histograms we can add all sorts of extra information to our bar plots!
Better labels
Q Can you figure out how to add x labels, y labels, and a title to this graph?
fig = px.bar(DF,
             x='homeownership',
             color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()
Q Try to make your own bar plot of one of the other categorical columns.
Adding categorical fill
You will notice that the commands are really similar to what we did above with histograms. Here we choose a column to identify the color, instead of giving a single color.
fig = px.bar(DF,
             x='homeownership',
             color='grade')
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()
Breaking bar graph into Facets
fig = px.bar(DF,
             x='homeownership',
             facet_col='grade',
             facet_col_wrap=2,
             color='grade')
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()
There are so many types of graphs!!!
Examples of what plotly can do!
This is just the start of what you can do using Python to create graphs.
Be patient, it takes time to build up a good graph. Start simple and slowly build up adding one thing at a time.
There are even other more fancy graphing modules that you can try out, but for now, let’s just get good at plotly.