Introduction to Data Science

Numerical vs. Categorical Data

Author

Joanna Bieri
DATA101

Important Information

Email: joanna_bieri@redlands.edu
Office Hours: Duke 209 Click Here for Joanna’s Schedule

Announcements

Please come to office hours to get help!

Remember to come to lab! Wednesday 6-8pm Duke 206.

Day 4 Assignment

Make sure to Pull any new content from the class repo - then Copy it over into your working directory.
Open the file Day4-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
Get as far as you can before class.
Submit what you have so far: Commit and Push to Git.
Take the daily check in quiz on Canvas.
Come to class with lots of questions!

—————————–

Data Types:

Already we have defined:

Structured Data
Unstructured Data
Semi-structured Data

Q Write down a definition of these in your own words and give an example.

These definitions have to do with how the data is presented to you (how it is organized or accessed) but there are even more ways to think about data types.

Number of Variables and Fancy Words:

The number of variables involved impacts the analysis

Univariate - single variable
- Salaries of employees at a company
- Number of patients treated each day
Bivariate - two variables
- Temperature vs ice cream sales
- Age vs blood pressure
- Study time vs test score
Multivariate - many variables
- Health study about diet, exercise, blood pressure, and disease.
- Temperature, humidity, rainfall, and wind speed over time for given locations.
- Market research, age, income, purchasing habits, and education.

Types of Variables:

Numerical variables - values that are numbers: scores, percents, counts, time, price.
Categorical - values that are not numbers: yes/no, colors, brands, words, preferences, letter grades, birthdates, race, sex, gender.

Numerical Data

Numerical variables can be classified as

continuous - infinite number of possible values, subdivisions (real numbers)
- temperature
- weight
- price
discrete - only specific values (integers)
- birth year
- hours in the day

Categorical Data

Categorical variables can be classified as

ordinal - has some natural ordering.
- names can be ordered alphabetically
- Olympic medals can be ordered gold, silver, bronze
nominal - lacks natural ordering
- blood type
- hair color
- movie genre
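
Here is a small sketch of how these four types might look in pandas. The table `toy` and its column names are made up for this example (they are not from any class data set), and the ordering bronze < silver < gold is something I set by hand.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "weight_kg": [61.5, 72.3, 80.1],        # numerical, continuous
    "birth_year": [1999, 2001, 1987],       # numerical, discrete
    "blood_type": ["A", "O", "AB"],         # categorical, nominal
    "medal": ["silver", "gold", "bronze"],  # categorical, ordinal
})

# Tell pandas that medals have a natural order: bronze < silver < gold
toy["medal"] = pd.Categorical(toy["medal"],
                              categories=["bronze", "silver", "gold"],
                              ordered=True)

print(toy.dtypes)
print(toy["medal"].max())  # the "largest" medal under the ordering we chose
```

Notice that pandas only knows about the ordering because we told it. A nominal column like blood_type has no sensible max or min.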

Q Give your own example of each of the data types: Numerical (continuous and discrete) and Categorical (ordinal and nominal).

Explore some of these data types

Load the Data

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

file_location = 'https://joannabieri.com/introdatascience/data/loans_full_schema.csv'
DF = pd.read_csv(file_location)

DF

	emp_title	emp_length	state	homeownership	annual_income	verified_income	debt_to_income	annual_income_joint	verification_income_joint	debt_to_income_joint	...	sub_grade	issue_month	loan_status	initial_listing_status	disbursement_method	balance	paid_total	paid_principal	paid_interest	paid_late_fees
0	global config engineer	3.0	NJ	MORTGAGE	90000.0	Verified	18.01	NaN	NaN	NaN	...	C3	Mar-2018	Current	whole	Cash	27015.86	1999.33	984.14	1015.19	0.0
1	warehouse office clerk	10.0	HI	RENT	40000.0	Not Verified	5.04	NaN	NaN	NaN	...	C1	Feb-2018	Current	whole	Cash	4651.37	499.12	348.63	150.49	0.0
2	assembly	3.0	WI	RENT	40000.0	Source Verified	21.15	NaN	NaN	NaN	...	D1	Feb-2018	Current	fractional	Cash	1824.63	281.80	175.37	106.43	0.0
3	customer service	1.0	PA	RENT	30000.0	Not Verified	10.16	NaN	NaN	NaN	...	A3	Jan-2018	Current	whole	Cash	18853.26	3312.89	2746.74	566.15	0.0
4	security supervisor	10.0	CA	RENT	35000.0	Verified	57.96	57000.0	Verified	37.66	...	C3	Mar-2018	Current	whole	Cash	21430.15	2324.65	1569.85	754.80	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	owner	10.0	TX	RENT	108000.0	Source Verified	22.28	NaN	NaN	NaN	...	A4	Jan-2018	Current	whole	Cash	21586.34	2969.80	2413.66	556.14	0.0
9996	director	8.0	PA	MORTGAGE	121000.0	Verified	32.38	NaN	NaN	NaN	...	D3	Feb-2018	Current	whole	Cash	9147.44	1456.31	852.56	603.75	0.0
9997	toolmaker	10.0	CT	MORTGAGE	67000.0	Verified	45.26	107000.0	Source Verified	29.57	...	E2	Feb-2018	Current	fractional	Cash	27617.65	4620.80	2382.35	2238.45	0.0
9998	manager	1.0	WI	MORTGAGE	80000.0	Source Verified	11.99	NaN	NaN	NaN	...	A1	Feb-2018	Current	whole	Cash	21518.12	2873.31	2481.88	391.43	0.0
9999	operations analyst	3.0	CT	RENT	66000.0	Not Verified	20.82	NaN	NaN	NaN	...	B4	Feb-2018	Current	whole	Cash	11574.83	1658.56	1225.17	433.39	0.0

10000 rows × 55 columns

Q How many observations are there? Q How many variables are there?

HINT: You can use the code from last class where we looked at the output from DF.shape
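
As a reminder of how the hint works, here is a minimal sketch on a made-up mini table (the name `toy` and its values are just for illustration). The first number in the shape is the number of rows (observations) and the second is the number of columns (variables).

```python
import pandas as pd

# Made-up mini table: 4 rows (observations) and 3 columns (variables)
toy = pd.DataFrame({
    "loan_amount": [28000, 5000, 2000, 21600],
    "term": [60, 36, 36, 36],
    "grade": ["C", "C", "D", "A"],
})

# .shape is a pair: (number of rows, number of columns)
n_observations, n_variables = toy.shape
print("observations:", n_observations)
print("variables:", n_variables)
```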

Reduce the number of variables

Let’s look at a subset of the variables since there are so many!

Here I choose a few of the column names to focus on and store them in “my_variables”. Then the next command reduces the dataframe to just have those columns.

my_variables = ['loan_amount',
                'interest_rate',
                'term','grade',
                'state',
                'annual_income',
                'homeownership',
                'debt_to_income']

DF = DF[my_variables]

DF

	loan_amount	interest_rate	term	grade	state	annual_income	homeownership	debt_to_income
0	28000	14.07	60	C	NJ	90000.0	MORTGAGE	18.01
1	5000	12.61	36	C	HI	40000.0	RENT	5.04
2	2000	17.09	36	D	WI	40000.0	RENT	21.15
3	21600	6.72	36	A	PA	30000.0	RENT	10.16
4	23000	14.07	36	C	CA	35000.0	RENT	57.96
...	...	...	...	...	...	...	...	...
9995	24000	7.35	36	A	TX	108000.0	RENT	22.28
9996	10000	19.03	36	D	PA	121000.0	MORTGAGE	32.38
9997	30000	23.88	36	E	CT	67000.0	MORTGAGE	45.26
9998	24000	5.32	36	A	WI	80000.0	MORTGAGE	11.99
9999	12800	10.91	36	B	CT	66000.0	RENT	20.82

10000 rows × 8 columns

Q Check out each of the variables (columns):

What does each column tell you? What are the units?
Is the data numerical? If so is it continuous or discrete?
Is the data categorical? If so is it ordinal or nominal?

Here is a link to the full data description if you need to look up some of the column names.

Here are the answers to the above questions - please try for yourself first!

variable	type
loan_amount	numerical, continuous
interest_rate	numerical, continuous
term	numerical, discrete
grade	categorical, ordinal
state	categorical, nominal
annual_income	numerical, continuous
homeownership	categorical, nominal
debt_to_income	numerical, continuous
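
If you want a quick first guess from pandas, one possible approach is to split the columns by how they are stored. This is only a sketch on a made-up table called `toy`. It is a starting point and not a final answer, because a column stored as numbers (like a zip code) can still be categorical.

```python
import pandas as pd

# Made-up rows that imitate a few of the loan columns
toy = pd.DataFrame({
    "loan_amount": [28000, 5000, 2000],
    "interest_rate": [14.07, 12.61, 17.09],
    "grade": ["C", "C", "D"],
    "homeownership": ["MORTGAGE", "RENT", "RENT"],
})

# Columns stored as numbers vs. everything else
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```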

—————————–

Visualizing Numerical Data

Summary Statistics

With numerical data it is possible to quickly do summary statistics. We might want to know things like:

Measures of Center - mean, median
Measures of Spread - range (max-min), standard deviation, inter-quartile range (IQR=Q3-Q1)

DF.describe()

	loan_amount	interest_rate	term	annual_income	debt_to_income
count	10000.000000	10000.000000	10000.000000	1.000000e+04	9976.000000
mean	16361.922500	12.427524	43.272000	7.922215e+04	19.308192
std	10301.956759	5.001105	11.029877	6.473429e+04	15.004851
min	1000.000000	5.310000	36.000000	0.000000e+00	0.000000
25%	8000.000000	9.430000	36.000000	4.500000e+04	11.057500
50%	14500.000000	11.980000	36.000000	6.500000e+04	17.570000
75%	24000.000000	15.050000	60.000000	9.500000e+04	25.002500
max	40000.000000	30.940000	60.000000	2.300000e+06	469.090000

Notice that the .describe() operation automatically drops the categorical data since it does not make sense to do these statistics on that data.
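
To see where these numbers come from, here is a sketch that computes the same measures one at a time. The six loan amounts are made up so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Six made-up loan amounts, just to keep the arithmetic easy to check
amounts = pd.Series([1000, 5000, 8000, 14500, 24000, 40000])

# Measures of center
mean = amounts.mean()
median = amounts.median()

# Measures of spread
data_range = amounts.max() - amounts.min()
std = amounts.std()              # sample standard deviation (pandas default)
q1 = amounts.quantile(0.25)      # first quartile
q3 = amounts.quantile(0.75)      # third quartile
iqr = q3 - q1

print("mean:", mean, "median:", median)
print("range:", data_range, "std:", std, "IQR:", iqr)
```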

We are also interested in:

The Shape of the Data - right-skewed, left-skewed, symmetric, unimodal, bimodal, etc.
Observations of Outliers - use the IQR or look visually
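
One common way to use the IQR for outliers is the "1.5 times IQR" rule of thumb: anything more than 1.5 IQRs below Q1 or above Q3 gets flagged. That rule is a convention I am assuming here, not the only choice, and the numbers below are made up.

```python
import pandas as pd

# Made-up values with one suspiciously large number
values = pd.Series([2, 4, 4, 6, 8, 10, 12, 50])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside these is flagged as a possible outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("fences:", lower_fence, upper_fence)
print("possible outliers:", outliers.tolist())
```

A flagged value is only a candidate. You still have to decide whether it is a mistake or a real, unusual observation.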

Histograms

A histogram is a plot of the count of the number of data points within a range (bin). Imagine taking each data point and assigning it to a range.

We will start simple and just plot a histogram of loan amounts. Here is what this does:

Look at the data range (minimum loan amount = 1000 and maximum loan amount = 40000) then break this range up into bins (in this case 1000 dollar bins)
Now look at each observation. If the loan amount was between 1000-1990 it goes in bin 1, between 2000-2990 it goes in bin 2, and so on.
Count up the number of observations in each bin - this is the bar height.
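
The three steps above can be sketched by hand with numpy. The loan amounts here are made up, and plotly may place its bin edges a little differently than this sketch does.

```python
import numpy as np

# Made-up loan amounts
loan_amounts = np.array([1000, 1500, 2000, 2500, 2900, 5000])

# Step 1: break the range into 1000 dollar bins (edges 1000, 2000, ..., 6000)
bin_edges = np.arange(1000, 7000, 1000)

# Steps 2 and 3: assign each observation to a bin and count
counts, edges = np.histogram(loan_amounts, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Each count is the height of one bar in the histogram.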

Here is a simple histogram

fig = px.histogram(DF,
                   x='loan_amount')
fig.show()

Now to make this nicer let's do a few things:

Change the color just for fun (color_discrete_sequence)
Add a gap between the bars (bargap)
Give it a title that we want centered (title_x=0.5)
Add some gaps on the left and right hand side (xaxis={'range':[0, 42000]})

fig = px.histogram(DF,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[0, 42000]})
fig.show()

In this case we let plotly decide what the number of bins (or bars) should be. Most of the time you will want to be in charge of this!

I added the nbins=10 command

fig = px.histogram(DF,
                   nbins=10,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()

Q Try changing the number of bins (nbins). What do you notice? Are there good choices? Bad choices?

Sometimes we want to specify the number of bins and sometimes we want to specify the bin width. These two things are related! We start by calculating the range:

range = max - min

then

bin width = (range) / (number of bins)

number of bins = (range) / (bin width)

Here is how to make Python do this work for you:

# First get the range!
max_val = max(DF['loan_amount'])
min_val = min(DF['loan_amount'])
data_range = max_val - min_val

# Say we know the width we want
bin_width = 1000
# Calculate the number of bins
nbins = data_range/bin_width
print(nbins)

39.0

Then I can use nbins=39 to make the plot with the width I want. If this value is a decimal, I would round it up or down (my choice), since the number of bins should be a whole number.

fig = px.histogram(DF,
                   nbins=39,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()

Customizing Histograms

There are lots of things we might want to do to make our histograms look more fancy. This might include

Better labels on the axes
Breaking up the columns to show a categorical variable
Separating the data by a categorical variable

Better labels

I can rename the axes just by taking their default names and changing them.

I relabel the x-axis with Loan Amount ($)
I relabel the y-axis with Frequency

fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'],
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency")
fig.show()

Adding a categorical fill

Here we can overlay a categorical value on top of our numerical histogram. For example maybe we have the question: What is the break down of loan amounts among people who own, rent or have a mortgage? We can add this information to our histogram.

Now I add color='homeownership' - this colors the areas of the bar based on the categorical data found in the homeownership column.
I also add a title for the legend legend_title="Homeownership"
And just for fun I adjusted the opacity=0.5 - to make the bars a bit see-through

fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership")
fig.show()

If I want to specify my own colors I can add a discrete color map, where I tell it what color to make each of the categories.

color_discrete_map={'MORTGAGE': 'darkviolet', 
                    'RENT': 'deeppink', 
                    'OWN': 'darkturquoise'}

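You have to spell things EXACTLY right! A misspelled column name will make Python yell at you, and a misspelled category name in the color map is quietly ignored, so that category just gets a default color.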

# With different colors
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'}
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership")
fig.show()
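
Since spelling matters so much, it can help to compare the keys of your color map against the categories that are actually in the data. Here is a sketch with a made-up table `toy` and a deliberately misspelled key.

```python
import pandas as pd

# Made-up column with the three categories
toy = pd.DataFrame({"homeownership": ["MORTGAGE", "RENT", "OWN", "RENT"]})

color_map = {"MORTGAGE": "darkviolet",
             "RENT": "deeppink",
             "Own": "darkturquoise"}   # oops: should be 'OWN'

categories = set(toy["homeownership"].unique())

# Categories in the data that have no color assigned
missing_colors = categories - set(color_map)
# Keys in the map that match nothing in the data
unused_keys = set(color_map) - categories

print("missing colors for:", missing_colors)
print("unused keys:", unused_keys)
```

If both sets are empty, your color map lines up with the data.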

Breaking histogram into Facets

Facets let you break apart the categorical data so that you can look at the histograms for each individual category.

Now I have to add quite a few things to the code!

facet_col='homeownership' gives the categorical data used in each separate histogram
facet_col_wrap=1 means I want one graph on each line

Now there are a bunch of weird lines that just make the labels look good - this starts to get overwhelming but here goes…

Take out extra text from the title of each plot

fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

Add the word frequency on each plot

  fig.update_yaxes(title_text='Frequency', col=1)
  fig.update_yaxes(title_text='Frequency', col=2)
  fig.update_yaxes(title_text='Frequency', col=3)

Then I want to make the graph bigger so we can actually see all the information so I select a size using the commands:
```
    autosize=False,
    width=800,
    height=500
```

fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   facet_col='homeownership',
                   facet_col_wrap=1,
                   color='homeownership',
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'}
                   )

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.update_yaxes(title_text='Frequency', col=1)
fig.update_yaxes(title_text='Frequency', col=2)
fig.update_yaxes(title_text='Frequency', col=3)

fig.update_layout(bargap=0.02,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=500)


fig.show()
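
The label clean-up line in the code above looks scary, but the core of it is ordinary string splitting. By default each facet title looks like "homeownership=RENT", and the lambda keeps only the part after the equals sign. Here is that one step on its own (the variable names are just for illustration):

```python
# A facet title as plotly writes it by default
label = "homeownership=RENT"

# Split the text at the "=" sign into a list of pieces
parts = label.split("=")

# [-1] grabs the last piece, which is the category name
short_label = parts[-1]

print(parts)
print(short_label)
```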

It’s easy to get overwhelmed - here is a pep talk!

You can do this!

You really can.

Take a deep breath and look back at all the crazy code above. In most of the cells I have just been adding a few new things. I like to SLOWLY build up my graphs. Here is how I start:

fig = px.histogram(DF,
                   x='loan_amount')
fig.show()

Then I notice that I would like some labels

fig = px.histogram(DF,
                   x='loan_amount')

fig.update_layout(title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title='Frequency')
fig.show()

and then maybe I change the colors….

You can do this!

You really can.

Q Create a histogram of your own! Try making a histogram of one of the other pieces of numerical data. Make it as fancy as you want. Include some categorical information. Do you learn anything from your graph? If so what?

Adding Distribution Information to Histogram

A distribution (probability distribution) tells you what the most common (likely) values of a variable are.

The histogram gets at this information. The taller the bar the more of the data that fell into that bin, so if you were randomly going to grab one of the data points, it would more likely come from a “tall” bin.
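
To make the "randomly grab a data point" idea concrete, here is a sketch that turns bin counts into proportions. The loan amounts and the three wide bins are made up for illustration.

```python
import numpy as np

# Made-up loan amounts
loan_amounts = np.array([1000, 1500, 2000, 2500, 2900, 5000, 5500, 9000])

# Three bins: 0-3000, 3000-6000, 6000-9000
counts, edges = np.histogram(loan_amounts, bins=[0, 3000, 6000, 9000])

# Divide by the total to get the fraction of the data in each bin
proportions = counts / counts.sum()

print("counts:", counts)
print("proportions:", proportions)
```

The tallest bin holds the largest fraction of the data, so a randomly chosen point is most likely to come from there.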

The marginal argument adds a small extra plot along the margin that can help you understand the underlying data (distribution). Two useful options are:

box: Adds a box plot along the margin, summarizing the distribution. Notice if you hover over this plot it will give you the summary statistics.
violin: Adds a violin plot along the margin, allowing you to visualize the distribution as a curve.

fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                   marginal="box"
                   )

fig.update_layout(bargap=0.0,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[-1000, 46000]},
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=600)
fig.show()

Q Change the above plot to marginal="violin" and see what changes.

Distribution Information

If you just want to consider the distribution of the data, you can plot the violin or box plots separately.

fig = px.violin(DF,
                x='loan_amount', 
                facet_col='homeownership',
                facet_col_wrap=1,
                color='homeownership',
                color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                width=800,
                height=600)

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()

fig = px.box(DF,
                x='loan_amount', 
                facet_col='homeownership',
                facet_col_wrap=1,
                color='homeownership',
                color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                width=800,
                height=600)

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()
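
A box plot is a picture of a handful of summary numbers (min, quartiles, median, max) for each group. If you want those numbers in a table, one way is to group the data first and then describe it. This is a sketch on made-up data, and the quartiles plotly shows when you hover may differ slightly because there is more than one way to compute them.

```python
import pandas as pd

# Made-up loans for two homeownership groups
toy = pd.DataFrame({
    "homeownership": ["RENT", "RENT", "RENT", "OWN", "OWN", "OWN"],
    "loan_amount": [2000, 5000, 8000, 10000, 20000, 30000],
})

# One row of summary statistics per category
summary = toy.groupby("homeownership")["loan_amount"].describe()

print(summary[["min", "25%", "50%", "75%", "max"]])
```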

————————————

Scatter Plots

We already saw scatter plots, but here is a refresher.

Scatter plots are a great way to compare relationships between numerical values. This helps us get at the idea of how one variable might depend on the other - how they change in relation to one another.

Remember that you can change the colors and markers as much as you want.

fig = px.scatter(DF,
                 x='debt_to_income',
                 y='interest_rate',
                color_discrete_sequence=['lightseagreen'])

fig.update_layout(title='Debt to Income Ratio vs. Interest Rate',
                  title_x=0.5,
                  xaxis_title="Debt to Income Ratio",
                  yaxis_title="Interest Rate",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

Heat Maps

Another way to compare two variables is through a heat map. A heat map combines the idea of a scatter plot (you have x and y locations) and a histogram (you count how many points are within a rectangle).

You don’t have to specify the number of bins in the x and y location. Just leave this out if you want plotly to pick for you.

fig = px.density_heatmap(DF,
                 x='debt_to_income',
                 y='interest_rate',
                 nbinsx=100,
                 nbinsy=50
                 )

fig.update_layout(title='Debt to Income Ratio vs. Interest Rate',
                  title_x=0.5,
                  xaxis_title="Debt to Income Ratio",
                  yaxis_title="Interest Rate",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
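
The counting behind a heat map can be sketched with numpy. Here I use five made-up points and just two bins in each direction, so there are four rectangles to count in.

```python
import numpy as np

# Five made-up (x, y) points
debt_to_income = np.array([1, 2, 2, 8, 9])
interest_rate = np.array([1, 1, 2, 8, 9])

# Two bins in each direction: 0-5 and 5-10
edges = [0, 5, 10]

# counts[i, j] = number of points in x-bin i and y-bin j
counts, x_edges, y_edges = np.histogram2d(debt_to_income,
                                          interest_rate,
                                          bins=[edges, edges])
print(counts)
```

Each number in the grid becomes the color of one rectangle in the heat map.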

—————————-

Visualizing Categorical Data

For categorical data we have to take a different approach. First of all we can’t do simple summary statistics (unless we first convert our categorical data meaningfully into numerical data).
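
Here is a sketch of what "converting meaningfully" could look like for an ordinal variable such as a letter grade. The numbering A=1, B=2, and so on is my assumption for this example. It treats the gaps between grades as equal, which is a choice you would need to justify for real data.

```python
import pandas as pd

# Made-up letter grades
grades = pd.Series(["C", "A", "D", "B", "A"])

# Tell pandas the natural order of the categories
grade_order = ["A", "B", "C", "D"]
ordered = pd.Categorical(grades, categories=grade_order, ordered=True)

# .codes gives 0 for A, 1 for B, ...; add 1 so that A=1, B=2, ...
grade_number = pd.Series(ordered.codes) + 1

print(grade_number.tolist())
print("median grade number:", grade_number.median())
```

This only makes sense for ordinal data. Numbering a nominal variable like state would give numbers with no meaning.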

Bar Plot

A bar plot puts counts of the categorical data into a graph. It is similar to a histogram, except we use the categories as bins. You can also make bar plots to compare categorical and numerical data.

Let’s make a bar plot of the number of people in our data set who rent, own, or have a mortgage.

First let’s ask Python to count up the number of items in each category. Luckily Pandas can do this automatically. We just have to ask our data to look at the column “homeownership” and then count up the different values. This is just good data to have.

counts = DF['homeownership'].value_counts()
counts

homeownership
MORTGAGE    4789
RENT        3858
OWN         1353
Name: count, dtype: int64
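
If you would rather see fractions than raw counts, value_counts can do that too with normalize=True. Here is a sketch on a made-up column of eight people:

```python
import pandas as pd

# Made-up column: 4 mortgages, 3 renters, 1 owner
toy = pd.DataFrame({"homeownership": ["MORTGAGE", "RENT", "MORTGAGE", "OWN",
                                      "RENT", "MORTGAGE", "MORTGAGE", "RENT"]})

counts = toy["homeownership"].value_counts()
proportions = toy["homeownership"].value_counts(normalize=True)

print(counts)
print(proportions)
```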

Now we can do a bar plot where we put the homeownership information on the x-axis.

I had to add a command:

fig.update_traces(dict(marker_line_width=0))

to make the bars a nice solid color.

fig = px.bar(DF,
            x='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()

Look what happens if we change our data to the y-axis.

fig = px.bar(DF,
            y='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()

You can choose what direction you want your bar plots to go!

Customizing Bar Plots

Similar to histograms we can add all sorts of extra information to our bar plots!

Better labels

Q Can you figure out how to add x labels, y labels, and a title to this graph?

Here is the answer to the above question - please try for yourself first!

fig = px.bar(DF,
            x='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()

Q Try to make your own bar plot of one of the other categorical columns.

Adding categorical fill

You will notice that the commands are really similar to what we did above with histograms. Here we choose a column to identify the color, instead of giving a single color.

fig = px.bar(DF,
            x='homeownership',
            color='grade')
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()
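
The numbers behind a stacked bar plot like this are a two-way table of counts. One way to see that table directly is pd.crosstab. This is a sketch on made-up data:

```python
import pandas as pd

# Made-up homeownership and grade values
toy = pd.DataFrame({
    "homeownership": ["RENT", "RENT", "OWN", "MORTGAGE", "MORTGAGE", "RENT"],
    "grade": ["A", "B", "A", "A", "A", "B"],
})

# Rows are homeownership categories, columns are grades, entries are counts
table = pd.crosstab(toy["homeownership"], toy["grade"])
print(table)
```

Each row of the table is one bar, and each entry in that row is one colored piece of the bar.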

Breaking bar graph into Facets

fig = px.bar(DF,
            x='homeownership',
            facet_col='grade',
             facet_col_wrap=2,
            color='grade')
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()

There are so many types of graphs!!!

Examples of what plotly can do!

This is just the start of what you can do using Python to create graphs.

Be patient, it takes time to build up a good graph. Start simple and slowly build up adding one thing at a time.

There are even fancier graphing modules that you can try out, but for now, let’s just get good at plotly.