Introduction to Data Science

Numerical vs. Categorical Data

Author

Joanna Bieri
DATA101

Important Information

Announcements

Please come to office hours to get help!

Remember to come to lab! Wednesday 6-8pm Duke 206.

Day 4 Assignment

  1. Make sure to Pull any new content from the class repo - then Copy it over into your working directory.
  2. Open the file Day4-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  3. Get as far as you can before class.
  4. Submit what you have so far: Commit and Push to Git.
  5. Take the daily check in quiz on Canvas.
  6. Come to class with lots of questions!

—————————–

Data Types:

Already we have defined:

  • Structured Data
  • Unstructured Data
  • Semi-structured Data

Q Write down a definition of these in your own words and give an example.

These definitions have to do with how the data is presented to you (how it is organized or accessed) but there are even more ways to think about data types.

Number of Variables and Fancy Words:

The number of variables involved impacts the analysis.

  • Univariate - single variable
    • Salaries of employees at a company
    • Number of patients treated each day
  • Bivariate - two variables
    • Temperature vs ice cream sales
    • Age vs blood pressure
    • Study time vs test score
  • Multivariate - many variables
    • Health study about diet, exercise, blood pressure, and disease.
    • Temperature, humidity, rainfall, and wind speed over time for given locations.
    • Market research about age, income, purchasing habits, and education.

Types of Variables:

  • Numerical variables - values that are numbers: scores, percents, counts, time, price.
  • Categorical - values that are not numbers: yes/no, colors, brands, words, preferences, letter grades, birthdates, race, sex, gender.

Numerical Data

Numerical variables can be classified as

  • continuous - can take any value within a range, so there are infinitely many possible values (real numbers)
    • temperature
    • weight
    • price
  • discrete - only specific values (integers)
    • birth year
    • hours in the day

Categorical Data

Categorical variables can be classified as

  • ordinal - has some natural ordering.
    • names can be ordered alphabetically
    • Olympic medals can be ordered gold, silver, bronze
  • nominal - lacks natural ordering
    • blood type
    • hair color
    • movie genre
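
Here is a minimal sketch of how the ordinal vs. nominal difference can show up in code. It assumes pandas is available, and the medal and blood type values are made up for illustration (they are not part of any data set used in these notes).

```python
import pandas as pd

# Made-up example values, for illustration only
medals = pd.Series(
    pd.Categorical(
        ["silver", "gold", "bronze", "gold"],
        categories=["bronze", "silver", "gold"],  # listed from lowest to highest
        ordered=True,                             # ordinal: the order matters
    )
)
blood_types = pd.Series(pd.Categorical(["A", "O", "AB", "O"]))  # nominal: no order

# Ordinal data can be ranked and sorted by that rank
best_medal = medals.max()
sorted_medals = medals.sort_values().tolist()
print(best_medal)
print(sorted_medals)

# Nominal data can be counted, but ranking it has no meaning
print(blood_types.value_counts())
```

Marking a column as ordered is a choice you make about the data. pandas cannot know on its own that gold outranks silver.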

Q Give your own example of each of the data types: Numerical (continuous and discrete) and Categorical (ordinal and nominal).

Explore some of these data types

Load the Data

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
file_location = 'https://joannabieri.com/introdatascience/data/loans_full_schema.csv'
DF = pd.read_csv(file_location)
DF
emp_title emp_length state homeownership annual_income verified_income debt_to_income annual_income_joint verification_income_joint debt_to_income_joint ... sub_grade issue_month loan_status initial_listing_status disbursement_method balance paid_total paid_principal paid_interest paid_late_fees
0 global config engineer 3.0 NJ MORTGAGE 90000.0 Verified 18.01 NaN NaN NaN ... C3 Mar-2018 Current whole Cash 27015.86 1999.33 984.14 1015.19 0.0
1 warehouse office clerk 10.0 HI RENT 40000.0 Not Verified 5.04 NaN NaN NaN ... C1 Feb-2018 Current whole Cash 4651.37 499.12 348.63 150.49 0.0
2 assembly 3.0 WI RENT 40000.0 Source Verified 21.15 NaN NaN NaN ... D1 Feb-2018 Current fractional Cash 1824.63 281.80 175.37 106.43 0.0
3 customer service 1.0 PA RENT 30000.0 Not Verified 10.16 NaN NaN NaN ... A3 Jan-2018 Current whole Cash 18853.26 3312.89 2746.74 566.15 0.0
4 security supervisor 10.0 CA RENT 35000.0 Verified 57.96 57000.0 Verified 37.66 ... C3 Mar-2018 Current whole Cash 21430.15 2324.65 1569.85 754.80 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 owner 10.0 TX RENT 108000.0 Source Verified 22.28 NaN NaN NaN ... A4 Jan-2018 Current whole Cash 21586.34 2969.80 2413.66 556.14 0.0
9996 director 8.0 PA MORTGAGE 121000.0 Verified 32.38 NaN NaN NaN ... D3 Feb-2018 Current whole Cash 9147.44 1456.31 852.56 603.75 0.0
9997 toolmaker 10.0 CT MORTGAGE 67000.0 Verified 45.26 107000.0 Source Verified 29.57 ... E2 Feb-2018 Current fractional Cash 27617.65 4620.80 2382.35 2238.45 0.0
9998 manager 1.0 WI MORTGAGE 80000.0 Source Verified 11.99 NaN NaN NaN ... A1 Feb-2018 Current whole Cash 21518.12 2873.31 2481.88 391.43 0.0
9999 operations analyst 3.0 CT RENT 66000.0 Not Verified 20.82 NaN NaN NaN ... B4 Feb-2018 Current whole Cash 11574.83 1658.56 1225.17 433.39 0.0

10000 rows × 55 columns

Q How many observations are there?

Q How many variables are there?

HINT: You can use the code from last class where we looked at the output from DF.shape
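
As a small sketch of how to read .shape, here is a tiny stand-in table (only three rows and two columns copied from the output above, so the numbers are NOT the answer to the question). The name toy is just a placeholder.

```python
import pandas as pd

# Tiny stand-in table, not the full loans data
toy = pd.DataFrame({
    "loan_amount": [28000, 5000, 2000],
    "state": ["NJ", "HI", "WI"],
})

# .shape is a pair: (number of rows, number of columns)
n_observations, n_variables = toy.shape
print(n_observations, "observations and", n_variables, "variables")
```

Each row is one observation and each column is one variable.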

Reduce the number of variables

Let’s look at a subset of the variables since there are so many!

Here I choose a few of the column names to focus on and store them in “my_variables”. Then the next command reduces the dataframe to just have those columns.

my_variables = ['loan_amount',
                'interest_rate',
                'term','grade',
                'state',
                'annual_income',
                'homeownership',
                'debt_to_income']

DF = DF[my_variables]

DF
loan_amount interest_rate term grade state annual_income homeownership debt_to_income
0 28000 14.07 60 C NJ 90000.0 MORTGAGE 18.01
1 5000 12.61 36 C HI 40000.0 RENT 5.04
2 2000 17.09 36 D WI 40000.0 RENT 21.15
3 21600 6.72 36 A PA 30000.0 RENT 10.16
4 23000 14.07 36 C CA 35000.0 RENT 57.96
... ... ... ... ... ... ... ... ...
9995 24000 7.35 36 A TX 108000.0 RENT 22.28
9996 10000 19.03 36 D PA 121000.0 MORTGAGE 32.38
9997 30000 23.88 36 E CT 67000.0 MORTGAGE 45.26
9998 24000 5.32 36 A WI 80000.0 MORTGAGE 11.99
9999 12800 10.91 36 B CT 66000.0 RENT 20.82

10000 rows × 8 columns

Q Check out each of the variables (columns):

  1. What does each column tell you? What are the units?
  2. Is the data numerical? If so is it continuous or discrete?
  3. Is the data categorical? If so is it ordinal or nominal?

Here is a link to the full data description if you need to look up some of the column names.


variable         type
loan_amount      numerical, continuous
interest_rate    numerical, continuous
term             numerical, discrete
grade            categorical, ordinal
state            categorical, nominal
annual_income    numerical, continuous
homeownership    categorical, nominal
debt_to_income   numerical, continuous
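
One way to get a first guess at this table is to ask pandas which columns are stored as numbers. This is only a sketch on a small stand-in table (the first three rows of four columns from above, stored under the placeholder name toy), and the storage type is just a hint: a number-coded category would still be categorical.

```python
import pandas as pd

# First three rows of four of the columns shown above
toy = pd.DataFrame({
    "loan_amount": [28000, 5000, 2000],
    "interest_rate": [14.07, 12.61, 17.09],
    "grade": ["C", "C", "D"],
    "homeownership": ["MORTGAGE", "RENT", "RENT"],
})

# Columns stored as numbers vs. everything else
numerical_columns = toy.select_dtypes(include="number").columns.tolist()
other_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_columns)
print(other_columns)
```

You still have to decide continuous vs. discrete and ordinal vs. nominal yourself by thinking about what each column means.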

—————————–

Visualizing Numerical Data

Summary Statistics

With numerical data it is possible to quickly do summary statistics. We might want to know things like:

  • Measures of Center - mean, median
  • Measures of Spread - range (max-min), standard deviation, inter-quartile range (IQR=Q3-Q1)
DF.describe()
loan_amount interest_rate term annual_income debt_to_income
count 10000.000000 10000.000000 10000.000000 1.000000e+04 9976.000000
mean 16361.922500 12.427524 43.272000 7.922215e+04 19.308192
std 10301.956759 5.001105 11.029877 6.473429e+04 15.004851
min 1000.000000 5.310000 36.000000 0.000000e+00 0.000000
25% 8000.000000 9.430000 36.000000 4.500000e+04 11.057500
50% 14500.000000 11.980000 36.000000 6.500000e+04 17.570000
75% 24000.000000 15.050000 60.000000 9.500000e+04 25.002500
max 40000.000000 30.940000 60.000000 2.300000e+06 469.090000

Notice that the .describe() operation automatically drops the categorical data since it does not make sense to do these statistics on that data.
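
To see where numbers like these come from, here is a minimal sketch that computes each measure one at a time. The five loan amounts are made up for illustration, not taken from the data.

```python
import pandas as pd

# Made-up loan amounts, for illustration only
amounts = pd.Series([1000, 2000, 3000, 4000, 10000])

# Measures of center
mean = amounts.mean()
median = amounts.median()

# Measures of spread
data_range = amounts.max() - amounts.min()
std = amounts.std()            # sample standard deviation
q1 = amounts.quantile(0.25)    # first quartile (the 25% row)
q3 = amounts.quantile(0.75)    # third quartile (the 75% row)
iqr = q3 - q1

print(mean, median, data_range, iqr)
```

In this made-up example the single large value pulls the mean above the median, which is a hint that the data is right-skewed.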

We are also interested in:

  • The Shape of the Data - right-skewed, left-skewed, symmetric, unimodal, bimodal, etc.
  • Observations of Outliers - use IQR or look visually
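
Here is a sketch of one common way to use the IQR to flag outliers. The 1.5 × IQR cutoff is an assumption on my part (it is a widely used convention, but these notes do not fix a specific rule), and the values are made up.

```python
import pandas as pd

# Made-up values, for illustration only
values = pd.Series([5.0, 11.0, 17.0, 18.0, 25.0, 469.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

A flagged value is not automatically wrong. It is just a point worth a closer look.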

Histograms

A histogram plots the number of data points that fall within each range (bin). Imagine taking each data point and assigning it to a range.

We will start simple and just plot a histogram of loan amounts. Here is what this does:

  • Look at the data range (minimum loan amount = 1000 and maximum loan amount = 40000) then break this range up into bins (in this case 1000 dollar bins)
  • Now look at each observation. If the loan amount is at least 1000 but less than 2000 it goes in bin 1, at least 2000 but less than 3000 it goes in bin 2, and so on.
  • Count up the number of observations in each bin - this is the bar height.
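
The steps above can be sketched by hand with NumPy. This is not what plotly runs internally, just the same counting idea on a few made-up loan amounts.

```python
import numpy as np

# Made-up loan amounts, for illustration only
loan_amounts = np.array([1000, 1500, 2000, 2500, 2999, 3000, 3999])

# Bin edges every 1000 dollars: [1000, 2000), [2000, 3000), [3000, 4000]
bin_edges = np.arange(1000, 5000, 1000)

# counts[i] is the number of observations in bin i - the bar height
counts, edges = np.histogram(loan_amounts, bins=bin_edges)
print(edges)
print(counts)
```

Each bin includes its left edge but not its right edge, except the last bin, which includes both.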

Here is a simple histogram

fig = px.histogram(DF,
                   x='loan_amount')
fig.show()

Now to make this nicer let’s do a few things:

  • Change the color just for fun (color_discrete_sequence)
  • Add a gap between the bars (bargap)
  • Give it a title that we want centered (title_x=0.5)
  • Add some gaps on the left and right hand side (xaxis={'range':[0, 42000]})
fig = px.histogram(DF,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[0, 42000]})
fig.show()

In this case we let plotly decide what the number of bins (or bars) should be. Most of the time you will want to be in charge of this!

  • I added the nbins=10 argument
fig = px.histogram(DF,
                   nbins=10,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()

Q Try changing the number of bins (nbins). What do you notice? Are there good choices? Bad choices?

Sometimes we want to specify the number of bins and sometimes we want to specify the bin width. These two things are related! We start by calculating the range:

range = max - min

then

bin width = (range) / (number of bins)

or

number of bins = (range) / (bin width)

Here is how to make Python do this work for you:

# First get the range!
max_val = DF['loan_amount'].max()
min_val = DF['loan_amount'].min()
data_range = max_val - min_val

# Say we know the width we want
bin_width = 1000
# Calculate the number of bins
nbins = data_range/bin_width
print(nbins)
39.0

Then I can use nbins=39 to make the plot with the width I want. If this value is a decimal I would round up or down (my choice)

fig = px.histogram(DF,
                   nbins=39,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()
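
When the division does not come out even, the round up or down choice can be sketched like this. The bin width of 2500 is a hypothetical value picked so that the result is a decimal, and the min and max are copied from the summary of loan amounts.

```python
import math

# Loan amount range (minimum 1000, maximum 40000)
min_val, max_val = 1000, 40000
bin_width = 2500  # hypothetical width that does not divide the range evenly

exact_bins = (max_val - min_val) / bin_width  # a decimal number of bins

nbins_up = math.ceil(exact_bins)     # round up: bins a bit narrower than requested
nbins_down = math.floor(exact_bins)  # round down: bins a bit wider than requested

print(exact_bins, nbins_up, nbins_down)
```

Both math.ceil and math.floor return whole numbers, which is what nbins expects. Keep in mind that plotly may adjust the final bins to get round edges, so treat nbins as a request rather than a guarantee.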

Customizing Histograms

There are lots of things we might want to do to make our histograms look more fancy. This might include:

  • Better labels on the axes
  • Breaking up the bars to show a categorical variable
  • Separating the data by a categorical variable

Better labels

I can rename the axes just by replacing their default names.

  • I relabel the x-axis with Loan Amount ($)
  • I relabel the y-axis with Frequency
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'],
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency")
fig.show()

Adding a categorical fill

Here we can overlay a categorical value on top of our numerical histogram. For example maybe we have the question: What is the breakdown of loan amounts among people who own, rent, or have a mortgage? We can add this information to our histogram.

  • Now I add color='homeownership' - this colors the areas of the bar based on the categorical data found in the homeownership column.
  • I also add a title for the legend legend_title="Homeownership"
  • And just for fun I adjusted the opacity=0.5 - to make the bars a bit see through
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership")
fig.show()

If I want to specify my own colors I can add a discrete color map, where I tell it what color to make each of the categories.

color_discrete_map={'MORTGAGE': 'darkviolet', 
                    'RENT': 'deeppink', 
                    'OWN': 'darkturquoise'}

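You have to spell the category names and the color names EXACTLY right! A misspelled color name makes Python yell at you, and a misspelled category name is quietly ignored, so that category falls back to a default color.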

# With different colors
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'}
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership")
fig.show()

Breaking histogram into Facets

Facets let you break apart the categorical data so that you can look at the histograms for each individual category.

Now I have to add quite a few things to the code!

  • facet_col='homeownership' gives the categorical data used in each separate histogram

  • facet_col_wrap=1 means I want one graph on each line

  • Now there are a bunch of weird lines that just make the labels look good - this starts to get overwhelming but here goes…

    • Take out extra text from the title of each plot

      fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
    • Add the word Frequency to the y-axis of each plot (calling this without row or col applies it to every facet)

        fig.update_yaxes(title_text='Frequency')
  • Then I want to make the graph bigger so we can actually see all the information so I select a size using the commands:

        autosize=False,
        width=800,
        height=500
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   facet_col='homeownership',
                   facet_col_wrap=1,
                   color='homeownership',
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'}
                   )

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

# With facet_col_wrap=1 all facets sit in one column, so one call labels every y-axis
fig.update_yaxes(title_text='Frequency')

fig.update_layout(bargap=0.02,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=500)


fig.show()

You can do this!

You really can.

Take a deep breath and look back at all the crazy code above. In most of the cells I have just been adding a few new things. I like to SLOWLY build up my graphs. Here is how I start:


fig = px.histogram(DF,
                   x='loan_amount')
fig.show()

Then I notice that I would like some labels


fig = px.histogram(DF,
                   x='loan_amount')

fig.update_layout(title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis_title="Loan Amount ($)",
                  yaxis_title='Frequency')
fig.show()

and then maybe I change the colors….

You can do this!

You really can.

Q Create a histogram of your own! Try making a histogram of one of the other pieces of numerical data. Make it as fancy as you want. Include some categorical information. Do you learn anything from your graph? If so what?

Adding Distribution Information to Histogram

A distribution (probability distribution) tells you which values of a variable are most common (likely).

The histogram gets at this information. The taller the bar, the more of the data fell into that bin, so if you were going to randomly grab one of the data points, it would more likely come from a “tall” bin.
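
To make the “tall bin” idea concrete, here is a sketch that turns bar heights into proportions by dividing each count by the total. The loan amounts and bin edges are made up for illustration.

```python
import numpy as np

# Made-up loan amounts, for illustration only
loan_amounts = np.array([1000, 1500, 2000, 2200, 2500,
                         2700, 2999, 3000, 3500, 3999])
bin_edges = [1000, 2000, 3000, 4000]

# Bar heights: how many observations fall in each bin
counts, _ = np.histogram(loan_amounts, bins=bin_edges)

# Chance that one randomly grabbed data point comes from each bin
proportions = counts / counts.sum()
print(counts)
print(proportions)
```

The proportions always add up to 1. In plotly, px.histogram accepts histnorm='probability' if you want the bars drawn on this scale instead of as raw counts.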

The marginal argument adds information to the plot that can help you understand the underlying data (distribution).

  • box: Adds a box plot along the margin, summarizing the distribution. Notice if you hover over this plot it will give you the summary statistics.
  • violin: Adds a violin plot along the margin, allowing you to visualize the distribution as a curve.
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                   marginal="box"
                   )

fig.update_layout(bargap=0.0,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[-1000, 46000]},
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=600)
fig.show()

Q Change the above plot to marginal="violin" and see what changes.

Distribution Information

If you just want to consider the distribution of the data, you can plot the violin or box plots separately.

fig = px.violin(DF,
                x='loan_amount', 
                facet_col='homeownership',
                facet_col_wrap=1,
                color='homeownership',
                color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                width=800,
                height=600)

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()
fig = px.box(DF,
                x='loan_amount', 
                facet_col='homeownership',
                facet_col_wrap=1,
                color='homeownership',
                color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                width=800,
                height=600)

# This just makes the labels look a bit nicer
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()

————————————

Scatter Plots

We already saw scatter plots, but here is a refresher.

Scatter plots are a great way to compare relationships between numerical values. This helps us get at the idea of how one variable might depend on the other - how they change in relation to one another.

Remember that you can change the colors and markers as much as you want.

fig = px.scatter(DF,
                 x='debt_to_income',
                 y='interest_rate',
                color_discrete_sequence=['lightseagreen'])

fig.update_layout(title='Debt to Income Ratio vs. Interest Rate',
                  title_x=0.5,
                  xaxis_title="Debt to Income Ratio",
                  yaxis_title="Interest Rate",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

Heat Maps

Another way to compare two variables is through a heat map. A heat map combines the idea of a scatter plot (you have x and y locations) and a histogram (you count how many points are within a rectangle).

You don’t have to specify the number of bins in the x and y directions. Just leave this out if you want plotly to pick for you.

fig = px.density_heatmap(DF,
                 x='debt_to_income',
                 y='interest_rate',
                 nbinsx=100,
                 nbinsy=50
                 )

fig.update_layout(title='Debt to Income Ratio vs. Interest Rate',
                  title_x=0.5,
                  xaxis_title="Debt to Income Ratio",
                  yaxis_title="Interest Rate",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
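
The counting behind a heat map can be sketched with NumPy. This is not plotly's own code, just the same idea of counting points per rectangle. The six points and the bin edges are made up for illustration.

```python
import numpy as np

# Made-up (debt_to_income, interest_rate) pairs, for illustration only
debt_to_income = np.array([5.0, 8.0, 12.0, 18.0, 22.0, 28.0])
interest_rate = np.array([6.0, 7.0, 12.0, 14.0, 13.0, 19.0])

# Rectangle edges in the x and y directions
x_edges = [0, 10, 20, 30]
y_edges = [5, 10, 15, 20]

# counts[i][j] = number of points in x bin i and y bin j
counts, _, _ = np.histogram2d(debt_to_income, interest_rate,
                              bins=[x_edges, y_edges])
print(counts)
```

Each number in the grid becomes the color of one rectangle in the heat map.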

—————————-

Visualizing Categorical Data

For categorical data we have to take a different approach. First of all we can’t do simple summary statistics (unless we first convert our categorical data meaningfully into numerical data).
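
As a sketch of what converting categorical data into numbers can look like, here the loan grades are mapped to scores. The scoring (A = 1 up to G = 7) is a hypothetical choice of mine, and it only makes sense because grade is ordinal. The five grades are the first five shown in the table earlier.

```python
import pandas as pd

# First five grades from the table above
grades = pd.Series(["C", "C", "D", "A", "C"])

# Hypothetical scoring: A is best (1), G is worst (7)
grade_to_score = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
scores = grades.map(grade_to_score)

# Now a summary statistic is possible
median_score = scores.median()
print(scores.tolist())
print(median_score)
```

The median is a safer summary than the mean here, because the mapping assumes the steps between grades are equal, and that may not be true.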

Bar Plot

A bar plot puts counts of the categorical data into a graph. It is similar to a histogram, except we use the categories as bins. You can also make bar plots to compare categorical and numerical data.

Let’s make a bar plot of the number of people in our data set who rent, own, or have a mortgage.

First let’s ask Python to count up the number of items in each category. Luckily Pandas can do this automatically. We just have to ask our data to look at the column “homeownership” and then count up the different values. This is just good data to have.

counts = DF['homeownership'].value_counts()
counts
homeownership
MORTGAGE    4789
RENT        3858
OWN         1353
Name: count, dtype: int64

Now we can do a bar plot where we put the homeownership information on the x-axis.

I had to add a command:

fig.update_traces(dict(marker_line_width=0))

to make the bars a nice solid color.

fig = px.bar(DF,
            x='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()

Look what happens if we change our data to the y-axis.

fig = px.bar(DF,
            y='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()

You can choose what direction you want your bar plots to go!

Customizing Bar Plots

Similar to histograms we can add all sorts of extra information to our bar plots!

Better labels

Q Can you figure out how to add x labels, y labels, and a title to this graph?

fig = px.bar(DF,
            x='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()

Q Try to make your own bar plot of one of the other categorical columns.

Adding categorical fill

You will notice that the commands are really similar to what we did above with histograms. Here we choose a column to identify the color, instead of giving a single color.

fig = px.bar(DF,
            x='homeownership',
            color='grade')
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()
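
The numbers behind a stacked bar plot like this can be sketched as a table of counts for each pair of categories. This uses only the first five rows shown earlier (stored under the placeholder name toy), so the counts are tiny.

```python
import pandas as pd

# First five rows of the two categorical columns shown earlier
toy = pd.DataFrame({
    "homeownership": ["MORTGAGE", "RENT", "RENT", "RENT", "RENT"],
    "grade": ["C", "C", "D", "A", "C"],
})

# Rows are homeownership values, columns are grades, entries are counts
table = pd.crosstab(toy["homeownership"], toy["grade"])
print(table)
```

Each row of the table corresponds to one bar, and each entry in that row is the height of one colored piece of the bar.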

Breaking bar graph into Facets

fig = px.bar(DF,
            x='homeownership',
            facet_col='grade',
             facet_col_wrap=2,
            color='grade')
fig.update_traces(dict(marker_line_width=0))

fig.update_layout(title='Homeownership Counts',
                  title_x=0.5,
                  xaxis_title="Homeownership",
                  yaxis_title="Count",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()

There are so many types of graphs!!!

Examples of what plotly can do!

This is just the start of what you can do using Python to create graphs.

Be patient, it takes time to build up a good graph. Start simple and slowly build up adding one thing at a time.

There are even other more fancy graphing modules that you can try out, but for now, let’s just get good at plotly.