Introduction to Data Science

Doing Data Science

Author

Joanna Bieri
DATA101

Important Information

Announcements

NEXT WEEK - Data Ethics. This week you should be reading your book or articles.

Day 12 Assignment - see Web Scraping Notes

The goal of this lecture is to look at the overall data science process and think about what you might include in your Final Project proposals.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

————————

What does it mean to do Data Science?

    graph LR
    subgraph Data Science Initial Exploration
    A((Subject Matter Understanding <br> Define the Question))-->B((Data Gathering <br> Mining <br> Ethical Questions ))
    B-->C((Data Cleaning <br> Wrangling))
    C-->D((Data Exploration <br> Visualization <br> Exploratory Data Analysis))
    end
    
    subgraph Data Science Modeling
    D-->E((Feature Engineering <br> Data Preparation))
    E-->F((Predictive Modeling))
    F-->G((Data Visualization <br> Communication <br> Sharing))
    end
    G-->A

    style D fill:#f9f,stroke:#333,stroke-width:4px
    style G fill:#f9f,stroke:#333,stroke-width:4px

Figure 1: Data Science Lifecycle

Some important things to keep in mind:

  1. It is your job to make sure your results are reproducible. This means that someone else could follow your work and get to the same conclusions.
  2. It is your job to really get to know your data.

Today we will consider the data science lifecycle more broadly, along with some important parts of doing data science! The content for this class comes from Data Science in a Box - Unit 2 day 17 - Doing Data Science; it has been translated into Python and updated for our class.


Five core activities of data analysis

  1. Stating and refining the question
  2. Exploring the data
  3. Building formal statistical models
  4. Interpreting the results
  5. Communicating the results

Roger D. Peng and Elizabeth Matsui. “The Art of Data Science.” A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC (2015).

How do we come up with Data Science Questions?

  1. Descriptive: summarize a characteristic of a set of data
  2. Exploratory: analyze to see if there are patterns, trends, or relationships between variables (hypothesis generating)
  3. Inferential: analyze patterns, trends, or relationships in representative data from a population
  4. Predictive: make predictions for individuals or groups of individuals
  5. Causal: whether changing one factor will change another factor, on average, in a population
  6. Mechanistic: explore “how” as opposed to whether

Jeffery T. Leek and Roger D. Peng. “What is the question?.” Science 347.6228 (2015): 1314-1315.

Ex: COVID-19 and Vitamin D

  1. Descriptive: frequency of hospitalizations due to COVID-19 in a set of data collected from a group of individuals
  2. Exploratory: examine relationships between a range of dietary factors and COVID-19 hospitalizations
  3. Inferential: examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalizations found in the sample hold for the population at large
  4. Predictive: what types of people will take Vitamin D supplements during the next year
  5. Causal: whether people with COVID-19 who were randomly assigned to take Vitamin D supplements are hospitalized less often than those who were not
  6. Mechanistic: how increased vitamin D intake leads to a reduction in the number of viral illnesses

Other questions you should ask

  • Do you have appropriate data to answer your question?
  • Do you have information on confounding variables?
  • Was the data you’re working with collected in a way that introduces bias?

Examples of biased samples:

Suppose you surveyed only University of Redlands students about how many people in their household are attending or have attended college, and then tried to apply those results to all households in Southern California. This data would be biased toward households with at least one college student.

An AI-based candidate evaluation tool developed by Amazon in the mid-2010s learned from data on past hiring decisions and, as a result, systematically excluded women from the pool of qualified candidates.

Philadelphia’s SEPTA security system used algorithms to learn patterns in criminal behavior. Because the training data reflects biases in crime, policing, and incarceration trends that disproportionately affect people of color, those algorithms may predict that people of color are more likely to be criminals.

Exploratory data analysis

Checklist

  • Formulate your question
  • Read in your data
  • Check the dimensions
  • Look at the data
  • Validate with at least one external data source
  • Make a plot
  • Try the easy solution first

Formulate your question

  • Consider scope:
    • Are air pollution levels higher on the east coast than on the west coast?
    • Are hourly ozone levels on average higher in New York City than they are in Los Angeles?
    • Do counties in the eastern United States have higher ozone levels than counties in the western United States?
  • Most importantly: “Do I have the right data to answer this question?”

Read in your data

  • Place your data in a folder called data
  • Read it into Python with pd.read_****() depending on the file type
  • Check the data formatting and do some basic cleanup
    • Make sure the data is tidy
    • Change column names
    • Fix strings or categorical information
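The read-in and cleanup steps above can be sketched on a small synthetic CSV (the file contents and column names here are made up for illustration; your real file would live in the data folder and be read with the appropriate pd.read_****() function):

```python
import pandas as pd
from io import StringIO

# Hypothetical raw CSV standing in for a file in the data/ folder
raw = StringIO("First Name,AGE\nAlice, adult \nBob, juvenile \n")

DF = pd.read_csv(raw)

# Basic cleanup: tidy the column names ...
DF.columns = DF.columns.str.strip().str.lower().str.replace(' ', '_')

# ... and fix the string / categorical information
DF['age'] = DF['age'].str.strip()         # remove stray whitespace
DF['age'] = DF['age'].astype('category')  # treat as categorical
```

After cleanup the columns are `first_name` and `age`, and the age values are clean category labels.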

Check the dimensions

DF.shape

Look at the Data

Searchable data frame:

show(DF)

Just show the data frame:

DF

Look at the top of the data frame:

DF.head()

Look at the bottom of the data frame:

DF.tail()
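On the squirrels data these calls print large tables, so here is a minimal sketch on a tiny made-up frame showing what each inspection call returns:

```python
import pandas as pd

# A tiny hypothetical frame to illustrate the inspection calls
DF = pd.DataFrame({'id': range(5),
                   'color': ['Gray', 'Black', 'Gray', 'Gray', 'Black']})

print(DF.shape)    # (rows, columns) -> (5, 2)
print(DF.head(2))  # first 2 rows
print(DF.tail(2))  # last 2 rows
```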

Validate with at least one external data source

  • Check that the data is reasonable
  • Can you confirm that locations, times, etc. are correct?

Make a plot

We have been practicing this a lot! Your plot should be:

  • Interesting
  • Useful
  • Well formatted

Try the easy solution first

The most complicated solution is not always the best. Start simple and then go deeper, creating a more extensive analysis.


Communicating for your audience

  • Avoid: Jargon, uninterpreted results, lengthy output
  • Pay attention to: Organization, presentation, flow
  • Don’t forget about: Code style, coding best practices, meaningful commits
  • Be open to: Suggestions, feedback, taking (calculated) risks

Example - Squirrels

  • The Squirrel Census is a multimedia science, design, and storytelling project focusing on the Eastern gray (Sciurus carolinensis). They count squirrels and present their findings to the public.
  • This table contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans.

Information about data: mine-cetinkaya-rundel.github.io/nycsquirrels18/reference/squirrels.html

DF = pd.read_csv('data/squirrels.csv')
DF
Unnamed: 0 long lat unique_squirrel_id hectare shift date hectare_squirrel_number age primary_fur_color ... tail_twitches approaches indifferent runs_from other_interactions zip_codes community_districts borough_boundaries city_council_districts police_precincts
0 0 -73.974299 40.777045 13A-PM-1014-04 13A PM 2018-10-14 4.0 NaN Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
1 1 -73.968052 40.777339 15F-PM-1010-06 15F PM 2018-10-10 6.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
2 2 -73.968807 40.781202 19C-PM-1018-02 19C PM 2018-10-18 2.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
3 3 -73.968857 40.783783 21B-AM-1019-04 21B AM 2018-10-19 4.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0
4 4 -73.969010 40.785336 23A-AM-1018-02 23A AM 2018-10-18 2.0 Juvenile Black ... False False True False NaN NaN 19.0 4.0 19.0 13.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3018 3018 -73.963280 40.780832 21H-PM-1018-01 21H PM 2018-10-18 1.0 Juvenile Gray ... True True False False NaN NaN 19.0 4.0 19.0 13.0
3019 3019 -73.960864 40.790212 31D-PM-1006-02 31D PM 2018-10-06 2.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
3020 3020 -73.959838 40.795684 37B-AM-1018-04 37B AM 2018-10-18 4.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3021 3021 -73.967429 40.782972 21C-PM-1006-01 21C PM 2018-10-06 1.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3022 3022 -73.971872 40.770234 7G-PM-1018-04 07G PM 2018-10-18 4.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0

3023 rows × 36 columns

Check the dimensions and look at the data

DF.shape
(3023, 36)
DF.columns
Index(['Unnamed: 0', 'long', 'lat', 'unique_squirrel_id', 'hectare', 'shift',
       'date', 'hectare_squirrel_number', 'age', 'primary_fur_color',
       'highlight_fur_color', 'combination_of_primary_and_highlight_color',
       'color_notes', 'location', 'above_ground_sighter_measurement',
       'specific_location', 'running', 'chasing', 'climbing', 'eating',
       'foraging', 'other_activities', 'kuks', 'quaas', 'moans', 'tail_flags',
       'tail_twitches', 'approaches', 'indifferent', 'runs_from',
       'other_interactions', 'zip_codes', 'community_districts',
       'borough_boundaries', 'city_council_districts', 'police_precincts'],
      dtype='object')
DF.head(10)
Unnamed: 0 long lat unique_squirrel_id hectare shift date hectare_squirrel_number age primary_fur_color ... tail_twitches approaches indifferent runs_from other_interactions zip_codes community_districts borough_boundaries city_council_districts police_precincts
0 0 -73.974299 40.777045 13A-PM-1014-04 13A PM 2018-10-14 4.0 NaN Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
1 1 -73.968052 40.777339 15F-PM-1010-06 15F PM 2018-10-10 6.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
2 2 -73.968807 40.781202 19C-PM-1018-02 19C PM 2018-10-18 2.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
3 3 -73.968857 40.783783 21B-AM-1019-04 21B AM 2018-10-19 4.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0
4 4 -73.969010 40.785336 23A-AM-1018-02 23A AM 2018-10-18 2.0 Juvenile Black ... False False True False NaN NaN 19.0 4.0 19.0 13.0
5 5 -73.953123 40.794111 38H-PM-1012-01 38H PM 2018-10-12 1.0 Adult Gray ... False False False True NaN NaN 19.0 4.0 19.0 13.0
6 6 -73.977216 40.768840 3D-AM-1006-06 03D AM 2018-10-06 6.0 NaN Gray ... False False False False did not run from humans,but ran straight to tr... NaN 19.0 4.0 19.0 13.0
7 7 -73.956576 40.799246 42C-AM-1007-02 42C AM 2018-10-07 2.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0
8 8 -73.976798 40.774363 9A-PM-1010-02 09A PM 2018-10-10 2.0 Adult NaN ... False False True False NaN NaN 19.0 4.0 19.0 13.0
9 9 -73.975292 40.773933 9B-AM-1010-04 09B AM 2018-10-10 4.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0

10 rows × 36 columns

DF.tail(10)
Unnamed: 0 long lat unique_squirrel_id hectare shift date hectare_squirrel_number age primary_fur_color ... tail_twitches approaches indifferent runs_from other_interactions zip_codes community_districts borough_boundaries city_council_districts police_precincts
3013 3013 -73.971465 40.771740 9F-AM-1013-03 09F AM 2018-10-13 3.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
3014 3014 -73.974692 40.765523 1G-AM-1006-02 01G AM 2018-10-06 2.0 Juvenile Gray ... True False True False NaN NaN 19.0 4.0 19.0 13.0
3015 3015 -73.976363 40.768232 3E-PM-1008-04 03E PM 2018-10-08 4.0 Adult Gray ... False False False True NaN NaN 19.0 4.0 19.0 13.0
3016 3016 -73.974329 40.775609 11B-AM-1007-05 11B AM 2018-10-07 5.0 Adult Gray ... False True False False NaN NaN 19.0 4.0 19.0 13.0
3017 3017 -73.975427 40.770241 6D-PM-1020-01 06D PM 2018-10-20 1.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
3018 3018 -73.963280 40.780832 21H-PM-1018-01 21H PM 2018-10-18 1.0 Juvenile Gray ... True True False False NaN NaN 19.0 4.0 19.0 13.0
3019 3019 -73.960864 40.790212 31D-PM-1006-02 31D PM 2018-10-06 2.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
3020 3020 -73.959838 40.795684 37B-AM-1018-04 37B AM 2018-10-18 4.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3021 3021 -73.967429 40.782972 21C-PM-1006-01 21C PM 2018-10-06 1.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3022 3022 -73.971872 40.770234 7G-PM-1018-04 07G PM 2018-10-18 4.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0

10 rows × 36 columns

Try to validate the data with external sources

In this data set we have some latitude and longitude data. Is this actually Central Park?

YES - we can google the coordinates:

40.7826° N, 73.9656° W Latitude and longitude coordinates are: 40.785091, -73.968285. One of the key landmarks of New York City and Manhattan, Central Park is one of the most famous parks in the world. It enjoys a great location right between the Upper East and the Upper West Sides of Manhattan.

So this data seems reasonable in terms of location.
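This kind of check can also be done in code, by testing whether the coordinates fall inside a bounding box around Central Park. The box values below are rough approximations, and the three rows are a synthetic stand-in for the real data/squirrels.csv columns:

```python
import pandas as pd

# Synthetic stand-in for the lat/long columns of the squirrels data
DF = pd.DataFrame({'lat':  [40.777, 40.785, 40.795],
                   'long': [-73.974, -73.969, -73.960]})

# Approximate bounding box around Central Park (assumed values)
in_park = (DF['lat'].between(40.764, 40.800) &
           DF['long'].between(-73.982, -73.949))

print(in_park.all())  # True if every sighting falls inside the box
```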

Make some plots

Maybe plot the latitude and longitude - what is the shape?

fig = px.scatter(DF,x='long',y='lat',opacity=.3,color_discrete_sequence=['gray'])

fig.show()

Central Park is shaped like this, and it makes sense that there might be empty spots where there is water or where humans can't reach to record a sighting.

As you are making plots start asking questions.

  • Are there certain areas where there are more sightings?
  • Are there more juvenile or adult sightings?
  • What are the most common behaviors?

Let's go for some easy answers!

fig = px.scatter(DF,x='long',y='lat',opacity=.3,color='age')

fig.show()
behaviors = ['running', 'chasing', 'climbing', 'eating',
       'foraging', 'other_activities']

DF_behavior = DF[behaviors]
DF_behavior.value_counts()
running  chasing  climbing  eating  foraging  other_activities           
False    False    False     False   True      digging                        12
                                    False     sitting                        11
                                    True      walking                         7
                                              burying                         5
                                    False     playing                         5
                                                                             ..
                                    True      scratching!                     1
                                              scratching self                 1
                                              scanning (drawing included)     1
                                              quietly                         1
True     True     True      True    False     playing tag together            1
Name: count, Length: 361, dtype: int64
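The value_counts() output above lists every combination of behaviors, which is hard to read. An easier first cut is to count each behavior on its own: since the behavior columns are booleans, summing a column counts the True values. A minimal sketch on a synthetic frame with the same kind of boolean columns:

```python
import pandas as pd

# Synthetic stand-in with boolean behavior columns like the squirrels data
DF = pd.DataFrame({'running':  [True, False, True, False],
                   'chasing':  [False, False, True, False],
                   'foraging': [True, True, True, False]})

behaviors = ['running', 'chasing', 'foraging']

# Count sightings per behavior: each True sums as 1
counts = DF[behaviors].sum().sort_values(ascending=False)
print(counts)
```

This gives one count per behavior, sorted from most to least common.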

Then think about going deeper

  • Browse the data - look for weird things
  • Especially look at notes or comments columns
  • What more would I like to know?
  • Do I have a hypothesis?