import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show
# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Introduction to Data Science
Doing Data Science
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
NEXT WEEK - Data Ethics. This week you should be reading your book or articles.
Day 12 Assignment - see Web Scraping Notes
The goal of this lecture is to look at the overall data science process and think about what you might include in your Final Project proposals.
————————
What does it mean to do Data Science?
graph LR
subgraph Data Science Initial Exploration
A((Subject Matter Understanding <br> Define the Question))-->B((Data Gathering <br> Mining <br> Ethical Questions))
B-->C((Data Cleaning <br> Wrangling))
C-->D((Data Exploration <br> Visualization <br> Exploratory Data Analysis))
end
subgraph Data Science Modeling
D-->E((Feature Engineering <br> Data Preparation))
E-->F((Predictive Modeling))
F-->G((Data Visualization <br> Communication <br> Sharing))
end
G-->A
style D fill:#f9f,stroke:#333,stroke-width:4px
style G fill:#f9f,stroke:#333,stroke-width:4px
Some important things to keep in mind:
- It is your job to make sure your results are reproducible. This means that someone else could follow your work and get to the same conclusions.
- It is your job to really get to know your data.
Today we will consider the data science lifecycle more broadly, along with some important parts of doing data science! The content for this class comes from Data Science in a Box - Unit 2 day 17 - Doing Data Science. It has been translated into Python and updated for our class.
Five core activities of data analysis
- Stating and refining the question
- Exploring the data
- Building formal statistical models
- Interpreting the results
- Communicating the results
Roger D. Peng and Elizabeth Matsui. “The Art of Data Science.” A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC (2015).
How do we come up with Data Science Questions?
- Descriptive: summarize a characteristic of a set of data
- Exploratory: analyze to see if there are patterns, trends, or relationships between variables (hypothesis generating)
- Inferential: analyze patterns, trends, or relationships in representative data from a population
- Predictive: make predictions for individuals or groups of individuals
- Causal: whether changing one factor will change another factor, on average, in a population
- Mechanistic: explore “how” as opposed to whether
Jeffrey T. Leek and Roger D. Peng. “What is the question?” Science 347.6228 (2015): 1314-1315.
Ex: COVID-19 and Vitamin D
- Descriptive: frequency of hospitalizations due to COVID-19 in a set of data collected from a group of individuals
- Exploratory: examine relationships between a range of dietary factors and COVID-19 hospitalizations
- Inferential: examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalizations found in the sample holds for the population at large
- Predictive: what types of people will take Vitamin D supplements during the next year
- Causal: whether people with COVID-19 who were randomly assigned to take Vitamin D supplements are hospitalized at a different rate than those who were not
- Mechanistic: how increased vitamin D intake leads to a reduction in the number of viral illnesses
Other questions you should ask
- Do you have appropriate data to answer your question?
- Do you have information on confounding variables?
- Was the data you’re working with collected in a way that introduces bias?
Examples of biased samples:
Suppose you surveyed only University of Redlands students, asking how many people in their household are attending or have attended college, and then tried to apply those results to all households in Southern California. This data would be biased toward households with at least one college student.
An AI-based candidate evaluation tool developed by Amazon in the mid-2010s learned from data on past hiring decisions that excluded women from the pool of qualified candidates.
Philadelphia’s SEPTA security system used algorithms to learn patterns in criminal behavior from datasets that reflect biases in crime, policing, or incarceration trends that disproportionately affect people of color. As a result, those algorithms may predict that people of color are more likely to be criminals.
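The household survey example can be made concrete with a small simulation. This is only a sketch: the population below is made up (the household distribution is an assumption, not real Southern California data), but it shows how sampling only households that contain a student pushes the estimate upward.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: number of college attendees per household.
# The probabilities are invented for illustration only.
population = rng.choice([0, 1, 2, 3], size=100_000, p=[0.55, 0.30, 0.12, 0.03])

# Unbiased sample: every household is equally likely to be surveyed.
random_sample = rng.choice(population, size=1_000, replace=False)

# Biased sample: only households with at least one attendee can be reached,
# which mimics surveying current students about their own households.
reachable = population[population >= 1]
biased_sample = rng.choice(reachable, size=1_000, replace=False)

print(f"Population mean:    {population.mean():.2f}")
print(f"Random sample mean: {random_sample.mean():.2f}")
print(f"Biased sample mean: {biased_sample.mean():.2f}")
```

The biased sample can never contain a household with zero attendees, so its average has to come out higher than the true population average.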
Exploratory data analysis
Checklist
- Formulate your question
- Read in your data
- Check the dimensions
- Look at the data
- Validate with at least one external data source
- Make a plot
- Try the easy solution first
Formulate your question
- Consider scope:
- Are air pollution levels higher on the east coast than on the west coast?
- Are hourly ozone levels on average higher in New York City than they are in Los Angeles?
- Do counties in the eastern United States have higher ozone levels than counties in the western United States?
- Most importantly: “Do I have the right data to answer this question?”
Read in your data
- Place your data in a folder called data
- Read it into Python with pd.read_****() depending on the file type
- Check the data formatting and do some basic cleanup
- Make sure the data is tidy
- Change column names
- Fix strings or categorical information
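Here is a minimal sketch of the read-in and cleanup steps above. The file contents, column names, and values are hypothetical; an in-memory string stands in for a CSV file in the data folder so that the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for a CSV file in the data folder (hypothetical contents).
raw_csv = io.StringIO(
    "Site Name,Ozone (ppb),Coast\n"
    "Albany ,41,east\n"
    "Fresno,58, West\n"
    "Boston,39,EAST \n"
)

DF = pd.read_csv(raw_csv)

# Change column names to short, consistent, lowercase names.
DF = DF.rename(columns={"Site Name": "site", "Ozone (ppb)": "ozone_ppb", "Coast": "coast"})

# Fix strings: strip stray whitespace and standardize capitalization,
# then store the cleaned labels as a categorical column.
DF["site"] = DF["site"].str.strip()
DF["coast"] = DF["coast"].str.strip().str.lower().astype("category")

print(DF)
print(DF.dtypes)
```

With a real file you would replace raw_csv with a path such as one inside your data folder and pick the reader that matches the file type.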
Check the dimensions
DF.shape
Look at the Data
Searchable data frame:
show(DF)
Just show the data frame:
DF
Look at the top of the data frame:
DF.head()
Look at the bottom of the data frame:
DF.tail()
Validate with at least one external data source
- Check that the data is reasonable
Can you make sure locations, times, etc. are correct?
Make a plot
We have been practicing this a lot! Your plot should be:
- Interesting
- Useful
- Well formatted
Try the easy solution first
The most complicated solution is not always the best. Start simple and then go deeper, creating a more extensive analysis.
Communicating for your audience
- Avoid: Jargon, uninterpreted results, lengthy output
- Pay attention to: Organization, presentation, flow
- Don’t forget about: Code style, coding best practices, meaningful commits
- Be open to: Suggestions, feedback, taking (calculated) risks
Example - Squirrels
- The Squirrel Census is a multimedia science, design, and storytelling project focusing on the Eastern gray (Sciurus carolinensis). They count squirrels and present their findings to the public.
- This table contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans.
Information about data: mine-cetinkaya-rundel.github.io/nycsquirrels18/reference/squirrels.html
DF = pd.read_csv('data/squirrels.csv')
DF
| | Unnamed: 0 | long | lat | unique_squirrel_id | hectare | shift | date | hectare_squirrel_number | age | primary_fur_color | ... | tail_twitches | approaches | indifferent | runs_from | other_interactions | zip_codes | community_districts | borough_boundaries | city_council_districts | police_precincts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | -73.974299 | 40.777045 | 13A-PM-1014-04 | 13A | PM | 2018-10-14 | 4.0 | NaN | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
1 | 1 | -73.968052 | 40.777339 | 15F-PM-1010-06 | 15F | PM | 2018-10-10 | 6.0 | Adult | Gray | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
2 | 2 | -73.968807 | 40.781202 | 19C-PM-1018-02 | 19C | PM | 2018-10-18 | 2.0 | Adult | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3 | 3 | -73.968857 | 40.783783 | 21B-AM-1019-04 | 21B | AM | 2018-10-19 | 4.0 | NaN | NaN | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
4 | 4 | -73.969010 | 40.785336 | 23A-AM-1018-02 | 23A | AM | 2018-10-18 | 2.0 | Juvenile | Black | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3018 | 3018 | -73.963280 | 40.780832 | 21H-PM-1018-01 | 21H | PM | 2018-10-18 | 1.0 | Juvenile | Gray | ... | True | True | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3019 | 3019 | -73.960864 | 40.790212 | 31D-PM-1006-02 | 31D | PM | 2018-10-06 | 2.0 | Adult | Gray | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3020 | 3020 | -73.959838 | 40.795684 | 37B-AM-1018-04 | 37B | AM | 2018-10-18 | 4.0 | Adult | Gray | ... | True | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3021 | 3021 | -73.967429 | 40.782972 | 21C-PM-1006-01 | 21C | PM | 2018-10-06 | 1.0 | Adult | Gray | ... | True | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3022 | 3022 | -73.971872 | 40.770234 | 7G-PM-1018-04 | 07G | PM | 2018-10-18 | 4.0 | Adult | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3023 rows × 36 columns
Check the dimensions and look at the data
DF.shape
(3023, 36)
DF.columns
Index(['Unnamed: 0', 'long', 'lat', 'unique_squirrel_id', 'hectare', 'shift',
'date', 'hectare_squirrel_number', 'age', 'primary_fur_color',
'highlight_fur_color', 'combination_of_primary_and_highlight_color',
'color_notes', 'location', 'above_ground_sighter_measurement',
'specific_location', 'running', 'chasing', 'climbing', 'eating',
'foraging', 'other_activities', 'kuks', 'quaas', 'moans', 'tail_flags',
'tail_twitches', 'approaches', 'indifferent', 'runs_from',
'other_interactions', 'zip_codes', 'community_districts',
'borough_boundaries', 'city_council_districts', 'police_precincts'],
dtype='object')
DF.head(10)
| | Unnamed: 0 | long | lat | unique_squirrel_id | hectare | shift | date | hectare_squirrel_number | age | primary_fur_color | ... | tail_twitches | approaches | indifferent | runs_from | other_interactions | zip_codes | community_districts | borough_boundaries | city_council_districts | police_precincts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | -73.974299 | 40.777045 | 13A-PM-1014-04 | 13A | PM | 2018-10-14 | 4.0 | NaN | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
1 | 1 | -73.968052 | 40.777339 | 15F-PM-1010-06 | 15F | PM | 2018-10-10 | 6.0 | Adult | Gray | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
2 | 2 | -73.968807 | 40.781202 | 19C-PM-1018-02 | 19C | PM | 2018-10-18 | 2.0 | Adult | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3 | 3 | -73.968857 | 40.783783 | 21B-AM-1019-04 | 21B | AM | 2018-10-19 | 4.0 | NaN | NaN | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
4 | 4 | -73.969010 | 40.785336 | 23A-AM-1018-02 | 23A | AM | 2018-10-18 | 2.0 | Juvenile | Black | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
5 | 5 | -73.953123 | 40.794111 | 38H-PM-1012-01 | 38H | PM | 2018-10-12 | 1.0 | Adult | Gray | ... | False | False | False | True | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
6 | 6 | -73.977216 | 40.768840 | 3D-AM-1006-06 | 03D | AM | 2018-10-06 | 6.0 | NaN | Gray | ... | False | False | False | False | did not run from humans,but ran straight to tr... | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
7 | 7 | -73.956576 | 40.799246 | 42C-AM-1007-02 | 42C | AM | 2018-10-07 | 2.0 | NaN | NaN | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
8 | 8 | -73.976798 | 40.774363 | 9A-PM-1010-02 | 09A | PM | 2018-10-10 | 2.0 | Adult | NaN | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
9 | 9 | -73.975292 | 40.773933 | 9B-AM-1010-04 | 09B | AM | 2018-10-10 | 4.0 | NaN | NaN | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
10 rows × 36 columns
DF.tail(10)
| | Unnamed: 0 | long | lat | unique_squirrel_id | hectare | shift | date | hectare_squirrel_number | age | primary_fur_color | ... | tail_twitches | approaches | indifferent | runs_from | other_interactions | zip_codes | community_districts | borough_boundaries | city_council_districts | police_precincts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3013 | 3013 | -73.971465 | 40.771740 | 9F-AM-1013-03 | 09F | AM | 2018-10-13 | 3.0 | Adult | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3014 | 3014 | -73.974692 | 40.765523 | 1G-AM-1006-02 | 01G | AM | 2018-10-06 | 2.0 | Juvenile | Gray | ... | True | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3015 | 3015 | -73.976363 | 40.768232 | 3E-PM-1008-04 | 03E | PM | 2018-10-08 | 4.0 | Adult | Gray | ... | False | False | False | True | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3016 | 3016 | -73.974329 | 40.775609 | 11B-AM-1007-05 | 11B | AM | 2018-10-07 | 5.0 | Adult | Gray | ... | False | True | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3017 | 3017 | -73.975427 | 40.770241 | 6D-PM-1020-01 | 06D | PM | 2018-10-20 | 1.0 | Adult | Gray | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3018 | 3018 | -73.963280 | 40.780832 | 21H-PM-1018-01 | 21H | PM | 2018-10-18 | 1.0 | Juvenile | Gray | ... | True | True | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3019 | 3019 | -73.960864 | 40.790212 | 31D-PM-1006-02 | 31D | PM | 2018-10-06 | 2.0 | Adult | Gray | ... | False | False | True | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3020 | 3020 | -73.959838 | 40.795684 | 37B-AM-1018-04 | 37B | AM | 2018-10-18 | 4.0 | Adult | Gray | ... | True | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3021 | 3021 | -73.967429 | 40.782972 | 21C-PM-1006-01 | 21C | PM | 2018-10-06 | 1.0 | Adult | Gray | ... | True | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
3022 | 3022 | -73.971872 | 40.770234 | 7G-PM-1018-04 | 07G | PM | 2018-10-18 | 4.0 | Adult | Gray | ... | False | False | False | False | NaN | NaN | 19.0 | 4.0 | 19.0 | 13.0 |
10 rows × 36 columns
Try to validate the data with external sources
In this data set we have some latitude and longitude data. Is this actually Central Park?
YES - we can google the coordinates:
40.7826° N, 73.9656° W
“Latitude and longitude coordinates are: 40.785091, -73.968285. One of the key landmarks of New York City and Manhattan, Central Park is one of the most famous parks in the world. It enjoys a great location right between the Upper East and the Upper West Sides of Manhattan.”
So this data seems reasonable in terms of location.
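The same check can be done in code rather than by eye. The sketch below uses a few coordinates copied from the first rows of the table and a rough bounding box for Central Park. The box limits are approximate values picked for illustration, not official park boundaries.

```python
import pandas as pd

# A few sightings copied from the first rows of the squirrel table.
sample = pd.DataFrame({
    "long": [-73.974299, -73.968052, -73.968807, -73.968857, -73.969010],
    "lat": [40.777045, 40.777339, 40.781202, 40.783783, 40.785336],
})

# Rough bounding box around Central Park (approximate, assumed values).
LAT_MIN, LAT_MAX = 40.764, 40.801
LONG_MIN, LONG_MAX = -73.982, -73.949

# True for each row whose coordinates fall inside the box.
inside = sample["lat"].between(LAT_MIN, LAT_MAX) & sample["long"].between(LONG_MIN, LONG_MAX)

print(f"{inside.sum()} of {len(sample)} sightings fall inside the box")
print(sample[~inside])  # any rows listed here deserve a closer look
```

Running the same filter on the full DF would flag any sighting whose coordinates land outside the park, which is a quick way to catch data entry errors.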
Make some plots
Maybe plot the latitude and longitude - what is the shape?
fig = px.scatter(DF, x='long', y='lat', opacity=.3, color_discrete_sequence=['gray'])
fig.show()
Central Park is shaped like this, and it makes sense that there might be empty spots where there is water or where humans can't reach to record a sighting.
As you are making plots start asking questions.
- Are there certain areas where there are more sightings?
- Are there more juvenile or adult sightings?
- What are the most common behaviors?
Let's go for some easy answers!
fig = px.scatter(DF, x='long', y='lat', opacity=.3, color='age')
fig.show()
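For the juvenile versus adult question, a simple count is an easy first answer. This sketch uses only the age values from the ten rows displayed by DF.head(10), so the numbers describe that small slice and not the full data set.

```python
import numpy as np
import pandas as pd

# The age column from the ten rows displayed by DF.head(10).
age = pd.Series(
    [np.nan, "Adult", "Adult", np.nan, "Juvenile", "Adult", np.nan, np.nan, "Adult", np.nan],
    name="age",
)

# dropna=False keeps missing values visible instead of silently dropping them.
age_counts = age.value_counts(dropna=False)
print(age_counts)
```

On the full data you could run DF['age'].value_counts(dropna=False) in the same way. Keeping the missing values in the count matters here, because half of these ten rows have no recorded age.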
behaviors = ['running', 'chasing', 'climbing', 'eating',
             'foraging', 'other_activities']
DF_behavior = DF[behaviors]
DF_behavior.value_counts()
running chasing climbing eating foraging other_activities
False False False False True digging 12
False sitting 11
True walking 7
burying 5
False playing 5
..
True scratching! 1
scratching self 1
scanning (drawing included) 1
quietly 1
True True True True False playing tag together 1
Name: count, Length: 361, dtype: int64
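Counting every combination of behaviors gives 361 different rows, which is hard to read. One easier alternative is to count each behavior on its own. The sketch below uses a small made-up sample with the same True/False behavior columns. It leaves out other_activities because that column holds free text rather than True/False values.

```python
import pandas as pd

# Small made-up sample with the same boolean behavior columns as the squirrel data.
sample = pd.DataFrame({
    "running":  [False, True, False, False, True],
    "chasing":  [False, False, False, True, True],
    "climbing": [True, False, False, False, True],
    "eating":   [False, False, True, False, False],
    "foraging": [True, False, True, True, False],
})

# True counts as 1, so summing a column counts the sightings with that behavior.
behavior_counts = sample.sum().sort_values(ascending=False)
print(behavior_counts)
```

The same idea applied to the real data would be summing the five boolean columns of DF, which gives one number per behavior and is easy to turn into a bar plot.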
Then think about going deeper
- Browse the data - look for weird things
- Especially look at notes or comments rows
- What more would I like to know?
- Do I have a hypothesis?
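One possible way to browse a notes column is to clean up the text and then count or search it. The notes below are made up in the style of the other_activities column, so treat this as a sketch of the approach and not as results from the squirrel data.

```python
import numpy as np
import pandas as pd

# Made-up free-text notes in the style of the other_activities column.
notes = pd.Series(
    [np.nan, "digging", "Digging ", "sitting", np.nan, "scratching!", "scratching self", "sitting"]
)

# Drop missing notes, then normalize whitespace and case so near-duplicates group together.
cleaned = notes.dropna().str.strip().str.lower()
note_counts = cleaned.value_counts()
print(note_counts)

# Search for a keyword to find related notes that are worded differently.
print(cleaned[cleaned.str.contains("scratch")])
```

Weird or one-off entries that show up here are often good starting points for new questions or a hypothesis.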