Introduction to Data Science

Doing Data Science

Author

Joanna Bieri
DATA101

Important Information

Announcements

NEXT WEEK - Data Ethics. This week you should be reading your book or articles.

Day 12 Assignment - see Web Scraping Notes

The goal of this lecture is to look at the overall data science process and think about what you might include in your Final Project proposals.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

————————

What does it mean to do Data Science?

    graph LR
    subgraph Data Science Initial Exploration
    A((Subject Matter Understanding <br> Define the Question))-->B((Data Gathering <br> Mining <br> Ethical Questions ))
    B-->C((Data Cleaning <br> Wrangling))
    C-->D((Data Exploration <br> Visualization <br> Exploratory Data Analysis))
    end
    
    subgraph Data Science Modeling
    D-->E((Feature Engineering <br> Data Preparation))
    E-->F((Predictive Modeling))
    F-->G((Data Visualization <br> Communication <br> Sharing))
    end
    G-->A

    style D fill:#f9f,stroke:#333,stroke-width:4px
    style G fill:#f9f,stroke:#333,stroke-width:4px

Figure 1: Data Science Lifecycle

Some important things to keep in mind:

  1. It is your job to make sure your results are reproducible. This means that someone else could follow your work and get to the same conclusions.
  2. It is your job to really get to know your data.

Today we will consider the data science lifecycle more broadly, along with some important parts of doing data science! The content for this class comes from Data Science in a Box - Unit 2 day 17 - Doing Data Science; it has been translated into Python and updated for our class.


Five core activities of data analysis

  1. Stating and refining the question
  2. Exploring the data
  3. Building formal statistical models
  4. Interpreting the results
  5. Communicating the results

Roger D. Peng and Elizabeth Matsui. “The Art of Data Science.” A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC (2015).

How do we come up with Data Science Questions?

  1. Descriptive: summarize a characteristic of a set of data
  2. Exploratory: analyze to see if there are patterns, trends, or relationships between variables (hypothesis generating)
  3. Inferential: analyze patterns, trends, or relationships in representative data from a population
  4. Predictive: make predictions for individuals or groups of individuals
  5. Causal: whether changing one factor will change another factor, on average, in a population
  6. Mechanistic: explore “how” as opposed to whether

Jeffery T. Leek and Roger D. Peng. “What is the question?.” Science 347.6228 (2015): 1314-1315.

Ex: COVID-19 and Vitamin D

  1. Descriptive: frequency of hospitalizations due to COVID-19 in a set of data collected from a group of individuals
  2. Exploratory: examine relationships between a range of dietary factors and COVID-19 hospitalizations
  3. Inferential: examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalizations found in the sample hold for the population at large
  4. Predictive: what types of people will take Vitamin D supplements during the next year
  5. Causal: whether people with COVID-19 who were randomly assigned to take Vitamin D supplements are hospitalized less often than those who were not
  6. Mechanistic: how increased vitamin D intake leads to a reduction in the number of viral illnesses

Other questions you should ask

  • Do you have appropriate data to answer your question?
  • Do you have information on confounding variables?
  • Was the data you’re working with collected in a way that introduces bias?

Examples of biased samples:

Suppose you surveyed only University of Redlands students about how many people in their household are attending or have attended college, and then tried to apply those results to all households in Southern California. This data would be biased toward households with at least one college student.

An AI-based candidate evaluation tool developed by Amazon in the mid-2010s learned from data on past hiring decisions and, as a result, systematically excluded women from the pool of qualified candidates.

Philadelphia’s SEPTA security system used algorithms to learn patterns in criminal behavior. Because the training data reflects biases in crime, policing, and incarceration trends that disproportionately affect people of color, those algorithms may predict that people of color are more likely to be criminals.

Exploratory data analysis

Checklist

  • Formulate your question
  • Read in your data
  • Check the dimensions
  • Look at the data
  • Validate with at least one external data source
  • Make a plot
  • Try the easy solution first

Formulate your question

  • Consider scope:
    • Are air pollution levels higher on the east coast than on the west coast?
    • Are hourly ozone levels on average higher in New York City than they are in Los Angeles?
    • Do counties in the eastern United States have higher ozone levels than counties in the western United States?
  • Most importantly: “Do I have the right data to answer this question?”

Read in your data

  • Place your data in a folder called data
  • Read it into Python with pd.read_****() depending on the file type
  • Check the data formatting and do some basic cleanup
    • Make sure the data is tidy
    • Change column names
    • Fix strings or categorical information
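The read-in and cleanup steps above can be sketched on a small synthetic CSV (the file contents and column names here are made up for illustration; your real file would live in the data folder and be read with the appropriate pd.read_****() function):

```python
import pandas as pd
from io import StringIO

# Hypothetical raw CSV standing in for a file in the data/ folder
raw = StringIO("First Name,AGE\nAlice, adult \nBob, juvenile \n")

DF = pd.read_csv(raw)

# Basic cleanup: tidy the column names ...
DF.columns = DF.columns.str.strip().str.lower().str.replace(' ', '_')

# ... and fix the string / categorical information
DF['age'] = DF['age'].str.strip()         # remove stray whitespace
DF['age'] = DF['age'].astype('category')  # treat as categorical
```

After cleanup the columns are `first_name` and `age`, and the age values are clean category labels.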

Check the dimensions

DF.shape

Look at the Data

Searchable data frame:

show(DF)

Just show the data frame:

DF

Look at the top of the data frame:

DF.head()

Look at the bottom of the data frame:

DF.tail()
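On the squirrels data these calls print large tables, so here is a minimal sketch on a tiny made-up frame showing what each inspection call returns:

```python
import pandas as pd

# A tiny hypothetical frame to illustrate the inspection calls
DF = pd.DataFrame({'id': range(5),
                   'color': ['Gray', 'Black', 'Gray', 'Gray', 'Black']})

print(DF.shape)    # (rows, columns) -> (5, 2)
print(DF.head(2))  # first 2 rows
print(DF.tail(2))  # last 2 rows
```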

Validate with at least one external data source

  • Check that the data is reasonable
  • Can you confirm that locations, times, etc. are correct?

Make a plot

We have been practicing this a lot! Your plot should be:

  • Interesting
  • Useful
  • Well formatted

Try the easy solution first

The most complicated solution is not always the best. Start simple and then go deeper, creating a more extensive analysis.


Communicating for your audience

  • Avoid: Jargon, uninterpreted results, lengthy output
  • Pay attention to: Organization, presentation, flow
  • Don’t forget about: Code style, coding best practices, meaningful commits
  • Be open to: Suggestions, feedback, taking (calculated) risks

Example - Squirrels

  • The Squirrel Census is a multimedia science, design, and storytelling project focusing on the Eastern gray (Sciurus carolinensis). They count squirrels and present their findings to the public.
  • This table contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans.

Information about data: mine-cetinkaya-rundel.github.io/nycsquirrels18/reference/squirrels.html

DF = pd.read_csv('data/squirrels.csv')
DF
Unnamed: 0 long lat unique_squirrel_id hectare shift date hectare_squirrel_number age primary_fur_color ... tail_twitches approaches indifferent runs_from other_interactions zip_codes community_districts borough_boundaries city_council_districts police_precincts
0 0 -73.974299 40.777045 13A-PM-1014-04 13A PM 2018-10-14 4.0 NaN Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
1 1 -73.968052 40.777339 15F-PM-1010-06 15F PM 2018-10-10 6.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
2 2 -73.968807 40.781202 19C-PM-1018-02 19C PM 2018-10-18 2.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
3 3 -73.968857 40.783783 21B-AM-1019-04 21B AM 2018-10-19 4.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0
4 4 -73.969010 40.785336 23A-AM-1018-02 23A AM 2018-10-18 2.0 Juvenile Black ... False False True False NaN NaN 19.0 4.0 19.0 13.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3018 3018 -73.963280 40.780832 21H-PM-1018-01 21H PM 2018-10-18 1.0 Juvenile Gray ... True True False False NaN NaN 19.0 4.0 19.0 13.0
3019 3019 -73.960864 40.790212 31D-PM-1006-02 31D PM 2018-10-06 2.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
3020 3020 -73.959838 40.795684 37B-AM-1018-04 37B AM 2018-10-18 4.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3021 3021 -73.967429 40.782972 21C-PM-1006-01 21C PM 2018-10-06 1.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3022 3022 -73.971872 40.770234 7G-PM-1018-04 07G PM 2018-10-18 4.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0

3023 rows × 36 columns

Check the dimensions and look at the data

DF.shape
(3023, 36)
DF.columns
Index(['Unnamed: 0', 'long', 'lat', 'unique_squirrel_id', 'hectare', 'shift',
       'date', 'hectare_squirrel_number', 'age', 'primary_fur_color',
       'highlight_fur_color', 'combination_of_primary_and_highlight_color',
       'color_notes', 'location', 'above_ground_sighter_measurement',
       'specific_location', 'running', 'chasing', 'climbing', 'eating',
       'foraging', 'other_activities', 'kuks', 'quaas', 'moans', 'tail_flags',
       'tail_twitches', 'approaches', 'indifferent', 'runs_from',
       'other_interactions', 'zip_codes', 'community_districts',
       'borough_boundaries', 'city_council_districts', 'police_precincts'],
      dtype='object')
DF.head(10)
Unnamed: 0 long lat unique_squirrel_id hectare shift date hectare_squirrel_number age primary_fur_color ... tail_twitches approaches indifferent runs_from other_interactions zip_codes community_districts borough_boundaries city_council_districts police_precincts
0 0 -73.974299 40.777045 13A-PM-1014-04 13A PM 2018-10-14 4.0 NaN Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
1 1 -73.968052 40.777339 15F-PM-1010-06 15F PM 2018-10-10 6.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
2 2 -73.968807 40.781202 19C-PM-1018-02 19C PM 2018-10-18 2.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
3 3 -73.968857 40.783783 21B-AM-1019-04 21B AM 2018-10-19 4.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0
4 4 -73.969010 40.785336 23A-AM-1018-02 23A AM 2018-10-18 2.0 Juvenile Black ... False False True False NaN NaN 19.0 4.0 19.0 13.0
5 5 -73.953123 40.794111 38H-PM-1012-01 38H PM 2018-10-12 1.0 Adult Gray ... False False False True NaN NaN 19.0 4.0 19.0 13.0
6 6 -73.977216 40.768840 3D-AM-1006-06 03D AM 2018-10-06 6.0 NaN Gray ... False False False False did not run from humans,but ran straight to tr... NaN 19.0 4.0 19.0 13.0
7 7 -73.956576 40.799246 42C-AM-1007-02 42C AM 2018-10-07 2.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0
8 8 -73.976798 40.774363 9A-PM-1010-02 09A PM 2018-10-10 2.0 Adult NaN ... False False True False NaN NaN 19.0 4.0 19.0 13.0
9 9 -73.975292 40.773933 9B-AM-1010-04 09B AM 2018-10-10 4.0 NaN NaN ... False False False False NaN NaN 19.0 4.0 19.0 13.0

10 rows × 36 columns

DF.tail(10)
Unnamed: 0 long lat unique_squirrel_id hectare shift date hectare_squirrel_number age primary_fur_color ... tail_twitches approaches indifferent runs_from other_interactions zip_codes community_districts borough_boundaries city_council_districts police_precincts
3013 3013 -73.971465 40.771740 9F-AM-1013-03 09F AM 2018-10-13 3.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0
3014 3014 -73.974692 40.765523 1G-AM-1006-02 01G AM 2018-10-06 2.0 Juvenile Gray ... True False True False NaN NaN 19.0 4.0 19.0 13.0
3015 3015 -73.976363 40.768232 3E-PM-1008-04 03E PM 2018-10-08 4.0 Adult Gray ... False False False True NaN NaN 19.0 4.0 19.0 13.0
3016 3016 -73.974329 40.775609 11B-AM-1007-05 11B AM 2018-10-07 5.0 Adult Gray ... False True False False NaN NaN 19.0 4.0 19.0 13.0
3017 3017 -73.975427 40.770241 6D-PM-1020-01 06D PM 2018-10-20 1.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
3018 3018 -73.963280 40.780832 21H-PM-1018-01 21H PM 2018-10-18 1.0 Juvenile Gray ... True True False False NaN NaN 19.0 4.0 19.0 13.0
3019 3019 -73.960864 40.790212 31D-PM-1006-02 31D PM 2018-10-06 2.0 Adult Gray ... False False True False NaN NaN 19.0 4.0 19.0 13.0
3020 3020 -73.959838 40.795684 37B-AM-1018-04 37B AM 2018-10-18 4.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3021 3021 -73.967429 40.782972 21C-PM-1006-01 21C PM 2018-10-06 1.0 Adult Gray ... True False False False NaN NaN 19.0 4.0 19.0 13.0
3022 3022 -73.971872 40.770234 7G-PM-1018-04 07G PM 2018-10-18 4.0 Adult Gray ... False False False False NaN NaN 19.0 4.0 19.0 13.0

10 rows × 36 columns

Try to validate the data with external sources

In this data set we have some latitude and longitude data. Is this actually Central Park?

YES - we can google the coordinates:

40.7826° N, 73.9656° W Latitude and longitude coordinates are: 40.785091, -73.968285. One of the key landmarks of New York City and Manhattan, Central Park is one of the most famous parks in the world. It enjoys a great location right between the Upper East and the Upper West Sides of Manhattan.

So this data seems reasonable in terms of location.
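This kind of check can also be done in code, by testing whether the coordinates fall inside a bounding box around Central Park. The box values below are rough approximations, and the three rows are a synthetic stand-in for the real data/squirrels.csv columns:

```python
import pandas as pd

# Synthetic stand-in for the lat/long columns of the squirrels data
DF = pd.DataFrame({'lat':  [40.777, 40.785, 40.795],
                   'long': [-73.974, -73.969, -73.960]})

# Approximate bounding box around Central Park (assumed values)
in_park = (DF['lat'].between(40.764, 40.800) &
           DF['long'].between(-73.982, -73.949))

print(in_park.all())  # True if every sighting falls inside the box
```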

Make some plots

Maybe plot the latitude and longitude - what is the shape?

fig = px.scatter(DF,x='long',y='lat',opacity=.3,color_discrete_sequence=['gray'])

fig.show()

Central Park is shaped like this, and it makes sense that there might be empty spots where there is water or where humans can't reach to record a sighting.

As you are making plots start asking questions.

  • Are there certain areas where there are more sightings?
  • Are there more juvenile or adult sightings?
  • What are the most common behaviors?

Let's go for some easy answers!

fig = px.scatter(DF,x='long',y='lat',opacity=.3,color='age')

fig.show()
behaviors = ['running', 'chasing', 'climbing', 'eating',
       'foraging', 'other_activities']

DF_behavior = DF[behaviors]
DF_behavior.value_counts()
running  chasing  climbing  eating  foraging  other_activities           
False    False    False     False   True      digging                        12
                                    False     sitting                        11
                                    True      walking                         7
                                              burying                         5
                                    False     playing                         5
                                                                             ..
                                    True      scratching!                     1
                                              scratching self                 1
                                              scanning (drawing included)     1
                                              quietly                         1
True     True     True      True    False     playing tag together            1
Name: count, Length: 361, dtype: int64
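The value_counts() output above lists every combination of behaviors, which is hard to read. An easier first cut is to count each behavior on its own: since the behavior columns are booleans, summing a column counts the True values. A minimal sketch on a synthetic frame with the same kind of boolean columns:

```python
import pandas as pd

# Synthetic stand-in with boolean behavior columns like the squirrels data
DF = pd.DataFrame({'running':  [True, False, True, False],
                   'chasing':  [False, False, True, False],
                   'foraging': [True, True, True, False]})

behaviors = ['running', 'chasing', 'foraging']

# Count sightings per behavior: each True sums as 1
counts = DF[behaviors].sum().sort_values(ascending=False)
print(counts)
```

This gives one count per behavior, sorted from most to least common.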

Then think about going deeper

  • Browse the data - look for weird things
  • Especially look at notes or comments columns
  • What more would I like to know?
  • Do I have a hypothesis?