import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show
Introduction to Data Science
Reading Data and Data Types
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
Exam 1 - what to expect…
Please come to office hours to get help!
Day 8 Assignment - same drill.
- Make sure you can Fork and Clone the Day8 repo from Redlands-DATA101
- Open the file Day8-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
Reading in Your Own Data
A big part of data science is finding your own data and then analyzing it. To be able to use Python to analyze data you first have to learn how to load data into a JupyterLab Notebook. There are lots of different ways, but in this class we will use Pandas (pd) to read data.
- pd.read_csv This command lets you read files that end in .csv. You can create a file like this using a spreadsheet program and saving the file as a .csv. Often you can download data in .csv format. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is.
- pd.read_excel This command lets you read files created by Excel, typically with .xlsx file types. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is.
- pd.read_html This command will try to automatically download table type data from a given website. You will need an internet connection to use this.
- pd.read_json This command lets you read .json files. JSON stands for JavaScript Object Notation. It is a very popular way to save data. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is.
- pd.read_table If other methods fail, sometimes you can force Pandas to read a file by using the table method. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is. You will have to tell it what your delimiter is - the character that separates the columns (sometimes this is a comma like in .csv, but sometimes it is something else like a colon or space). A short sketch of these commands is shown below.
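Here is a rough sketch of what each of these commands looks like in practice. The file names below are just hypothetical placeholders - swap in the name and location of your own file (and note that read_excel usually needs the openpyxl package installed). The lines are commented out so nothing runs until you point them at real files.
# These file names are placeholders - replace them with your own files
# DF = pd.read_csv('my_data.csv')
# DF = pd.read_excel('my_data.xlsx')
# DF_list = pd.read_html('https://www.example.com/some_page')   # returns a LIST of DataFrames
# DF = pd.read_json('my_data.json')
# DF = pd.read_table('my_data.txt', sep=':')                    # sep is the delimiter between columns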
Try reading in some data - csv
Go to the Cal Fire Website and scroll to the bottom to see the Incident Data. We will download the file named ALL DATA AS CSV; this should put the data file into your Downloads folder.
Next you need to move the file mapdataall.csv from your Downloads folder into your Day8 folder where you are doing your homework. You can open your Downloads folder and drag the file into the JupyterLab side bar. Then I can run the command
DF = pd.read_csv('mapdataall.csv')
to load the data and look at the data frame.
DF = pd.read_csv('mapdataall.csv')
show(DF)
(Interactive table of the incident data. Columns: incident_name, incident_is_final, incident_date_last_update, incident_date_created, incident_administrative_unit, incident_administrative_unit_url, incident_county, incident_location, incident_acres_burned, incident_containment, incident_control, incident_cooperating_agencies, incident_longitude, incident_latitude, incident_type, incident_id, incident_url, incident_date_extinguished, incident_dateonly_extinguished, incident_dateonly_created, is_active, calfire_incident, notification_desired.)
Answer some questions about this data
Q How many variables and observations?
Q How many different incident types are there?
Q Make a histogram of the acres burned and color the bars by whether or not the incident was a calfire incident. You will probably need to make a mask to remove very small and very large fires. How many fires burned more than 100,000 acres? What is the largest fire in the data?
Q EXTRA - CHALLENGE - See if you can create a graph that answers the question: Are fires getting bigger or more frequent over time? You get complete creative control on how to answer this question!
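If you have never built a mask before, here is a minimal sketch of the idea for the histogram question above. The cutoff values are placeholders - pick your own based on what you see in the data.
# A mask is a True/False Series that selects only the rows you want (cutoffs below are placeholders)
mask = (DF['incident_acres_burned'] > 10) & (DF['incident_acres_burned'] < 100000)
DF_medium = DF[mask]
px.histogram(DF_medium, x='incident_acres_burned', color='calfire_incident')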
Try reading in some data - html
Now we will define a variable:
my_website = 'https://www.fire.ca.gov/incidents'
and see what happens when we use the pd.read_html() command.
First we need to install the lxml module… you can always install packages by running: conda install -y package-name
!conda install -y lxml
Retrieving notices: ...working... done
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/bellajagu/anaconda3/envs/DATA101
added / updated specs:
- lxml
The following packages will be downloaded:
package | build
---------------------------|-----------------
libxslt-1.1.39 | h76b75d6_0 248 KB conda-forge
lxml-5.3.0 | py312he28fd5a_1 1.3 MB conda-forge
------------------------------------------------------------
Total: 1.6 MB
The following NEW packages will be INSTALLED:
libxslt conda-forge/linux-64::libxslt-1.1.39-h76b75d6_0
lxml conda-forge/linux-64::lxml-5.3.0-py312he28fd5a_1
The following packages will be UPDATED:
openssl 3.3.1-hb9d3cd8_3 --> 3.3.2-hb9d3cd8_0
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
my_website = 'https://www.fire.ca.gov/incidents'
DF = pd.read_html(my_website)
DF
[ Incident Counties Started Acres Containment
0 Soda Fire San Luis Obispo 9/30/2024 80 0%
1 Ridge Fire Santa Barbara 9/29/2024 17 30%
2 Airport Fire Orange, Riverside 9/09/2024 23526 95%
3 Bridge Fire Los Angeles, San Bernardino 9/08/2024 54878 98%
4 Line Fire San Bernardino 9/05/2024 43251 80%
5 Coffee Pot Fire Tulare 8/03/2024 14104 93%]
Notice that this returns a LIST of DataFrames. I know this because it is not printed like a DataFrame and I can see the square brackets. To get the data frame I need to tell Python which item in the list I want to see. Since there is only one item on the list I can use
DF[0]
to get the DataFrame.
DF[0]
Incident | Counties | Started | Acres | Containment | |
---|---|---|---|---|---|
0 | Soda Fire | San Luis Obispo | 9/30/2024 | 80 | 0% |
1 | Ridge Fire | Santa Barbara | 9/29/2024 | 17 | 30% |
2 | Airport Fire | Orange, Riverside | 9/09/2024 | 23526 | 95% |
3 | Bridge Fire | Los Angeles, San Bernardino | 9/08/2024 | 54878 | 98% |
4 | Line Fire | San Bernardino | 9/05/2024 | 43251 | 80% |
5 | Coffee Pot Fire | Tulare | 8/03/2024 | 14104 | 93% |
read_html() opens the website and then tries to read in any table it finds there. You will see that we get less data than if we can find a .csv to download.
Try reading in some data from Wikipedia
Here we will explore academy award winning films. Go to the Wiki for the List of Academy Award Winning Films. Look at what type of data is there. How many tables? Any weird looking data?
Now read the html data into Python
= "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
my_website = pd.read_html(my_website) DF
More than one table
You will notice that this results in a list of two different data frames! DF[0] contains all the movies from the main table on the website and DF[1] contains information about other aspects of the awards.
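You can check this for yourself before picking a table:
print(len(DF))   # how many tables were found on the page (2 here)
DF[1]            # peek at the second table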
Let’s try a little analysis of the data in DF[0]
DF_raw = DF[0]
show(DF_raw)
(Interactive table output. Columns: Film, Year, Awards, Nominations.)
DF_raw.describe()
Film | Year | Awards | Nominations | |
---|---|---|---|---|
count | 1373 | 1373 | 1373 | 1373 |
unique | 1359 | 96 | 22 | 24 |
top | Little Women | 1945 | 1 | 1 |
freq | 3 | 21 | 979 | 550 |
Notice that we are getting very limited statistical information!! What is going wrong here?
Something is wrong with our data!
What should the data types be:
- Film - should be categorical (a word or string)
- Year - should be discrete numerical (a number or integer)
- Awards - should be discrete numerical (a number or integer)
- Nominations - should be discrete numerical (a number or integer)
So we should be able to get good statistics on ‘Year’, ‘Awards’, and ‘Nominations’. Let’s sub select just these columns.
my_columns = ['Year', 'Awards', 'Nominations']
DF_raw[my_columns].describe()
Year | Awards | Nominations | |
---|---|---|---|
count | 1373 | 1373 | 1373 |
unique | 96 | 22 | 24 |
top | 1945 | 1 | 1 |
freq | 21 | 979 | 550 |
This still is not making sense. What if I wanted to know the average number of awards?
Something is really wrong with our data - I get an error!
DF_raw['Awards'].mean()
TypeError: Could not convert string '7421111111111742111111111136211111111111132212221121111142111111223111113433111111 ... 1111101111111111521111211111111312123' to numeric
Help my data is broken!
What you see happening here is that the data is not in the format that you expected. Now as a data scientist you have to decide what to do!
Data Types
In Python there are four data types that we will work with most:
- String - this is a word, sentence, letter… surrounded by quotes.
- Integer - this is a number with no decimal points - fractions not allowed (rounded)
- Float - this is a number with decimal places - fractions allowed
- Boolean - this is True or False
Let’s decipher the error message:
“Could not convert string — to numeric.”
This is telling us that when Python tried to take that average it ran into strings (words). It then tried to convert those words into numbers, but ran into problems. So it cannot take the average here.
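A quick illustration of the difference, using some made-up values: averaging numbers works, while averaging strings triggers the same kind of error.
# Averaging numbers works fine...
np.mean([7, 4, 2])
# ...but averaging strings does not. Uncommenting the next line raises a similar TypeError:
# np.mean(['7', '4 (1)', '2'])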
Data Type Exploration
Below are some variables that contain different data - we will use the type() command to see what data type they are.
num1 = 1
num2 = 1.0
num3 = '1'
num4 = 1.6
num5 = '1.2'
print(type(num1))
print(type(num2))
print(type(num3))
print(type(num4))
print(type(num5))
<class 'int'>
<class 'float'>
<class 'str'>
<class 'float'>
<class 'str'>
Does this result match our definitions above?
Changing Data Types
Now what if we wanted to change these data types from one type to another? We can use the commands:
- str() - to change to a string
- int() - to change to an integer
- float() - to change to a float
Intentional Converting Between Types:
num1_str = str(num1)
print(type(num1_str))
num1_str
<class 'str'>
'1'
num2_int = int(num2)
print(type(num2_int))
num2_int
<class 'int'>
1
num3_int = int(num3)
print(type(num3_int))
num3_int
<class 'int'>
1
# int() does not round to the nearest value - it just drops the decimal part (1.6 becomes 1)
num4_int = int(num4)
print(type(num4_int))
num4_int
<class 'int'>
1
num5_float = float(num5)
print(type(num5_float))
num5_float
<class 'float'>
1.2
Unintentional Converting Between Types
Sometimes Python functions will automatically convert between types or make type decisions for you!
Here is an example of adding an integer and a float. Python returns the result as a float.
num1 = 1
num2 = 1.0

type(num1+num2)
num1+num2
2.0
Here we see that even though we send in integers, when we take the average it returns a float.
my_numbers = [2,2,2,2]
np.mean(my_numbers)
2.0
Fun with Strings!
Many times your data issues will arise when someone enters data in a wrong format. This often means that you have to deal with strings that need to be fixed. Here are some useful commands
- .lstrip() - removes left white space
- .rstrip() - removes right white space
- .replace(string1, string2) - replaces string1 with string2 as long as string1 is part of the original string
- .split(character) - splits the string into chunks based on the character provided
There are lots more, but this is enough to learn for now!
my_string = '   Year: 2024/25   '
len(my_string)
19
my_string = my_string.lstrip()
len(my_string)
16
my_string = my_string.rstrip()
len(my_string)
13
my_string = my_string.replace(':',',')
my_string
'Year, 2024/25'
my_broken_string = my_string.split('/')
my_broken_string
['Year, 2024', '25']
# Now we can select just one part of the original string
my_broken_string[1]
'25'
So How Can I Fix my Data?
Remember that this entire problem started when we were exploring the movie data stored in DF_raw. Let’s learn a new command to help us see what type of data is in the DataFrame. To report the type of data in each variable use the command
DF_raw.dtypes
DF_raw.dtypes
Film object
Year object
Awards object
Nominations object
dtype: object
object means that the data is a string or that there is at least one string in the column. Does it make sense for year to be a string?
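Here is a tiny made-up example of how this happens - a single string entry is enough to turn the whole column into object:
print(pd.Series([1945, 1946, 1947]).dtype)        # int64 - every entry is an integer
print(pd.Series([1945, 1946, '1931/32']).dtype)   # object - one string spoils the column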
year_data = DF_raw['Year'].value_counts()
DF_year = year_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_year
Year | count | |
---|---|---|
0 | 1945 | 21 |
1 | 1949 | 20 |
2 | 1942 | 20 |
3 | 1950 | 19 |
4 | 1948 | 19 |
... | ... | ... |
91 | 1931/32 | 10 |
92 | 1932/33 | 9 |
93 | 1928/29 | 7 |
94 | 1929/30 | 6 |
95 | 1930/31 | 6 |
96 rows × 2 columns
There are some years that look like normal integer years and others that are formatted like 1931/32. How can we fix this so we can do our analysis?
This is more of an art than a science - as a data scientist you need to make these types of decisions and keep track of why and how you make the decision. For the data above, my plan is to remove the extra data /32 and just use the first part of the string for year.
.apply() and lambda
The .apply() command lets you apply a function to every row along your data frame.
lambda is a special way to write a small function: you first write the word lambda, then choose a variable name, followed by a colon
lambda x:
then you give an operation to do to every x (in this case every entry in the “Year” column)
x.split('/')[0]
this will split up the string so that '1930/31' becomes ['1930', '31'], and the [0] then selects the first part of the list.
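You can try the lambda on a single made-up value first to see exactly what it does:
fix_year = lambda x: x.split('/')[0]
fix_year('1930/31')   # returns the string '1930'
fix_year('1945')      # returns '1945' - nothing to split, so the whole string is kept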
We can apply this operation and save the information over the “Year” column so the data will be in a better format!
DF_raw['Year'] = DF_raw['Year'].apply(lambda x: x.split('/')[0])
year_data = DF_raw['Year'].value_counts()
DF_year = year_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_year
Year | count | |
---|---|---|
0 | 1945 | 21 |
1 | 1949 | 20 |
2 | 1942 | 20 |
3 | 1950 | 19 |
4 | 1948 | 19 |
... | ... | ... |
91 | 1931 | 10 |
92 | 1932 | 9 |
93 | 1928 | 7 |
94 | 1929 | 6 |
95 | 1930 | 6 |
96 rows × 2 columns
Now we will turn all the data in the year column into integers. There are lots of ways to do this, but let's practice the .apply() with a lambda method.
DF_raw['Year'] = DF_raw['Year'].apply(lambda x: int(x))
DF_raw.dtypes
Film object
Year int64
Awards object
Nominations object
dtype: object
Notice that we fixed the Year column! Let's start working on the other columns.
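As a side note, .apply() with a lambda is not the only way to convert a column. Here is a sketch of two equivalent one-liners that should also work on this cleaned-up column (commented out because we already converted Year above):
# DF_raw['Year'] = DF_raw['Year'].astype(int)       # convert the whole column at once
# DF_raw['Year'] = pd.to_numeric(DF_raw['Year'])    # let Pandas choose a numeric type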
Do the same type of thing to the Awards column
Look at what kind of data is in the “Awards” column.
award_data = DF_raw['Awards'].value_counts()
DF_award = award_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_award
Awards | count | |
---|---|---|
0 | 1 | 979 |
1 | 2 | 156 |
2 | 3 | 77 |
3 | 4 | 47 |
4 | 0 (1) | 32 |
5 | 5 | 25 |
6 | 7 | 12 |
7 | 6 | 12 |
8 | 1 (1) | 8 |
9 | 8 | 7 |
10 | 11 | 3 |
11 | 9 | 3 |
12 | 0 (2) | 3 |
13 | 0 (1)[1] | 1 |
14 | 3 (1) | 1 |
15 | 4 (1) | 1 |
16 | 6 (1) | 1 |
17 | 5 (1) | 1 |
18 | 7 (1) | 1 |
19 | 8 (2) | 1 |
20 | 2 (1) | 1 |
21 | 10 | 1 |
What the heck are these extra () and [] numbers?
To understand this better I had to go back to the website. It seems that things in parentheses are special awards and things in square brackets are citations.
So I will choose to remove this information. THIS IS JUST ONE CHOICE - alternatively we could choose to add another column just for special awards if these were important to our analysis.
Looking at the data, I only want the numbers to the left of a ( so I will use this character to split my strings. I will also turn all the data into integer types.
DF_raw['Awards'] = DF_raw['Awards'].apply(lambda x: int(x.split('(')[0]))
award_data = DF_raw['Awards'].value_counts()
DF_award = award_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_award
Awards | count | |
---|---|---|
0 | 1 | 987 |
1 | 2 | 157 |
2 | 3 | 78 |
3 | 4 | 48 |
4 | 0 | 36 |
5 | 5 | 26 |
6 | 7 | 13 |
7 | 6 | 13 |
8 | 8 | 8 |
9 | 11 | 3 |
10 | 9 | 3 |
11 | 10 | 1 |
DF_raw.dtypes
Film object
Year int64
Awards int64
Nominations object
dtype: object
You Try - Nominations column
Q Now it is your turn. Fix the Nominations column so that all the data is in integer format.
First look at the data and notice that the [ character seems to split the number of nominations from something that looks like a reference.
nom_data = DF_raw['Nominations'].value_counts()
DF_nom = nom_data.reset_index().rename(columns={"index": "value", 0: "count"})
show(DF_nom)
print('I can see that I want the data to the left of the [ character')
DF_raw['Nominations'] = DF_raw['Nominations'].apply(lambda x: int(x.split('[')[0]))
nom_data = DF_raw['Nominations'].value_counts()
DF_nom = nom_data.reset_index().rename(columns={"index": "value", 0: "count"})
show(DF_nom)
DF_raw.dtypes
If everything went to plan you should see:
DF_raw.dtypes
Film object
Year int64
Awards int64
Nominations int64
dtype: object
Now we can do some statistics and continue our analysis!
DF_raw.describe()
Year | Awards | Nominations | |
---|---|---|---|
count | 1373.000000 | 1373.000000 | 1373.000000 |
mean | 1975.864530 | 1.573197 | 3.606701 |
std | 27.416755 | 1.392371 | 3.125371 |
min | 1927.000000 | 0.000000 | 0.000000 |
25% | 1952.000000 | 1.000000 | 1.000000 |
50% | 1975.000000 | 1.000000 | 2.000000 |
75% | 2000.000000 | 2.000000 | 5.000000 |
max | 2023.000000 | 11.000000 | 14.000000 |
Warning - Be Kind!
Fixing messy data can be a challenging and complicated process. Please be nice to yourself!
What is expected in this class:
- You can find some data online and load it into Python
- You can figure out what types of data are in each column
- You can say in words what format you would like the data to be in.
- You can patiently TRY to fix the data.
- You come get help from your Professor, TA, or friends before you get too frustrated!
Try to find some data of your own.
Your homework today will be to see if you can find some data of your own. This can be the first steps you take toward your final project.