# Some basic package imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
Intermediate Data Science
Data Loading, Storage, and File Formats
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours take place in Duke 209 – Office Hours Schedule
- Class Website
- Syllabus
Data and File Formats
Reading in data and making it accessible to your data science techniques is the first step in data science success. This is called data loading or sometimes parsing. As a data scientist you may not have complete control over the format of your data when you first start a project, so it is part of your job to find a way to load it and parse it.
Here are some important data types:
- CSV (Comma-Separated Values)
CSV is a simple, text-based format that stores tabular data. Each line in the file represents a data record (a row), and each record consists of one or more fields (columns) separated by commas. The data does not have types and is read in as strings.
- Text - General
This is a generalization of CSV types where the data could be just written to lines, or stored as a single string. Once this is read in you would need to do a lot of work to parse it and turn it into usable data. It is extremely inefficient, but sometimes this is how the data comes to you.
- JSON (JavaScript Object Notation)
JSON is a text-based format for representing semi-structured data using key-value pairs, similar to Python dictionaries. It’s great for data that has a nested or hierarchical structure. It is human readable but less efficient for storing large flat tables.
- Apache Parquet
Parquet is a binary, columnar storage format. Instead of storing data row-by-row like a CSV, it stores all the values for a single column together. This structure is highly optimized for analytical queries. This format can make it really fast to read in subsets of the data and allows for compression of the data. It is not human readable.
- Pickle
Pickle is a Python-specific binary format used for serializing and de-serializing Python objects. It can turn almost any Python object (like a list, a dictionary, or even a trained machine learning model) into a stream of bytes that can be saved to a file. It can handle very complex Python objects and is extremely easy to use, but it can have some compatibility and security issues.
- HDF5 (Hierarchical Data Format 5)
HDF5 is a high-performance binary format designed to store and organize massive amounts of data. It acts like a file system within a single file, allowing you to store multiple datasets (e.g., arrays, tables) in a hierarchical structure. It is great for large, complex scientific datasets, but has a steep learning curve.
- Geospatial Data
Geospatial data are typically categorized as either vector (representing features with points, lines, and polygons) or raster (representing features as a grid of pixels or cells). There are many formats for these files. One popular one is GeoJSON - which leverages the JSON file format but allows for vector based data. GeoTIFF is standard for raster data.
- Web APIs (Application Programming Interface) - not a data type but a data access type
Many websites have public APIs providing data feeds via one of the formats above. If a website has an API, you should use this interface to get data rather than trying to scrape the site.
- Databases - again not a data type
In many cases data can be stored in a database: SQL server, MySQL, NoSQL Mongo, or Graph Databases. In these cases you query the database to get access to subsets of the data. The choice of database is highly dependent on the project needs and scalability.
Pandas Data Loading Functions
Most of this class will focus on using Pandas for our data management. NOTE: if you are a geospatial student there is a sister package called GeoPandas that has very similar functionality to Pandas.
Function | Type | Description | Common File Types |
---|---|---|---|
pd.read_csv() | Text | Reads comma-separated values into a DataFrame. | .csv |
pd.read_table() | Text | Reads general delimited text files (default delimiter is tab). | .txt , .tsv |
pd.read_fwf() | Text | Reads fixed-width formatted text files. | .txt |
pd.read_json() | Text | Reads JSON (JavaScript Object Notation) data. | .json |
pd.read_html() | Text | Parses HTML tables and returns them as a list of DataFrames. | .html |
pd.read_xml() | Text | Reads XML data into a DataFrame. | .xml |
pd.read_sql() | Text | Reads data from a SQL query or database connection. | (SQL database) |
pd.read_excel() | Binary | Reads Excel spreadsheets (both .xls and .xlsx ). | .xls , .xlsx |
pd.read_pickle() | Binary | Reads a Python object saved with pickle . | .pkl |
pd.read_parquet() | Binary | Reads Apache Parquet columnar storage files. | .parquet |
pd.read_orc() | Binary | Reads Apache ORC columnar storage files. | .orc |
pd.read_feather() | Binary | Reads Feather binary columnar storage files. | .feather |
pd.read_sas() | Binary | Reads SAS files (both XPORT and SAS7BDAT formats). | .xpt , .sas7bdat |
pd.read_spss() | Binary | Reads SPSS system files. | .sav , .zsav |
pd.read_stata() | Binary | Reads Stata dataset files. | .dta |
pd.read_hdf() | Binary | Reads HDF5 (Hierarchical Data Format) files. | .h5 , .hdf5 |
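For a taste of the binary formats, here is a minimal round-trip sketch through Parquet. This assumes you have pyarrow (or fastparquet) installed as the Parquet engine, and the tiny DataFrame is made up just for illustration:
# A tiny made-up DataFrame just for illustration
df_demo = pd.DataFrame({"a": [1, 2, 3], "message": ["hello", "world", "foo"]})
# Write it out in the Parquet columnar format (needs pyarrow or fastparquet installed)
df_demo.to_parquet("demo.parquet")
# Read it back in - notice the column types come back intact
df_back = pd.read_parquet("demo.parquet")
print(df_back.dtypes)
Because Parquet stores each column together, pd.read_parquet also accepts a columns= argument so you can pull in just the columns you need.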
Python Data Loading Functions
- Read Write
A standard built-in way to read and write files. It reads/writes text line by line. Variable types are not saved and everything is assumed to be a string.
# ---- WRITE ----
data = [
"Name,Score",
"Alice,85",
"Bob,92",
"Charlie,78"
]
with open("students.txt", "w") as f:
for line in data:
f.write(line + "\n")
# ---- READ ----
with open("students.txt", "r") as f:
contents = f.readlines()
print("Contents of students.txt:")
for line in contents:
print(line.strip())
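If you wanted to turn those raw lines back into rows and columns yourself, here is a minimal sketch continuing the students.txt example above. Notice that every field is still a string until you convert it:
# Split each line on commas - every field is still a string
with open("students.txt", "r") as f:
    rows = [line.strip().split(",") for line in f]
header = rows[0]    # ['Name', 'Score']
records = rows[1:]  # [['Alice', '85'], ['Bob', '92'], ['Charlie', '78']]
print(header)
print(records)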
- JSON
Great for reading in or saving dictionaries. It will preserve some data types: dict (keys must be strings), lists/tuples (tuples become lists), strings, ints, floats, booleans, and None. Other data types will not be properly encoded without extra work.
import json
# Some Python object (dict with student scores)
students = {"Alice": 85, "Bob": 92, "Charlie": 78}
# ---- WRITE ----
with open("students.json", "w") as f:
json.dump(students, f, indent=4) # indent=4 makes it pretty
# ---- READ ----
with open("students.json", "r") as f:
loaded_students = json.load(f)
print("Data loaded from JSON:")
print(loaded_students)
- Pickle
Great in Python and preserves Python types. However, pickle is Python specific and it is not easy to load .pkl files into other languages.
import pickle
# Some Python object with mixed types
students = {
"Alice": (85, "A"), # tuple
"Bob": {"math": 92}, # dict
"Charlie": [78, 80] # list
}
# ---- WRITE ----
with open("students.pkl", "wb") as f: # 'wb' = write binary
pickle.dump(students, f)
# ---- READ ----
with open("students.pkl", "rb") as f: # 'rb' = read binary
loaded_students = pickle.load(f)
print("Original:", students)
print("Loaded:", loaded_students)
print("Types preserved:", type(loaded_students["Alice"]))
Code Examples
Here are some examples of reading in these data types:
Optional Arguments in Pandas Data Loading
All of the Pandas functions for Data Loading have optional arguments. These help refine how you load the data, how much data you load, data conversion, and even what to do with bad data. Here are the main categories:
- Indexing These arguments help you choose which columns or rows are returned in the DataFrame, and whether or not to get column or index names from the data set.
- Type inference and data conversion These arguments help you convert data types or customize data as you read it in. This might include missing value markers.
- Date and time parsing These arguments help you combine date and time information spread over multiple columns into a single column in the result.
- Iterating Useful for very large files - you can load in chunks.
- Unclean data issues These arguments allow you to skip over header or footer rows, or tell Pandas what to do with comments or numbers that use commas as thousands separators.
It can be overwhelming when you see the full list of optional arguments:
# Run this cell to see the documentation for read_csv()
# pd.read_csv?
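As a taste of how the categories above combine in practice, here is a hedged sketch of a single read_csv call. The file name and column names are made up for illustration; only the argument names are real:
# Hypothetical file and columns - just to show how the arguments fit together
df_example = pd.read_csv(
    "data/made_up_example.csv",
    index_col="id",              # Indexing: use the 'id' column as the row labels
    dtype={"score": "float64"},  # Type conversion: force 'score' to be a float
    parse_dates=["date"],        # Date/time parsing: convert 'date' to datetime
    na_values=["missing"],       # Missing data: treat the string 'missing' as NaN
    skiprows=[1],                # Unclean data: skip a known junk row
    nrows=1000,                  # Only read the first 1000 rows of data
)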
Reading in CSV files
file_name = 'data/ex1.csv'
df = pd.read_csv(file_name)
df
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Notice that this data is read in with a nice header row telling Pandas how to label the columns. This is not always the case! Pandas assumes that the first row is a header, which can lead to some confusing results if you just read in the file:
file_name = 'data/ex2.csv'
df_nh = pd.read_csv(file_name)
df_nh
1 | 2 | 3 | 4 | hello | |
---|---|---|---|---|---|
0 | 5 | 6 | 7 | 8 | world |
1 | 9 | 10 | 11 | 12 | foo |
Notice how the first row of the data was made into the column labels! This is not what we want… so we have to look into the optional arguments to correct the issue.
# With default numbered column names
file_name = 'data/ex2.csv'
df_nh = pd.read_csv(file_name, header=None)
df_nh
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# With column names that you give
file_name = 'data/ex2.csv'
df_nh = pd.read_csv(file_name, header=None, names=["a", "b", "c", "d", "message"])
df_nh
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# We can even move the "message" column to be the index (row labels)
names = ["a", "b", "c", "d", "message"]
pd.read_csv("data/ex2.csv", names=names, index_col="message")
a | b | c | d | |
---|---|---|---|---|
message | ||||
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
You Try
Here is a file that does not just read in nicely. See if you can use optional arguments to read it in.
Hint: How many (and which) rows of this data are just junk?
Terminal Command Line:
The command
cat data/ex4.csv
if typed into a terminal, prints out the contents of the file line by line. This lets us take a quick look at what is in the file. BEWARE - if you do this with a large file it will take a long time to print! Another great command is:
head data/ex4.csv
which would show just the first 10 lines of the file!
# This code lets you look at the data
# the terminal command "cat" prints the contents of a file
# when we do !cat filename we can look at the file from inside the notebook
file_name = 'data/ex4.csv'
!cat data/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# Your code here
Missing Data in CSV files
Here is another file that could give you issues. Notice that when we look at the file there are THREE different types of missing values:
- One marked NA
- One just missing ,,
- One marked None
All of these could mean different things in our data set, but look what happens when Pandas reads them in:
file_name = 'data/ex5.csv'
!cat data/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,None,foo
file_name = 'data/ex5.csv'
df_nan = pd.read_csv(file_name)
df_nan
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4.0 | NaN |
1 | two | 5 | 6 | NaN | 8.0 | world |
2 | three | 9 | 10 | 11.0 | NaN | foo |
All three of these are treated as NaN! So we should be EXTRA careful when we see NaNs in our Pandas data! There are ways to tell Pandas how to handle specific types of missing values:
= pd.read_csv("data/ex5.csv", keep_default_na=False,
df_nan =["NA"])
na_values
df_nan
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3 | 4 | NaN |
1 | two | 5 | 6 | 8 | world | |
2 | three | 9 | 10 | 11 | None | foo |
df_nan.isna()
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | False | False | False |
2 | False | False | False | False | False | False |
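You can go one step further and give different missing-value markers to different columns by passing a dictionary to na_values. Here is a small sketch using the same file; which marker belongs to which column is just for illustration:
# Treat "NA" as missing only in the message column, and "None" as missing only in column d
sentinels = {"message": ["NA"], "d": ["None"]}
pd.read_csv("data/ex5.csv", keep_default_na=False, na_values=sentinels)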
Weird things do happen in CSV files!
Notice what happens when we try to read in a .txt file that is not comma separated:
!cat data/ex3.txt
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
= pd.read_csv("data/ex3.txt")
df_oops df_oops
A B C | |
---|---|
0 | aaa -0.264438 -1.026059 -0.619500 |
1 | bbb 0.927272 0.302904 -0.032399 |
2 | ccc -0.264273 -0.386314 -0.217601 |
3 | ddd -0.871858 -0.348382 1.100491 |
What happened here? Well, our file was not comma separated, so pandas just did its best: each whole line was read in as a single value in one column. But not all is lost, we just need to tell Pandas that the data is separated by whitespace!
= pd.read_csv("data/ex3.txt", sep="\\s+")
df_oops df_oops
A | B | C | |
---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
Reading in Pieces of CSV files
Sometimes your data is REALLY big and you just want to read in small pieces of it. This is especially useful when you are doing initial exploration of a HUGE data set. It allows you to look at a small bit of data quickly before diving into the full dataset.
= pd.read_csv("data/ex6.csv")
df df.shape
(10000, 5)
"data/ex6.csv", nrows=5) pd.read_csv(
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
= pd.read_csv("data/ex6.csv", chunksize=5)
chunker chunker
<pandas.io.parsers.readers.TextFileReader at 0x7c371345f9d0>
# Now let's say we just want to grab the keys and do a value_counts
# without holding the whole dataset in memory

# Create an empty series
tot = pd.Series([], dtype='int64')

# Loop over each piece in the chunker
for piece in chunker:
    # Add the value counts for that chunk to the running total,
    # allowing for adding zero in the case that we don't observe a key in that chunk
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

# Sort the values descending
tot = tot.sort_values(ascending=False)
# Show the top 10 values in the output
tot.head(10)
key
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
M 338.0
J 337.0
F 335.0
K 334.0
H 330.0
dtype: float64
# Now that chunker has been used it is empty
# This code will print nothing!
for piece in chunker:
    print(piece)
This is just the beginning!
There is no way that we can go over every possible data catastrophe that can happen when reading in a .csv. But just know that there are lots of arguments that can help you to read in the data properly. It is always worth glancing at the data using cat or head in a terminal just to see what you should expect.
Writing data to a CSV file
Above we created a Pandas Series object that holds all the value counts for a large file. Now what if we want to save it so we don’t have to run that process again? We can use the .to_csv() command to save the data as a comma separated file. Again there are LOTS of optional arguments:
- sep=‘|’ – would change the separator from a comma to a |
- index=False – would not save the index as a separate column. If you look at the file generated by the code below, the index (row labels) are saved as a column in the final dataset.
- header=False – would tell pandas not to save the column names
- columns = [enter col names here] – would only save specific columns
# Running this command will create a new file in your working directory
tot.to_csv('key_value_counts.csv')
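For example, combining a few of those options looks like this (a sketch; the output file name is arbitrary):
# Save the counts pipe-separated, without the index column or a header row
tot.to_csv('key_value_counts_pipe.txt', sep='|', index=False, header=False)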
Reading and Writing in JSON files
JSON files are stored in a format that mimics Python dictionaries. Here you can see what is inside a .json file:
!cat data/example.json
[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]
# Read it into pandas
df = pd.read_json("data/example.json")
df
a | b | c | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
# Make a small change
df["sum"] = df.sum(axis=1)
df
a | b | c | sum | |
---|---|---|---|---|
0 | 1 | 2 | 3 | 6 |
1 | 4 | 5 | 6 | 15 |
2 | 7 | 8 | 9 | 24 |
# Save the data
# This will save to your working directory
df.to_json('example_with_sum.json')
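To double-check the round trip, you can read the saved file straight back in (a quick sketch):
# Read the saved file back into a DataFrame
pd.read_json('example_with_sum.json')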
Pandas and JSON files
- You can use either Pandas or import json to interact with json
- Pandas is very good at reading in a dictionary - BUT the data must have the right shapes - all values must have the same number of rows
Here is an example dictionary:
= {"name": "Wes",
my_dict "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
"pet": None,
"siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]
{
} my_dict
{'name': 'Wes',
'cities_lived': ['Akron', 'Nashville', 'New York', 'San Francisco'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 34, 'hobbies': ['guitars', 'soccer']},
{'name': 'Katie', 'age': 42, 'hobbies': ['diving', 'art']}]}
Notice that in this data each key had a different length object stored in it. Also each of the objects is different!
This is 100% okay to do in a dictionary, but Pandas expects things to be cleaner. Let’s take a look at the parts of the dictionary
for key in my_dict.keys():
    print(key)
    print(my_dict[key])
    print('----------')
name
Wes
----------
cities_lived
['Akron', 'Nashville', 'New York', 'San Francisco']
----------
pet
None
----------
siblings
[{'name': 'Scott', 'age': 34, 'hobbies': ['guitars', 'soccer']}, {'name': 'Katie', 'age': 42, 'hobbies': ['diving', 'art']}]
----------
# This code will give you an error
df = pd.DataFrame(my_dict)
df
ValueError: All arrays must be of the same length
# This code will work
df = pd.DataFrame(my_dict['siblings'])
df
name | age | hobbies | |
---|---|---|---|
0 | Scott | 34 | [guitars, soccer] |
1 | Katie | 42 | [diving, art] |
You Try
Can you explain what is going on in the examples above? Why does one give an error and the other works? What specifically is it about focusing in on the siblings data that allows pandas to read this?
Web Scraping
Python has many libraries for web scraping: lxml, Beautiful Soup, and html5lib. These packages are great for more advanced web scraping and dealing with malformed files. However, Pandas has a good built-in function that will read html and parse tables as DataFrame objects.
First, let's make sure you have the web scraping packages installed
!conda install -y lxml beautifulsoup4 html5lib
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
# All requested packages already installed.
We are just going to look at .read_html() for right now. The book has a nice discussion of how to parse .xml files if you are interested in getting deeper into web scraping. There are lots of other tutorials online and we will talk more about web scraping later in the semester.
# This file comes from the book github. If you want to see the full data look here:
# https://www.fdic.gov/bank-failures/failed-bank-list
tables = pd.read_html("data/fdic_failed_bank_list.html")
len(tables)
1
Pandas reads in the .html file and looks for tables. It puts the results in a list of DataFrames. In the example above Pandas found one table, so to look at the DataFrame for this table we need to get it out of the list.
failures = tables[0]
failures.head()
Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date | |
---|---|---|---|---|---|---|---|
0 | Allied Bank | Mulberry | AR | 91 | Today's Bank | September 23, 2016 | November 17, 2016 |
1 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | August 19, 2016 | November 17, 2016 |
2 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | September 6, 2016 |
3 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | September 6, 2016 |
4 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 |
You Try
Here is an example website that contains a table:
https://www.scrapethissite.com/pages/forms/
- Open the website in your browser. Does the page that appears contain ALL the data about hockey teams?
- How does the web address change when you select the second page of the website?
- See if you can write code that will scrape all of the data. HINT: I would use a for loop that updates the web address and appends the new table to a list.
- Once you have the list of tables can you get them into a single data frame and save the data as a .csv?
# Here is how I could get one page
website = 'https://www.scrapethissite.com/pages/forms/'
tables = pd.read_html(website)
len(tables)
1
# Your code here
Pickle - binary data formats
DataFrame objects have a .to_pickle() method that is great for storing python data. Pickled files have the advantage of preserving python data types but the disadvantage that they are not easily readable by other programming languages. Let’s see how this works.
# Read in example 1 again
df = pd.read_csv("data/ex1.csv")
df
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# Now let's write the file as a pickle
df.to_pickle("frame_pickle.pkl")
# Lets look to see what is in the file
!cat frame_pickle.pkl
��f�pandas.core.frame�� DataFrame���)��}�(�_mgr��pandas.core.internals.managers��BlockManager����pandas._libs.internals��_unpickle_block����numpy._core.numeric��_frombuffer���(�`
��numpy��dtype����i8�����R�(K�<�NNNJ����J����Kt�bKK���C�t�R�builtins��slice���KKK��R�K��R�h�numpy._core.multiarray��_reconstruct���h�ndarray���K��Cb���R�(KKK��h�O8�����R�(K�|�NNNJ����J����K?t�b�]�(�hello��world��foo�et�bhKKK��R�K��R���]�(�pandas.core.indexes.base��
_new_Index���h=�Index���}�(�data�h%h'K��h)��R�(KK��h/�]�(�a��b��c��d��message�et�b�name�Nu��R�h?�pandas.core.indexes.range��
RangeIndex���}�(hON�start�K�stop�K�step�Ku��R�e��R��_typ�� dataframe�� _metadata�]��attrs�}��_flags�}��allows_duplicate_labels��sub.
Notice that the file is not human readable! This is a binary data format.
You should only use Pickle for short term data storage. Changes to Python could mean that the data is not readable over the long term. Programmers try to preserve backward compatibility, but there is no absolute guarantee. Pickle is GREAT for when you need to save data in the short term to be available to Python code.
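To load the pickled DataFrame back into Python, a minimal sketch:
# Read the pickled DataFrame back in - types and structure come back exactly as saved
df_unpickled = pd.read_pickle("frame_pickle.pkl")
df_unpickled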
Reading Excel Files
Pandas can read .xls and .xlsx files created by Microsoft Excel. To use this functionality we need to make sure a few helper modules are installed:
!conda install -y openpyxl xlrd
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
# All requested packages already installed.
One issue with .xlsx files is that they can contain some pretty complex data arranged in sheets. If you just call .read_excel() pandas will try to read just one sheet and you could be missing a bunch of your data. Instead you can explore the file using .ExcelFile()
# Get the data into the xlsx parser
xlsx = pd.ExcelFile("data/ex1.xlsx")
# Notice this does not let you see the data
xlsx
<pandas.io.excel._base.ExcelFile at 0x70e023e13b10>
# Get the sheet names
xlsx.sheet_names
['Sheet1', 'Sheet2']
# Parse the sheet you want
# This is a data frame - you could totally just use xlsx.parse if you want
# df = xlsx.parse(sheet_name="Sheet1") would work
xlsx.parse(sheet_name="Sheet1")
Unnamed: 0 | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | hello |
1 | 1 | 5 | 6 | 7 | 8 | world |
2 | 2 | 9 | 10 | 11 | 12 | foo |
# We could read the first column in as the index
xlsx.parse(sheet_name="Sheet1", index_col=0)
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Pandas has a built in read_excel() function. It is easy to remember! But beware of missing sheets!
# Notice that if we just call read_excel we only get the first sheet!
# There is no error to tell us we are missing a sheet.
df = pd.read_excel("data/ex1.xlsx")
df
Unnamed: 0 | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | hello |
1 | 1 | 5 | 6 | 7 | 8 | world |
2 | 2 | 9 | 10 | 11 | 12 | foo |
# I might have to loop over the sheets to read them all in
df_list = []
for sn in xlsx.sheet_names:
    df = pd.read_excel("data/ex1.xlsx", sheet_name=sn, index_col=0)
    df_list.append(df)
df_list
[ a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo,
a b c d message
0 12 15 54 34 hello
1 23 5656 34 364 world
2 45 2 1 12 foo]
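If you would rather have one DataFrame than a list of them, one option (a sketch) is to concatenate the pieces; keys= labels each block of rows with the sheet it came from:
# Stack the per-sheet DataFrames into one, labeling rows by sheet name
all_sheets = pd.concat(df_list, keys=xlsx.sheet_names)
all_sheets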
Interacting with Web APIs.
Anytime you are thinking about scraping data from a website you should first look to see if it has an API available for getting the data. Always download the data directly or use an API before choosing to scrape a website. Many websites are now written to stop people from basic web scraping because scraping can cause lots of problems. Websites don't hate scraping, but they want to make sure access is controlled, fair, legal, and doesn't break the site. That's why many provide official APIs as a safer alternative.
First let’s install a package
!conda install -y requests
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
# All requested packages already installed.
import requests

# Example: Free TheSportsDB API (soccer team lookup)
# Based on the information here: https://www.thesportsdb.com/documentation#search_v1
# We can use the URL
url = "https://www.thesportsdb.com/api/v1/json/3/searchteams.php"
# Then to find a specific team we add
# https://www.thesportsdb.com/api/v1/json/123/searchteams.php?t=Arsenal
# We can use this by sending a parameter into requests
params = {"t": "San Diego Wave"}  # Search for team name

# Get the data
resp = requests.get(url, params=params)
# Check that we got actual data
resp.raise_for_status()
resp
<Response [200]>
# Look at the data
resp.json()
{'teams': [{'idTeam': '145094',
'idESPN': '0',
'idAPIfootball': '18451',
'intLoved': '1',
'strTeam': 'San Diego Wave',
'strTeamAlternate': '',
'strTeamShort': '',
'intFormedYear': '2021',
'strSport': 'Soccer',
'strLeague': 'American NWSL',
'idLeague': '4521',
'strLeague2': 'American NWSL Challenge Cup',
'idLeague2': '5178',
'strLeague3': 'CONCACAF W Champions Cup',
'idLeague3': '5640',
'strLeague4': '',
'idLeague4': None,
'strLeague5': '',
'idLeague5': None,
'strLeague6': '',
'idLeague6': None,
'strLeague7': '',
'idLeague7': None,
'strDivision': None,
'idVenue': '24552',
'strStadium': 'Snapdragon Stadium',
'strKeywords': '',
'strRSS': '',
'strLocation': 'San Diego, California',
'intStadiumCapacity': '35000',
'strWebsite': '',
'strFacebook': '',
'strTwitter': '',
'strInstagram': '',
'strDescriptionEN': "San Diego Wave FC is a National Women's Soccer League expansion team that is expected to begin play in 2022. The team will be based in San Diego, California. The team is owned by Ron Burkle.\r\n\r\nThe team will be the San Diego area's first women's professional soccer team since 2003, when the Women's United Soccer Association folded and forced the San Diego Spirit to disband.",
'strDescriptionDE': None,
'strDescriptionFR': None,
'strDescriptionCN': None,
'strDescriptionIT': None,
'strDescriptionJP': None,
'strDescriptionRU': None,
'strDescriptionES': None,
'strDescriptionPT': None,
'strDescriptionSE': None,
'strDescriptionNL': None,
'strDescriptionHU': None,
'strDescriptionNO': None,
'strDescriptionIL': None,
'strDescriptionPL': None,
'strColour1': '',
'strColour2': '',
'strColour3': '',
'strGender': 'Female',
'strCountry': 'United States',
'strBadge': 'https://r2.thesportsdb.com/images/media/team/badge/hb5yo61644517945.png',
'strLogo': 'https://r2.thesportsdb.com/images/media/team/logo/tpws861652790115.png',
'strFanart1': 'https://r2.thesportsdb.com/images/media/team/fanart/m145vl1652791877.jpg',
'strFanart2': 'https://r2.thesportsdb.com/images/media/team/fanart/lbt1x51652791632.jpg',
'strFanart3': 'https://r2.thesportsdb.com/images/media/team/fanart/tdl5qq1652791678.jpg',
'strFanart4': 'https://r2.thesportsdb.com/images/media/team/fanart/mo0pga1652791758.jpg',
'strBanner': 'https://r2.thesportsdb.com/images/media/team/banner/o5f9ca1652792180.jpg',
'strEquipment': 'https://r2.thesportsdb.com/images/media/team/equipment/z8gk2w1715200127.png',
'strYoutube': '',
'strLocked': 'unlocked'}]}
The data is in a dictionary format and to play around with it in Pandas you would need to do some work. You could imagine importing a few different teams each as a pandas series and then creating a data frame with all the information.
= "https://www.thesportsdb.com/api/v1/json/3/searchteams.php"
url
= ["San Diego Wave", "Portland Trail Blazers","Liverpool", "San Francisco Unicorns"]
my_teams = []
series_list for team in my_teams:
= {"t": team} # Search for team name
params = requests.get(url, params=params)
resp # Check that we got actual data
if resp.status_code == 200:
= resp.json()
data 'teams'][0]))
series_list.append(pd.Series(data[else:
print(f"Unable to find {team}")
pd.DataFrame(series_list)
idTeam | idESPN | idAPIfootball | intLoved | strTeam | strTeamAlternate | strTeamShort | intFormedYear | strSport | strLeague | ... | strBadge | strLogo | strFanart1 | strFanart2 | strFanart3 | strFanart4 | strBanner | strEquipment | strYoutube | strLocked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 145094 | 0 | 18451 | 1 | San Diego Wave | 2021 | Soccer | American NWSL | ... | https://r2.thesportsdb.com/images/media/team/b... | https://r2.thesportsdb.com/images/media/team/l... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/b... | https://r2.thesportsdb.com/images/media/team/e... | unlocked | |||
1 | 134888 | None | 156 | None | Portland Trail Blazers | POR | 1970 | Basketball | NBA | ... | https://r2.thesportsdb.com/images/media/team/b... | https://r2.thesportsdb.com/images/media/team/l... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/b... | https://r2.thesportsdb.com/images/media/team/e... | unlocked | ||
2 | 133602 | 364 | 40 | 11 | Liverpool | LFC, Liverpool FC | LIV | 1892 | Soccer | English Premier League | ... | https://r2.thesportsdb.com/images/media/team/b... | https://r2.thesportsdb.com/images/media/team/l... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/f... | https://r2.thesportsdb.com/images/media/team/b... | https://www.thesportsdb.com/images/media/team/... | youtube.com/user/LiverpoolFC | unlocked |
3 | 147493 | 0 | 0 | 1 | San Francisco Unicorns | SF Unicorns | SFU | 2023 | Cricket | Major League Cricket | ... | https://r2.thesportsdb.com/images/media/team/b... | https://r2.thesportsdb.com/images/media/team/l... | None | None | None | None | None | None | www.youtube.com/@SFOUnicorns | unlocked |
4 rows × 64 columns
Interacting with Databases
In many instances data is not stored as a .txt, .json, etc file. Instead it is stored in a database. There are many kinds of databases and the choice of a database is highly dependent on the type, size, scalability, and performance needed for the system.
Pandas does have the ability to interact with SQL-based relational databases. First let's create a database that we can play with.
import sqlite3
# We start by creating a table named "test" that tells SQLite what kind of data to expect
# a - text up to 20 characters
# b - text up to 20 characters
# c - a floating-point number (REAL)
# d - an integer
# Notice this is just a string that contains the information in a tuple like format.
= """
query CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
# We open or create a SQLite database
# con is a connection object
= sqlite3.connect("mydata.sqlite")
con # We create the table using our text query above
con.execute(query)# We save the changes to the data base
con.commit()
Because con is a database connection object you can see what kind of commands you can run by typing
con.
and then pressing the TAB button. Remember you can always access the documentation with the ?
# Prepare some data for the database - a list of tuples that matches the query above
data = [("Atlanta", "Georgia", 1.25, 6),
        ("Tallahassee", "Florida", 2.6, 3),
        ("Sacramento", "California", 1.7, 5)]

# Define placeholders
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

# Insert multiple rows into the database
con.executemany(stmt, data)
# and save them
con.commit()
Now we are ready to interact with the database. We start by defining a cursor that will access the database. We need to use SQL statements (Structured Query Language), written as a Python string:
- SELECT - tells the database you want to retrieve data.
- * - means all columns. Instead of listing column names one by one (a, b, c, d), you grab everything.
- FROM test - says which table to select data from (here, the table is named test).
Here is a website where you can learn more: https://www.w3schools.com/sql/sql_intro.asp
= con.execute("SELECT * FROM test")
cursor cursor
<sqlite3.Cursor at 0x70e0b8894a40>
Now we have options:
- Use .fetchone() if you want to process rows one by one (e.g., in a loop).
- Use .fetchmany(n) if you want to grab results in chunks.
- Use .fetchall() if the dataset is small and you just want everything at once.
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
= con.execute("SELECT * FROM test")
cursor = cursor.fetchone()
rows rows
('Atlanta', 'Georgia', 1.25, 6)
= con.execute("SELECT * FROM test")
cursor = cursor.fetchmany(2)
rows rows
[('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3)]
SQL has options for more refined searches!
= con.execute("SELECT * FROM test WHERE b=='California'")
cursor = cursor.fetchall()
rows rows
[('Sacramento', 'California', 1.7, 5)]
= con.execute("SELECT * FROM test WHERE d>3")
cursor = cursor.fetchall()
rows rows
[('Atlanta', 'Georgia', 1.25, 6), ('Sacramento', 'California', 1.7, 5)]
= con.execute("SELECT * FROM test ORDER BY c DESC")
cursor = cursor.fetchall()
rows rows
[('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5),
('Atlanta', 'Georgia', 1.25, 6)]
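Since Pandas can also talk to the database directly, here is a minimal sketch using pd.read_sql with the same connection. Pandas may warn that it prefers a SQLAlchemy connection, but this works with the sqlite3 connection object we already have:
# Run a query and get the result straight back as a DataFrame,
# using the same sqlite3 connection object from above
df_sql = pd.read_sql("SELECT * FROM test", con)
df_sql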
Code provided by Websites
Some websites provide modules and Python code that help you read your data directly into Python. For example Kaggle provides information about how to download the data and get the path to the data directly in Python.
Here is an example:
import kagglehub

# Download latest version of the data
path = kagglehub.dataset_download("mdsultanulislamovi/student-stress-monitoring-datasets")

print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.8), please consider upgrading to the latest version (0.3.13).
Path to dataset files: /home/bellajagu/.cache/kagglehub/datasets/mdsultanulislamovi/student-stress-monitoring-datasets/versions/1
# Copy the data directory path
path = '/home/bellajagu/.cache/kagglehub/datasets/mdsultanulislamovi/student-stress-monitoring-datasets/versions/1/'
# List all the files in the directory
os.listdir(path)
['Stress_Dataset.csv', 'StressLevelDataset.csv']
# Now load the file you want to look at into Pandas
file = path + 'Stress_Dataset.csv'
df = pd.read_csv(file)
df.head()
Gender | Age | Have you recently experienced stress in your life? | Have you noticed a rapid heartbeat or palpitations? | Have you been dealing with anxiety or tension recently? | Do you face any sleep problems or difficulties falling asleep? | Have you been dealing with anxiety or tension recently?.1 | Have you been getting headaches more often than usual? | Do you get irritated easily? | Do you have trouble concentrating on your academic tasks? | ... | Are you facing any difficulties with your professors or instructors? | Is your working environment unpleasant or stressful? | Do you struggle to find time for relaxation and leisure activities? | Is your hostel or home environment causing you difficulties? | Do you lack confidence in your academic performance? | Do you lack confidence in your choice of academic subjects? | Academic and extracurricular activities conflicting for you? | Do you attend classes regularly? | Have you gained/lost weight? | Which type of stress do you primarily experience? | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 20 | 3 | 4 | 2 | 5 | 1 | 2 | 1 | 2 | ... | 3 | 1 | 4 | 1 | 2 | 1 | 3 | 1 | 2 | Eustress (Positive Stress) - Stress that motiv... |
1 | 0 | 20 | 2 | 3 | 2 | 1 | 1 | 1 | 1 | 4 | ... | 3 | 2 | 1 | 1 | 3 | 2 | 1 | 4 | 2 | Eustress (Positive Stress) - Stress that motiv... |
2 | 0 | 20 | 5 | 4 | 2 | 2 | 1 | 3 | 4 | 2 | ... | 2 | 2 | 2 | 1 | 4 | 1 | 1 | 2 | 1 | Eustress (Positive Stress) - Stress that motiv... |
3 | 1 | 20 | 3 | 4 | 3 | 2 | 2 | 3 | 4 | 3 | ... | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 5 | 3 | Eustress (Positive Stress) - Stress that motiv... |
4 | 0 | 20 | 3 | 3 | 3 | 2 | 2 | 4 | 4 | 4 | ... | 2 | 3 | 1 | 2 | 2 | 4 | 2 | 2 | 2 | Eustress (Positive Stress) - Stress that motiv... |
5 rows × 26 columns
Summary
In Data Science you need to be willing and able to interact with LOTS of different data types, file types, and query types. For most of this class we will read the data in directly. DATA 211 Database Management will give you lots more tools for creating and interacting with SQL type databases.
Homework 3
Go to Kaggle Datasets: https://www.kaggle.com/datasets
Find a data set that you are interested in looking at. You are welcome to work together and choose a data set as a group! You should read in this data and do some basic statistics on the data set. Answer the following questions:
- How many variables and observations?
- What type of data is contained?
- Are there any NaNs or weird data types that you can see?
- Most Kaggle datasets contain some basic stats or visualizations. See if you can recreate some of the plots or data you see on the website.
- Come up with at least one question of your own that you can answer by analyzing the data.