Intermediate Data Science

Pandas - Advanced Topics

Author

Joanna Bieri
DATA201

Welcome to Intermediate Data Science!

Important Information

Review of DATA 101

In completing the review of DATA 101, you have already started working with Pandas! You actually know a lot about how Pandas works from our last class. The goal for today is to dig a bit deeper into the type of data that is stored in a Pandas DataFrame and to see some of the main capabilities of Pandas.

# Some basic package imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

Data Structures in Pandas

The fundamental data types in Pandas are:

  • Series: a one-dimensional object containing a sequence of values of the same type and an associated array of data labels.
  • DataFrame: a rectangular table of data that contains an ordered, named collection of columns, each of which can have a different data type. It has both a row and a column index and can be thought of as a dictionary of Series (see the short sketch below).
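
Here is a minimal sketch of that "dictionary of Series" idea (the column names and numbers are made up for illustration):

# A DataFrame built from a dictionary of Series (illustrative names/values)
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
sketch_df = pd.DataFrame({"col_one": s1, "col_two": s2})
print(type(sketch_df["col_one"]))  # each column is itself a Series
sketch_df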

Series Basics

example_series = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
example_series
d    4
b    7
a   -5
c    3
dtype: int64
# You can get the labels"
example_series.index
Index(['d', 'b', 'a', 'c'], dtype='object')
# Getting values from the series using the labels
print(example_series['a'])

# Getting multiple values
print(example_series[['a','b']])
-5
a   -5
b    7
dtype: int64
# You can grab all the data
example_series.array
<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64
# You can send it to a dictionary
example_series.to_dict()
{'d': 4, 'b': 7, 'a': -5, 'c': 3}
# You can filter the data and use boolean tests

# Show only positive values
example_series[example_series>0]
d    4
b    7
c    3
dtype: int64
# Check index for membership
'b' in example_series
True
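
Note that the in test checks the index labels, not the stored values. A quick sketch of checking the values instead (same series as above):

# 'in' tests the index, not the values
print(4 in example_series)          # False: 4 is not an index label
print(4 in example_series.values)   # True: 4 appears among the values
print((example_series == 4).any())  # another way to test for a value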
# If you have a dictionary you can send it into a series.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
example_series = pd.Series(sdata)
example_series
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

You Try:

Using Series methods and Python code:

  1. Get a list of the states that are the index.
  2. Check if Idaho is in the index.
  3. Get just the states with numbers less than 20,000.
  4. Run the provided code and explain the results. Why do we end up with NaN and what does NaN mean?
# Your code here 1
# Your code here 2
# Your code here 3
# Explain the results 4
states = ["California", "Ohio", "Oregon", "Texas"]
new_series = pd.Series(sdata, index=states)
new_series
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Enter your words here:

Series NaNs

new_series.isna()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
new_series.notna()
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
# Drop the NA data (this returns a new series; the original is unchanged):
new_series.dropna()
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64
# Notice that the original data is still in memory
new_series
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
# Drop the NA data from memory
new_series.dropna(inplace=True)
# Now California is no longer there!
new_series
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64
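
Dropping is not the only option; pandas can also fill in missing values. A minimal sketch (rebuilding a series that contains the NaN so there is something to fill; the fill value 0 is just for illustration):

# Rebuild a series that contains a missing value
demo_series = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Replace NaN with a chosen value instead of dropping the row
demo_series.fillna(0)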

DataFrames - Basics

Most of the time we will be working with DataFrames! Dictionaries work really well for defining new DataFrames. You can access information using the column labels/keys or the row indexes.

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
df.keys()
Index(['state', 'year', 'pop'], dtype='object')
# Notice that each column is a Series with the index being whatever the row labels are
print(type(df['year']))
df['year']
<class 'pandas.core.series.Series'>
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
# If your column labels are well-behaved (valid Python names) you can access the data as a dot attribute
df.year
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
# Getting multiple columns
columns = ['state','pop']
df[columns]
state pop
0 Ohio 1.5
1 Ohio 1.7
2 Ohio 3.6
3 Nevada 2.4
4 Nevada 2.9
5 Nevada 3.2

One of the first things you will want to do is see what your data looks like and get basic information about the data frame. This is especially true when you read in big data sets.

df.head() #first five rows
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
df.tail() #last five rows
state year pop
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
df.shape # get the number of (rows,columns)
(6, 3)
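
Two other quick checks are often useful here; this is a short sketch using standard DataFrame methods:

df.info()    # column names, non-null counts, and dtypes in one summary
df.dtypes    # just the data type of each column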

DataFrames - Data Manipulation

# We can add new columns! Just give the column a name and some data!
# Here I am using the numpy random sample command.

df['debt'] = np.random.random_sample(len(df))
df
state year pop debt
0 Ohio 2000 1.5 0.926616
1 Ohio 2001 1.7 0.247992
2 Ohio 2002 3.6 0.258766
3 Nevada 2001 2.4 0.916836
4 Nevada 2002 2.9 0.388147
5 Nevada 2003 3.2 0.302799
# We can delete a column
del df['debt']
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2

You Try

  1. How does the following command work?
  2. See if you can add a column that checks whether the year is greater than 2001.
# How does this work 1
df['eastern'] = df['state'] == 'Ohio'
df
state year pop eastern
0 Ohio 2000 1.5 True
1 Ohio 2001 1.7 True
2 Ohio 2002 3.6 True
3 Nevada 2001 2.4 False
4 Nevada 2002 2.9 False
5 Nevada 2003 3.2 False

Enter your words here:

# Your code here 2

DataFrames - Functionality

There are lots of things you will want to be able to do to make your data cleaner and more readable. This is just a short list of some of the most popular commands and operations.

# Reindexing updates the index (row labels)
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
df
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
# now notice that we are missing b!
df = df.reindex(index=["a", "b", "c", "d"])
df
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
# We can also transpose the data so the columns become the index
df = df.T
df
a b c d
Ohio 0.0 NaN 3.0 6.0
Texas 1.0 NaN 4.0 7.0
California 2.0 NaN 5.0 8.0
# Once you have your index set you can get the row information
# .loc locates the row by the index name
df.loc['Ohio']
a    0.0
b    NaN
c    3.0
d    6.0
Name: Ohio, dtype: float64
# .iloc uses an integer to find the location
df.iloc[0]
a    0.0
b    NaN
c    3.0
d    6.0
Name: Ohio, dtype: float64
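
Both .loc and .iloc also accept lists and slices, so you can grab several rows at once (a quick sketch with the frame above):

# Multiple rows by label or by position
df.loc[['Ohio', 'Texas']]   # label-based selection
df.iloc[0:2]                # position-based: the first two rows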
# You can drop rows
df = df.drop('Ohio')
df
a b c d
Texas 1.0 NaN 4.0 7.0
California 2.0 NaN 5.0 8.0
# You can drop columns
df = df.drop(columns='d')
df
a b c
Texas 1.0 NaN 4.0
California 2.0 NaN 5.0
# You can drop NaNs, but be careful: by default it drops every row that has a NaN
df = df.dropna()
df
a b c
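
If dropping every row that contains any NaN is too aggressive, dropna takes arguments that make it more targeted. A short sketch (the small frame is rebuilt here so there is something left to drop):

# Rebuild a small frame with a column of NaNs
df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan], "c": [4.0, 5.0]},
                  index=["Texas", "California"])

df.dropna(how="all")          # drop a row only if EVERY value in it is NaN
df.dropna(axis="columns")     # drop columns containing NaNs instead of rows
df.dropna(subset=["a", "c"])  # only look at these columns when deciding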

DataFrames - Filtering and Masking

df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
df
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
# Boolean comparison: are the entries greater than three?
df>3
Ohio Texas California
a False False False
c False True True
d True True True
# Return the DataFrame values where the condition is True (NaN elsewhere)
df[df>3]
Ohio Texas California
a NaN NaN NaN
c NaN 4.0 5.0
d 6.0 7.0 8.0
# Set values based on comparison
df[df<3] = 0
df
Ohio Texas California
a 0 0 0
c 3 4 5
d 6 7 8
# A mask returns only the rows where the condition is True
mask = df['Texas']>3
df[mask]
Ohio Texas California
c 3 4 5
d 6 7 8
# Get a specific row and column combination
df.loc['a',['Texas','California']]
Texas         0
California    0
Name: a, dtype: int64

You Try

  1. Get rows a and d for just Ohio and Texas.
# Your code here

DataFrames - Arithmetic

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4],"A": [5, 6]})
df1
A
0 1
1 2
df2
B A
0 3 5
1 4 6
# Basic operations are component-wise
df1*2
A
0 2
1 4
# Division works too
1/df1
A
0 1.0
1 0.5
# Adding two data frames works, but only on columns that have the same label
df1+df2
A B
0 6 NaN
1 8 NaN
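
If you would rather treat the missing labels as zero instead of getting NaN, the method form of the arithmetic takes a fill_value argument (a short sketch):

# Missing labels are filled with 0 before adding
df1.add(df2, fill_value=0)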

DataFrames - Functions

Often we want to apply a special function to parts of our DataFrame. We can do this by writing our own function and handing it to apply!

df = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
df
b d e
Utah -2.199103 0.205690 -1.908214
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
# Many simple functions can be applied directly
np.abs(df)
b d e
Utah 2.199103 0.205690 1.908214
Ohio 2.281752 2.050707 0.006077
Texas 0.417250 0.500769 0.658412
Oregon 0.213584 0.595468 0.681829
# We can define our own functions and apply them
def max_min(x):
    return x.max()-x.min()

# Now apply this to each of the columns
df.apply(max_min)
b    4.480856
d    2.256396
e    2.590043
dtype: float64
# Apply it to each of the rows
df.apply(max_min, axis='columns')
Utah      2.404793
Ohio      4.332459
Texas     1.075662
Oregon    1.277297
dtype: float64
# You can also use a lambda to do this - a quick way to define a one-time-use function.
df.apply(lambda x: x.max()-x.min())
b    4.480856
d    2.256396
e    2.590043
dtype: float64
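
apply works on whole rows or columns. For element-wise transformations, recent versions of pandas provide DataFrame.map (older versions call it applymap); a quick sketch:

# Element-wise: format every entry as a 2-decimal string
# (use df.applymap(...) instead on older pandas versions)
df.map(lambda x: f"{x:.2f}")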

DataFrames - Sorting and Grouping

df
b d e
Utah -2.199103 0.205690 -1.908214
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
df.sort_values(by='b', ascending=False)
b d e
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
Utah -2.199103 0.205690 -1.908214
# Multi-column sort
df.sort_values(by=['b','e'], ascending=False)
b d e
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
Utah -2.199103 0.205690 -1.908214
# You can also sort by the index values
df.sort_index()
b d e
Ohio 2.281752 -2.050707 0.006077
Oregon -0.213584 -0.595468 0.681829
Texas 0.417250 -0.500769 -0.658412
Utah -2.199103 0.205690 -1.908214
df.sort_index(ascending=False)
b d e
Utah -2.199103 0.205690 -1.908214
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
Ohio 2.281752 -2.050707 0.006077
df = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
df['location'] = ['West', 'Midwest', 'South', 'West']
df
b d e location
Utah -0.284299 -2.377612 -0.466475 West
Ohio -0.333410 0.946558 0.844442 Midwest
Texas 0.171615 -0.677120 -0.176359 South
Oregon 1.135900 -0.358185 1.003728 West
# You can use groupby for categorical data
groups = df.groupby('location')
for g in groups:
    print(g[0])
    display(g[1])
Midwest
South
West
b d e location
Ohio -0.33341 0.946558 0.844442 Midwest
b d e location
Texas 0.171615 -0.67712 -0.176359 South
b d e location
Utah -0.284299 -2.377612 -0.466475 West
Oregon 1.135900 -0.358185 1.003728 West
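
Usually you do not loop over the groups by hand; instead you call an aggregation on the groupby object. A short sketch with the same frame:

# Average of each numeric column within each location
df.groupby('location').mean(numeric_only=True)

# Or aggregate a single column
df.groupby('location')['b'].mean()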

DataFrames - Quick Descriptive Statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# Reduction methods: min, max, sum, mean - require at least one non-NaN value
# It skips NaN values - and this can be misleading
df.sum(axis='index') # index is default
one    9.25
two   -5.80
dtype: float64
df.sum(axis='columns')
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
df.sum(axis='columns',skipna=False)
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
# Let's read in some data and look at some statistics
price = pd.read_pickle("data/yahoo_price.pkl")
volume = pd.read_pickle("data/yahoo_volume.pkl")
price.head()
AAPL GOOG IBM MSFT
Date
2010-01-04 27.990226 313.062468 113.304536 25.884104
2010-01-05 28.038618 311.683844 111.935822 25.892466
2010-01-06 27.592626 303.826685 111.208683 25.733566
2010-01-07 27.541619 296.753749 110.823732 25.465944
2010-01-08 27.724725 300.709808 111.935822 25.641571
volume.head()
AAPL GOOG IBM MSFT
Date
2010-01-04 123432400 3927000 6155300 38409100
2010-01-05 150476200 6031900 6841400 49749600
2010-01-06 138040000 7987100 5605300 58182400
2010-01-07 119282800 12876600 5840600 50559700
2010-01-08 111902700 9483900 4197200 51197400
# Let's apply a built-in time series function to this data
# This gives us the percent change in the price from one date to the next
returns = price.pct_change()
returns.head()
AAPL GOOG IBM MSFT
Date
2010-01-04 NaN NaN NaN NaN
2010-01-05 0.001729 -0.004404 -0.012080 0.000323
2010-01-06 -0.015906 -0.025209 -0.006496 -0.006137
2010-01-07 -0.001849 -0.023280 -0.003462 -0.010400
2010-01-08 0.006648 0.013331 0.010035 0.006897
returns.describe()
AAPL GOOG IBM MSFT
count 1713.000000 1713.000000 1713.000000 1713.000000
mean 0.000972 0.000671 0.000236 0.000595
std 0.016641 0.015830 0.012102 0.014667
min -0.123558 -0.083775 -0.082790 -0.113995
25% -0.007516 -0.006904 -0.006049 -0.007376
50% 0.000886 0.000270 0.000234 0.000312
75% 0.010422 0.008462 0.006806 0.008162
max 0.088741 0.160524 0.056652 0.104522
# Is the percent change in AAPL correlated with IBM?
returns['AAPL'].corr(returns['IBM'])
np.float64(0.3868174361139099)
# We might ask if any of these returns are correlated or which are most highly correlated
returns.corr()
AAPL GOOG IBM MSFT
AAPL 1.000000 0.407919 0.386817 0.389695
GOOG 0.407919 1.000000 0.405099 0.465919
IBM 0.386817 0.405099 1.000000 0.499764
MSFT 0.389695 0.465919 0.499764 1.000000
# We might want to look at a covariance matrix
returns.cov()
AAPL GOOG IBM MSFT
AAPL 0.000277 0.000107 0.000078 0.000095
GOOG 0.000107 0.000251 0.000078 0.000108
IBM 0.000078 0.000078 0.000146 0.000089
MSFT 0.000095 0.000108 0.000089 0.000215
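
corrwith computes pairwise correlations with another Series or DataFrame, matched up by label. A short sketch with the same data:

# Correlation of each column of returns with the IBM returns
returns.corrwith(returns['IBM'])

# Column-by-column correlation of returns with trading volume
returns.corrwith(volume)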

THERE ARE LOTS OF FUNCTIONS PROVIDED! TRY . [tab]!

DataFrames - Value Counts and Membership

df = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
df
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
# Quickly count observations
df['Qu1'].value_counts().sort_index()
Qu1
1    1
3    2
4    2
Name: count, dtype: int64
# Look for Unique values
df['Qu1'].unique()
array([1, 3, 4])
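
For the membership half of this section, isin tests whether each entry belongs to a set of values; a quick sketch:

# True wherever the value is 1 or 3
mask = df['Qu1'].isin([1, 3])
print(mask)

# Use the mask to filter rows
df[mask]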

Day 2 - Homework - Quick analysis!

Your goal is to do a quick analysis of the Titanic data! You can answer any questions that you find interesting, but here are some things to start with:

  1. How many variables and observations? Which are Numerical/Categorical?
  2. Do any of the columns have NaNs in them?
  3. How many passengers survived?
  4. Is survival correlated with Fare?
  5. How many passengers were alone vs. traveling with family?
  6. Were people traveling alone more or less likely to survive?

and so on… see if you can come up with some questions of your own!

# !conda install -y kagglehub
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")

print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.8), please consider upgrading to the latest version (0.3.13).
Path to dataset files: /home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1

Variable Notes:

  • PassengerId: Unique ID of the passenger
  • Survived: Survived (1) or died (0)
  • Pclass: Passenger's class (1st, 2nd, or 3rd)
  • Name: Passenger's name
  • Sex: Passenger's sex
  • Age: Passenger's age
  • SibSp: Number of siblings/spouses aboard the Titanic
  • Parch: Number of parents/children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Fare paid for ticket
  • Cabin: Cabin number
  • Embarked: Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)

file = '/home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1/Titanic-Dataset.csv'
df = pd.read_csv(file)
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns