Intermediate Data Science

Pandas - Advanced Topics

Author

Joanna Bieri
DATA201

Welcome to Intermediate Data Science!

Important Information

Review of DATA 101

In completing the review of DATA 101, you have already started working with Pandas! You actually know a lot about how Pandas works from our last class. The goal for today is to dig a bit deeper into the type of data that is stored in a Pandas DataFrame and to see some of the main capabilities of Pandas.

# Some basic package imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

Data Structures in Pandas

The fundamental data types in Pandas are:

  • Series: a one-dimensional object containing a sequence of values of the same type and an associated array of data labels.
  • DataFrame: a rectangular table of data that contains an ordered, named collection of columns, each of which can have a different data type. It has both a row and a column index and can be thought of as a dictionary of Series (see the short sketch below).
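
Here is a minimal sketch of that "dictionary of Series" idea (the column names and numbers are made up for illustration):

# A DataFrame built from a dictionary of Series (illustrative names/values)
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
sketch_df = pd.DataFrame({"col_one": s1, "col_two": s2})
print(type(sketch_df["col_one"]))  # each column is itself a Series
sketch_df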

Series Basics

example_series = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
example_series
d    4
b    7
a   -5
c    3
dtype: int64
# You can get the labels"
example_series.index
Index(['d', 'b', 'a', 'c'], dtype='object')
# Getting values from the series using the labels
print(example_series['a'])

# Getting multiple values
print(example_series[['a','b']])
-5
a   -5
b    7
dtype: int64
# You can grab all the data
example_series.array
<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64
# You can send it to a dictionary
example_series.to_dict()
{'d': 4, 'b': 7, 'a': -5, 'c': 3}
# You can filter the data and use boolean tests

# Show only positive values
example_series[example_series>0]
d    4
b    7
c    3
dtype: int64
# Check index for membership
'b' in example_series
True
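
Note that the in test checks the index labels, not the stored values. A quick sketch of checking the values instead (same series as above):

# 'in' tests the index, not the values
print(4 in example_series)          # False: 4 is not an index label
print(4 in example_series.values)   # True: 4 appears among the values
print((example_series == 4).any())  # another way to test for a value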
# If you have a dictionary you can send it into a series.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
example_series = pd.Series(sdata)
example_series
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

You Try:

Using Series methods and Python code:

  1. Get a list of the states that are the index.
  2. Check if Idaho is in the index.
  3. Get just the states with numbers less than 20,000.
  4. Run the provided code and explain the results. Why do we end up with NaN and what does NaN mean?
# Your code here 1
# Your code here 2
# Your code here 3
# Explain the results 4
states = ["California", "Ohio", "Oregon", "Texas"]
new_series = pd.Series(sdata, index=states)
new_series
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Enter your words here:

Series NaNs

new_series.isna()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
new_series.notna()
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
# Drop the NA data (this returns a new series; the original is unchanged):
new_series.dropna()
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64
# Notice that the original data is still in memory
new_series
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
# Drop the NA data from memory
new_series.dropna(inplace=True)
# Now California is no longer there!
new_series
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64
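
Dropping is not the only option; pandas can also fill in missing values. A minimal sketch (rebuilding a series that contains the NaN so there is something to fill; the fill value 0 is just for illustration):

# Rebuild a series that contains a missing value
demo_series = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Replace NaN with a chosen value instead of dropping the row
demo_series.fillna(0)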

DataFrames - Basics

Most of the time we will be working with DataFrames! Dictionaries work really well for defining new DataFrames. You can access information using the column labels/keys or the row indexes.

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
df.keys()
Index(['state', 'year', 'pop'], dtype='object')
# Notice that each column is a Series with the index being whatever the row labels are
print(type(df['year']))
df['year']
<class 'pandas.core.series.Series'>
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
# If your column labels are well-behaved (valid Python names) you can access the data as a dot attribute
df.year
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
# Getting multiple columns
columns = ['state','pop']
df[columns]
state pop
0 Ohio 1.5
1 Ohio 1.7
2 Ohio 3.6
3 Nevada 2.4
4 Nevada 2.9
5 Nevada 3.2

One of the first things you will want to do is see what your data looks like and get basic information about the data frame. This is especially true when you read in big data sets.

df.head() #first five rows
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
df.tail() #last five rows
state year pop
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
df.shape # get the number of (rows,columns)
(6, 3)
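
Two other quick checks are often useful here; this is a short sketch using standard DataFrame methods:

df.info()    # column names, non-null counts, and dtypes in one summary
df.dtypes    # just the data type of each column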

DataFrames - Data Manipulation

# We can add new columns! Just give the column a name and some data!
# Here I am using the numpy random sample command.

df['debt'] = np.random.random_sample(len(df))
df
state year pop debt
0 Ohio 2000 1.5 0.926616
1 Ohio 2001 1.7 0.247992
2 Ohio 2002 3.6 0.258766
3 Nevada 2001 2.4 0.916836
4 Nevada 2002 2.9 0.388147
5 Nevada 2003 3.2 0.302799
# We can delete a column
del df['debt']
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2

You Try

  1. How does the following command work?
  2. See if you can add a column that checks whether the year is greater than 2001.
# How does this work 1
df['eastern'] = df['state'] == 'Ohio'
df
state year pop eastern
0 Ohio 2000 1.5 True
1 Ohio 2001 1.7 True
2 Ohio 2002 3.6 True
3 Nevada 2001 2.4 False
4 Nevada 2002 2.9 False
5 Nevada 2003 3.2 False

Enter your words here:

# Your code here 2

DataFrames - Functionality

There are lots of things you will want to be able to do to make your data cleaner and more readable. This is just a short list of some of the most popular commands and operations.

# Reindexing updates the index (row labels)
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
df
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
# now notice that we are missing b!
df = df.reindex(index=["a", "b", "c", "d"])
df
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
# We can also transpose the data so the columns become the index
df = df.T
df
a b c d
Ohio 0.0 NaN 3.0 6.0
Texas 1.0 NaN 4.0 7.0
California 2.0 NaN 5.0 8.0
# Once you have your index set you can get the row information
# .loc locates the row by the index name
df.loc['Ohio']
a    0.0
b    NaN
c    3.0
d    6.0
Name: Ohio, dtype: float64
# .iloc uses an integer to find the location
df.iloc[0]
a    0.0
b    NaN
c    3.0
d    6.0
Name: Ohio, dtype: float64
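
Both .loc and .iloc also accept lists and slices, so you can grab several rows at once (a quick sketch with the frame above):

# Multiple rows by label or by position
df.loc[['Ohio', 'Texas']]   # label-based selection
df.iloc[0:2]                # position-based: the first two rows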
# You can drop rows
df = df.drop('Ohio')
df
a b c d
Texas 1.0 NaN 4.0 7.0
California 2.0 NaN 5.0 8.0
# You can drop columns
df = df.drop(columns='d')
df
a b c
Texas 1.0 NaN 4.0
California 2.0 NaN 5.0
# You can drop NaNs, but be careful: by default it drops every row that has a NaN
df = df.dropna()
df
a b c
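
If dropping every row that contains any NaN is too aggressive, dropna takes arguments that make it more targeted. A short sketch (the small frame is rebuilt here so there is something left to drop):

# Rebuild a small frame with a column of NaNs
df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan], "c": [4.0, 5.0]},
                  index=["Texas", "California"])

df.dropna(how="all")          # drop a row only if EVERY value in it is NaN
df.dropna(axis="columns")     # drop columns containing NaNs instead of rows
df.dropna(subset=["a", "c"])  # only look at these columns when deciding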

DataFrames - Filtering and Masking

df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
df
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
# Boolean comparison: are the entries greater than three?
df>3
Ohio Texas California
a False False False
c False True True
d True True True
# Return the DataFrame values where the condition is True (NaN elsewhere)
df[df>3]
Ohio Texas California
a NaN NaN NaN
c NaN 4.0 5.0
d 6.0 7.0 8.0
# Set values based on comparison
df[df<3] = 0
df
Ohio Texas California
a 0 0 0
c 3 4 5
d 6 7 8
# A mask returns only the rows where the condition is True
mask = df['Texas']>3
df[mask]
Ohio Texas California
c 3 4 5
d 6 7 8
# Get a specific row and column combination
df.loc['a',['Texas','California']]
Texas         0
California    0
Name: a, dtype: int64

You Try

  1. Get rows a and d for just Ohio and Texas.
# Your code here

DataFrames - Arithmetic

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4],"A": [5, 6]})
df1
A
0 1
1 2
df2
B A
0 3 5
1 4 6
# Basic operations are component-wise
df1*2
A
0 2
1 4
# Division works too
1/df1
A
0 1.0
1 0.5
# Adding two data frames works, but only on columns that have the same label
df1+df2
A B
0 6 NaN
1 8 NaN
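
If you would rather treat the missing labels as zero instead of getting NaN, the method form of the arithmetic takes a fill_value argument (a short sketch):

# Missing labels are filled with 0 before adding
df1.add(df2, fill_value=0)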

DataFrames - Functions

Often we want to apply a special function to parts of our DataFrame. We can do this by writing our own function and handing it to apply!

df = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
df
b d e
Utah -2.199103 0.205690 -1.908214
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
# Many simple functions can be applied directly
np.abs(df)
b d e
Utah 2.199103 0.205690 1.908214
Ohio 2.281752 2.050707 0.006077
Texas 0.417250 0.500769 0.658412
Oregon 0.213584 0.595468 0.681829
# We can define our own functions and apply them
def max_min(x):
    return x.max()-x.min()

# Now apply this to each of the columns
df.apply(max_min)
b    4.480856
d    2.256396
e    2.590043
dtype: float64
# Apply it to each of the rows
df.apply(max_min, axis='columns')
Utah      2.404793
Ohio      4.332459
Texas     1.075662
Oregon    1.277297
dtype: float64
# You can also use a lambda to do this - a quick way to define a one-time-use function.
df.apply(lambda x: x.max()-x.min())
b    4.480856
d    2.256396
e    2.590043
dtype: float64
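
apply works on whole rows or columns. For element-wise transformations, recent versions of pandas provide DataFrame.map (older versions call it applymap); a quick sketch:

# Element-wise: format every entry as a 2-decimal string
# (use df.applymap(...) instead on older pandas versions)
df.map(lambda x: f"{x:.2f}")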

DataFrames - Sorting and Grouping

df
b d e
Utah -2.199103 0.205690 -1.908214
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
df.sort_values(by='b', ascending=False)
b d e
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
Utah -2.199103 0.205690 -1.908214
# Multi-column sort
df.sort_values(by=['b','e'], ascending=False)
b d e
Ohio 2.281752 -2.050707 0.006077
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
Utah -2.199103 0.205690 -1.908214
# You can also sort by the index values
df.sort_index()
b d e
Ohio 2.281752 -2.050707 0.006077
Oregon -0.213584 -0.595468 0.681829
Texas 0.417250 -0.500769 -0.658412
Utah -2.199103 0.205690 -1.908214
df.sort_index(ascending=False)
b d e
Utah -2.199103 0.205690 -1.908214
Texas 0.417250 -0.500769 -0.658412
Oregon -0.213584 -0.595468 0.681829
Ohio 2.281752 -2.050707 0.006077
df = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
df['location'] = ['West', 'Midwest', 'South', 'West']
df
b d e location
Utah -0.284299 -2.377612 -0.466475 West
Ohio -0.333410 0.946558 0.844442 Midwest
Texas 0.171615 -0.677120 -0.176359 South
Oregon 1.135900 -0.358185 1.003728 West
# You can use groupby for categorical data
groups = df.groupby('location')
for g in groups:
    print(g[0])
    display(g[1])
Midwest
South
West
b d e location
Ohio -0.33341 0.946558 0.844442 Midwest
b d e location
Texas 0.171615 -0.67712 -0.176359 South
b d e location
Utah -0.284299 -2.377612 -0.466475 West
Oregon 1.135900 -0.358185 1.003728 West
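
Usually you do not loop over the groups by hand; instead you call an aggregation on the groupby object. A short sketch with the same frame:

# Average of each numeric column within each location
df.groupby('location').mean(numeric_only=True)

# Or aggregate a single column
df.groupby('location')['b'].mean()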

DataFrames - Quick Descriptive Statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# Reduction methods: min, max, sum, mean - require at least one non-NaN value
# It skips NaN values - and this can be misleading
df.sum(axis='index') # index is default
one    9.25
two   -5.80
dtype: float64
df.sum(axis='columns')
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
df.sum(axis='columns',skipna=False)
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
# Let's read in some data and look at some statistics
price = pd.read_pickle("data/yahoo_price.pkl")
volume = pd.read_pickle("data/yahoo_volume.pkl")
price.head()
AAPL GOOG IBM MSFT
Date
2010-01-04 27.990226 313.062468 113.304536 25.884104
2010-01-05 28.038618 311.683844 111.935822 25.892466
2010-01-06 27.592626 303.826685 111.208683 25.733566
2010-01-07 27.541619 296.753749 110.823732 25.465944
2010-01-08 27.724725 300.709808 111.935822 25.641571
volume.head()
AAPL GOOG IBM MSFT
Date
2010-01-04 123432400 3927000 6155300 38409100
2010-01-05 150476200 6031900 6841400 49749600
2010-01-06 138040000 7987100 5605300 58182400
2010-01-07 119282800 12876600 5840600 50559700
2010-01-08 111902700 9483900 4197200 51197400
# Let's apply a built-in time series function to this data
# This gives us the percent change in the price from one date to the next
returns = price.pct_change()
returns.head()
AAPL GOOG IBM MSFT
Date
2010-01-04 NaN NaN NaN NaN
2010-01-05 0.001729 -0.004404 -0.012080 0.000323
2010-01-06 -0.015906 -0.025209 -0.006496 -0.006137
2010-01-07 -0.001849 -0.023280 -0.003462 -0.010400
2010-01-08 0.006648 0.013331 0.010035 0.006897
returns.describe()
AAPL GOOG IBM MSFT
count 1713.000000 1713.000000 1713.000000 1713.000000
mean 0.000972 0.000671 0.000236 0.000595
std 0.016641 0.015830 0.012102 0.014667
min -0.123558 -0.083775 -0.082790 -0.113995
25% -0.007516 -0.006904 -0.006049 -0.007376
50% 0.000886 0.000270 0.000234 0.000312
75% 0.010422 0.008462 0.006806 0.008162
max 0.088741 0.160524 0.056652 0.104522
# Is the percent change in AAPL correlated with IBM?
returns['AAPL'].corr(returns['IBM'])
np.float64(0.3868174361139099)
# We might ask if any of these returns are correlated or which are most highly correlated
returns.corr()
AAPL GOOG IBM MSFT
AAPL 1.000000 0.407919 0.386817 0.389695
GOOG 0.407919 1.000000 0.405099 0.465919
IBM 0.386817 0.405099 1.000000 0.499764
MSFT 0.389695 0.465919 0.499764 1.000000
# We might want to look at a covariance matrix
returns.cov()
AAPL GOOG IBM MSFT
AAPL 0.000277 0.000107 0.000078 0.000095
GOOG 0.000107 0.000251 0.000078 0.000108
IBM 0.000078 0.000078 0.000146 0.000089
MSFT 0.000095 0.000108 0.000089 0.000215
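
corrwith computes pairwise correlations with another Series or DataFrame, matched up by label. A short sketch with the same data:

# Correlation of each column of returns with the IBM returns
returns.corrwith(returns['IBM'])

# Column-by-column correlation of returns with trading volume
returns.corrwith(volume)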

THERE ARE LOTS OF FUNCTIONS PROVIDED! TRY . [tab]!

DataFrames - Value Counts and Membership

df = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
df
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
# Quickly count observations
df['Qu1'].value_counts().sort_index()
Qu1
1    1
3    2
4    2
Name: count, dtype: int64
# Look for Unique values
df['Qu1'].unique()
array([1, 3, 4])
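
For the membership half of this section, isin tests whether each entry belongs to a set of values; a quick sketch:

# True wherever the value is 1 or 3
mask = df['Qu1'].isin([1, 3])
print(mask)

# Use the mask to filter rows
df[mask]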

Day 2 - Homework - Quick analysis!

Your goal is to do a quick analysis of the Titanic data! You can answer any questions that you find interesting, but here are some things to start with:

  1. How many variables and observations? Which are Numerical/Categorical?
  2. Do any of the columns have NaNs in them?
  3. How many passengers survived?
  4. Is survival correlated with Fare?
  5. How many passengers were alone vs. traveling with family?
  6. Were people traveling alone more or less likely to survive?

and so on… see if you can come up with some questions of your own!

# !conda install -y kagglehub
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")

print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.8), please consider upgrading to the latest version (0.3.13).
Path to dataset files: /home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1

Variable Notes:

  • PassengerId: Unique ID of the passenger
  • Survived: Survived (1) or died (0)
  • Pclass: Passenger's class (1st, 2nd, or 3rd)
  • Name: Passenger's name
  • Sex: Passenger's sex
  • Age: Passenger's age
  • SibSp: Number of siblings/spouses aboard the Titanic
  • Parch: Number of parents/children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Fare paid for ticket
  • Cabin: Cabin number
  • Embarked: Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)

file = '/home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1/Titanic-Dataset.csv'
df = pd.read_csv(file)
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns