# Some basic package imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
Intermediate Data Science
Pandas - Advanced Topics
Welcome to Intermediate Data Science!
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours take place in Duke 209 – Office Hours Schedule
- Class Website
- Syllabus
Review of DATA 101
In completing the review of DATA 101 you have already started working with Pandas! You actually know a lot about how Pandas works from our last class. The goal for today is to dig a bit deeper into the type of data that is stored in a Pandas DataFrame and to see some of the main capabilities of Pandas.
Data Structures in Pandas
The fundamental data types in Pandas are:
- Series: a one-dimensional object containing a sequence of values of the same type and an associated array of data labels.
- DataFrame: a rectangular table of data that contains an ordered, named collection of columns, each of which can have a different data type. It has both a row and a column index, and it can be thought of as a dictionary of Series (see the short sketch below).
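To make the "dictionary of Series" picture concrete, here is a minimal sketch (not part of the lecture code; the names col_a, col_b, and small_df are just placeholders): two Series that share an index become the columns of a DataFrame.
# Sketch: a DataFrame built from a dictionary of Series that share an index
import pandas as pd
col_a = pd.Series([1, 2, 3], index=["x", "y", "z"])
col_b = pd.Series([0.1, 0.2, 0.3], index=["x", "y", "z"])
small_df = pd.DataFrame({"a": col_a, "b": col_b})
print(type(small_df["a"]))  # each column is itself a Series
small_df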
Series Basics
example_series = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
example_series
d 4
b 7
a -5
c 3
dtype: int64
# You can get the labels
example_series.index
Index(['d', 'b', 'a', 'c'], dtype='object')
# Getting values from the series using the labels
print(example_series['a'])
# Getting multiple values
print(example_series[['a','b']])
-5
a -5
b 7
dtype: int64
# You can grab all the data
example_series.array
<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64
# You can send it to a dictionary
example_series.to_dict()
{'d': 4, 'b': 7, 'a': -5, 'c': 3}
# You can filter the data and use boolean tests
# Show only positive values
example_series[example_series > 0]
d 4
b 7
c 3
dtype: int64
# Check index for membership
'b' in example_series
True
# If you have a dictionary you can send it into a series.
= {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
sdata = pd.Series(sdata)
example_series example_series
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
You Try:
Using series methods and python code:
- Get a list of the states that are the index.
- Check if Idaho is in the index.
- Get just the states with numbers less than 20,000.
- Run the provided code and explain the results. Why do we end up with NaN and what does NaN mean?
# Your code here 1
# Your code here 2
# Your code here 3
# Explain the results 4
= ["California", "Ohio", "Oregon", "Texas"]
states = pd.Series(sdata, index=states)
new_series new_series
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
Enter your words here:
Series NaNs
new_series.isna()
California True
Ohio False
Oregon False
Texas False
dtype: bool
new_series.notna()
California False
Ohio True
Oregon True
Texas True
dtype: bool
# Drop the NA data one time:
new_series.dropna()
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# Notice that the original data is still in memory
new_series
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# Drop the NA data from memory
new_series.dropna(inplace=True)
# Now California is no longer there!
new_series
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
DataFrames - Basics
Most of the time we will be working with DataFrames! Dictionaries work really well for defining new DataFrames. You can access information using the column labels/keys or the row indexes.
= {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
data "year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
= pd.DataFrame(data)
df df
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
df.keys()
Index(['state', 'year', 'pop'], dtype='object')
# Notice that each column is a Series with the index being whatever the row labels are
print(type(df['year']))
df['year']
<class 'pandas.core.series.Series'>
0 2000
1 2001
2 2002
3 2001
4 2002
5 2003
Name: year, dtype: int64
# If your column labels are well behaved you can access the data as a dot attribute
df.year
0 2000
1 2001
2 2002
3 2001
4 2002
5 2003
Name: year, dtype: int64
# Getting multiple columns
columns = ['state', 'pop']
df[columns]
state | pop | |
---|---|---|
0 | Ohio | 1.5 |
1 | Ohio | 1.7 |
2 | Ohio | 3.6 |
3 | Nevada | 2.4 |
4 | Nevada | 2.9 |
5 | Nevada | 3.2 |
One of the first things you will want to do is see what your data looks like and get basic information about the data frame. This is especially true when you read in big data sets.
# first five rows
df.head()
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
# last five rows
df.tail()
state | year | pop | |
---|---|---|---|
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
# get the number of (rows, columns)
df.shape
(6, 3)
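Beyond head, tail, and shape, df.info() and df.dtypes give a quick structural summary; this is a small extra sketch (output not shown here), not part of the original cells.
# Extra sketch: column names, non-null counts, and dtypes in one call
df.info()
# Or just the data type of each column
df.dtypes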
DataFrames - Data Manipulation
# We can add new columns! Just give the column a name and some data!
# Here I am using the numpy random sample command.
df['debt'] = np.random.random_sample(len(df))
df
state | year | pop | debt | |
---|---|---|---|---|
0 | Ohio | 2000 | 1.5 | 0.926616 |
1 | Ohio | 2001 | 1.7 | 0.247992 |
2 | Ohio | 2002 | 3.6 | 0.258766 |
3 | Nevada | 2001 | 2.4 | 0.916836 |
4 | Nevada | 2002 | 2.9 | 0.388147 |
5 | Nevada | 2003 | 3.2 | 0.302799 |
# We can delete a column
del df['debt']
df
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
You Try
- How does the following command work?
- See if you can add a column that checks whether the year is greater than 2001.
# How does this work 1
df['eastern'] = df['state'] == 'Ohio'
df
state | year | pop | eastern | |
---|---|---|---|---|
0 | Ohio | 2000 | 1.5 | True |
1 | Ohio | 2001 | 1.7 | True |
2 | Ohio | 2002 | 3.6 | True |
3 | Nevada | 2001 | 2.4 | False |
4 | Nevada | 2002 | 2.9 | False |
5 | Nevada | 2003 | 3.2 | False |
Enter your words here:
# Your code here 2
DataFrames - Functionality
There are lots of things you will want to be able to do to make your data cleaner and more readable. This is just a short list of some of the most popular commands and operations.
# Reindexing updates the index (row labels)
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                  index=["a", "c", "d"],
                  columns=["Ohio", "Texas", "California"])
df
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
# now notice that we are missing b!
df = df.reindex(index=["a", "b", "c", "d"])
df
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
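If you would rather not introduce NaNs for the new row, reindex also accepts a fill_value. This is an optional sketch (the name df_filled is just illustrative), assuming the numpy/pandas imports from the top of the notebook:
# Optional sketch: reindex with fill_value so the new row "b" is 0 instead of NaN
df_filled = pd.DataFrame(np.arange(9).reshape((3, 3)),
                         index=["a", "c", "d"],
                         columns=["Ohio", "Texas", "California"])
df_filled.reindex(index=["a", "b", "c", "d"], fill_value=0)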
# We can also transpose the data so the columns become the index
df = df.T
df
a | b | c | d | |
---|---|---|---|---|
Ohio | 0.0 | NaN | 3.0 | 6.0 |
Texas | 1.0 | NaN | 4.0 | 7.0 |
California | 2.0 | NaN | 5.0 | 8.0 |
# Once you have your index set you can get the row information
# .loc locates the row by the index name
df.loc['Ohio']
a 0.0
b NaN
c 3.0
d 6.0
Name: Ohio, dtype: float64
# .iloc uses an integer to find the location
df.iloc[0]
a 0.0
b NaN
c 3.0
d 6.0
Name: Ohio, dtype: float64
# You can drop rows
df = df.drop('Ohio')
df
a | b | c | d | |
---|---|---|---|---|
Texas | 1.0 | NaN | 4.0 | 7.0 |
California | 2.0 | NaN | 5.0 | 8.0 |
# You can drop columns
df = df.drop(columns='d')
df
a | b | c | |
---|---|---|---|
Texas | 1.0 | NaN | 4.0 |
California | 2.0 | NaN | 5.0 |
# You can drop NaNs, but be careful: it drops every row that has a NaN
df = df.dropna()
df
Empty DataFrame
Columns: [a, b, c]
Index: []
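Because the blanket dropna() above removed every row, it is worth knowing two gentler options: how='all' only drops rows where every value is NaN, and fillna replaces the NaNs instead of dropping anything. A small sketch with a throwaway frame (df_nan is just a placeholder name):
# Sketch: safer alternatives to a blanket dropna()
df_nan = pd.DataFrame({"one": [1.0, np.nan, np.nan],
                       "two": [2.0, 3.0, np.nan]})
df_nan.dropna(how="all")   # drop a row only if ALL of its values are NaN
df_nan.fillna(0)           # or keep every row and fill the NaNs with 0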
DataFrames - Filtering and Masking
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                  index=["a", "c", "d"],
                  columns=["Ohio", "Texas", "California"])
df
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
# Boolean comparison: are the entries greater than three?
df > 3
Ohio | Texas | California | |
---|---|---|---|
a | False | False | False |
c | False | True | True |
d | True | True | True |
# Return the DataFrame values where the condition is True (everything else becomes NaN)
df[df > 3]
Ohio | Texas | California | |
---|---|---|---|
a | NaN | NaN | NaN |
c | NaN | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
# Set values based on comparison
df[df < 3] = 0
df
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 0 | 0 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
# Mask: return only the rows where the condition is True
mask = df['Texas'] > 3
df[mask]
Ohio | Texas | California | |
---|---|---|---|
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
# Get a specific row and column combination
df.loc['a', ['Texas', 'California']]
Texas 0
California 0
Name: a, dtype: int64
You Try
- Get rows a and d for just Ohio and Texas.
# Your code here
DataFrames - Arithmetic
= pd.DataFrame({"A": [1, 2]})
df1 = pd.DataFrame({"B": [3, 4],"A": [5, 6]}) df2
df1
A | |
---|---|
0 | 1 |
1 | 2 |
df2
B | A | |
---|---|---|
0 | 3 | 5 |
1 | 4 | 6 |
# Basic operations are component-wise
df1 * 2
A | |
---|---|
0 | 2 |
1 | 4 |
# Division works too
1/df1
A | |
---|---|
0 | 1.0 |
1 | 0.5 |
# Adding two data frames works, but only on columns that have the same label
df1 + df2
A | B | |
---|---|---|
0 | 6 | NaN |
1 | 8 | NaN |
DataFrames - Functions
Often we want to apply our own function to parts of our DataFrame. We can do this with apply!
df = pd.DataFrame(np.random.standard_normal((4, 3)),
                  columns=list("bde"),
                  index=["Utah", "Ohio", "Texas", "Oregon"])
df
b | d | e | |
---|---|---|---|
Utah | -2.199103 | 0.205690 | -1.908214 |
Ohio | 2.281752 | -2.050707 | 0.006077 |
Texas | 0.417250 | -0.500769 | -0.658412 |
Oregon | -0.213584 | -0.595468 | 0.681829 |
# Many simple functions can be applied directly
np.abs(df)
b | d | e | |
---|---|---|---|
Utah | 2.199103 | 0.205690 | 1.908214 |
Ohio | 2.281752 | 2.050707 | 0.006077 |
Texas | 0.417250 | 0.500769 | 0.658412 |
Oregon | 0.213584 | 0.595468 | 0.681829 |
# We can define our own functions and apply them
def max_min(x):
    return x.max() - x.min()
# Now apply this to each of the columns
df.apply(max_min)
b 4.480856
d 2.256396
e 2.590043
dtype: float64
# Apply it to each of the rows
df.apply(max_min, axis='columns')
Utah 2.404793
Ohio 4.332459
Texas 1.075662
Oregon 1.277297
dtype: float64
# You can also use a lambda to do this - a quick way to define a one-time-use function.
df.apply(lambda x: x.max() - x.min())
b 4.480856
d 2.256396
e 2.590043
dtype: float64
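apply passes whole columns (or rows) to the function. If you want to transform every element individually, applymap does that (it is renamed DataFrame.map in newer pandas releases); here is a small sketch using the same df, not part of the original cells.
# Sketch: element-wise transformation, e.g. format every number to 2 decimals
df.applymap(lambda x: f"{x:.2f}")  # on pandas >= 2.1 you can use df.map(...)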
DataFrames - Sorting and Grouping
df
b | d | e | |
---|---|---|---|
Utah | -2.199103 | 0.205690 | -1.908214 |
Ohio | 2.281752 | -2.050707 | 0.006077 |
Texas | 0.417250 | -0.500769 | -0.658412 |
Oregon | -0.213584 | -0.595468 | 0.681829 |
df.sort_values(by='b', ascending=False)
b | d | e | |
---|---|---|---|
Ohio | 2.281752 | -2.050707 | 0.006077 |
Texas | 0.417250 | -0.500769 | -0.658412 |
Oregon | -0.213584 | -0.595468 | 0.681829 |
Utah | -2.199103 | 0.205690 | -1.908214 |
# Multi Sort
df.sort_values(by=['b', 'e'], ascending=False)
b | d | e | |
---|---|---|---|
Ohio | 2.281752 | -2.050707 | 0.006077 |
Texas | 0.417250 | -0.500769 | -0.658412 |
Oregon | -0.213584 | -0.595468 | 0.681829 |
Utah | -2.199103 | 0.205690 | -1.908214 |
# You can also sort by the index values
df.sort_index()
b | d | e | |
---|---|---|---|
Ohio | 2.281752 | -2.050707 | 0.006077 |
Oregon | -0.213584 | -0.595468 | 0.681829 |
Texas | 0.417250 | -0.500769 | -0.658412 |
Utah | -2.199103 | 0.205690 | -1.908214 |
df.sort_index(ascending=False)
b | d | e | |
---|---|---|---|
Utah | -2.199103 | 0.205690 | -1.908214 |
Texas | 0.417250 | -0.500769 | -0.658412 |
Oregon | -0.213584 | -0.595468 | 0.681829 |
Ohio | 2.281752 | -2.050707 | 0.006077 |
df = pd.DataFrame(np.random.standard_normal((4, 3)),
                  columns=list("bde"),
                  index=["Utah", "Ohio", "Texas", "Oregon"])
df['location'] = ['West', 'Midwest', 'South', 'West']
df
b | d | e | location | |
---|---|---|---|---|
Utah | -0.284299 | -2.377612 | -0.466475 | West |
Ohio | -0.333410 | 0.946558 | 0.844442 | Midwest |
Texas | 0.171615 | -0.677120 | -0.176359 | South |
Oregon | 1.135900 | -0.358185 | 1.003728 | West |
# You can use groupby for categorical data
groups = df.groupby('location')
for g in groups:
    print(g[0])
    display(g[1])
Midwest
South
West
b | d | e | location | |
---|---|---|---|---|
Ohio | -0.33341 | 0.946558 | 0.844442 | Midwest |
b | d | e | location | |
---|---|---|---|---|
Texas | 0.171615 | -0.67712 | -0.176359 | South |
b | d | e | location | |
---|---|---|---|---|
Utah | -0.284299 | -2.377612 | -0.466475 | West |
Oregon | 1.135900 | -0.358185 | 1.003728 | West |
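Looping over the groups is useful for inspection, but groupby is usually followed by an aggregation. A short sketch using the same df (the choice of mean is just illustrative):
# Sketch: average of the numeric columns within each location group
df.groupby('location')[['b', 'd', 'e']].mean()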
DataFrames - Quick Descriptive Statistics
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
# Reduction methods (min, max, sum, mean) require at least one non-NaN value
# They skip NaN values - and this can be misleading
df.sum(axis='index')  # index is the default
one 9.25
two -5.80
dtype: float64
df.sum(axis='columns')
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.sum(axis='columns', skipna=False)
a NaN
b 2.60
c NaN
d -0.55
dtype: float64
df.describe()
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
# Let's read in some data and look at some statistics
price = pd.read_pickle("data/yahoo_price.pkl")
volume = pd.read_pickle("data/yahoo_volume.pkl")
price.head()
AAPL | GOOG | IBM | MSFT | |
---|---|---|---|---|
Date | ||||
2010-01-04 | 27.990226 | 313.062468 | 113.304536 | 25.884104 |
2010-01-05 | 28.038618 | 311.683844 | 111.935822 | 25.892466 |
2010-01-06 | 27.592626 | 303.826685 | 111.208683 | 25.733566 |
2010-01-07 | 27.541619 | 296.753749 | 110.823732 | 25.465944 |
2010-01-08 | 27.724725 | 300.709808 | 111.935822 | 25.641571 |
volume.head()
AAPL | GOOG | IBM | MSFT | |
---|---|---|---|---|
Date | ||||
2010-01-04 | 123432400 | 3927000 | 6155300 | 38409100 |
2010-01-05 | 150476200 | 6031900 | 6841400 | 49749600 |
2010-01-06 | 138040000 | 7987100 | 5605300 | 58182400 |
2010-01-07 | 119282800 | 12876600 | 5840600 | 50559700 |
2010-01-08 | 111902700 | 9483900 | 4197200 | 51197400 |
# Let's apply a built-in time series function to this data
# This gives us the percent change in the price from one date to the next
returns = price.pct_change()
returns.head()
AAPL | GOOG | IBM | MSFT | |
---|---|---|---|---|
Date | ||||
2010-01-04 | NaN | NaN | NaN | NaN |
2010-01-05 | 0.001729 | -0.004404 | -0.012080 | 0.000323 |
2010-01-06 | -0.015906 | -0.025209 | -0.006496 | -0.006137 |
2010-01-07 | -0.001849 | -0.023280 | -0.003462 | -0.010400 |
2010-01-08 | 0.006648 | 0.013331 | 0.010035 | 0.006897 |
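pct_change computes (current - previous) / previous for each column, which is why the first row is all NaN: there is no previous price. A quick sketch checking the definition by hand against the data loaded above (AAPL is just the example column):
# Sketch: the first non-NaN AAPL return computed by hand
p = price['AAPL']
(p.iloc[1] - p.iloc[0]) / p.iloc[0]  # should match returns['AAPL'].iloc[1]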
returns.describe()
AAPL | GOOG | IBM | MSFT | |
---|---|---|---|---|
count | 1713.000000 | 1713.000000 | 1713.000000 | 1713.000000 |
mean | 0.000972 | 0.000671 | 0.000236 | 0.000595 |
std | 0.016641 | 0.015830 | 0.012102 | 0.014667 |
min | -0.123558 | -0.083775 | -0.082790 | -0.113995 |
25% | -0.007516 | -0.006904 | -0.006049 | -0.007376 |
50% | 0.000886 | 0.000270 | 0.000234 | 0.000312 |
75% | 0.010422 | 0.008462 | 0.006806 | 0.008162 |
max | 0.088741 | 0.160524 | 0.056652 | 0.104522 |
# Is the percent change in AAPL correlated with IBM?
returns['AAPL'].corr(returns['IBM'])
np.float64(0.3868174361139099)
# We might ask if any of these returns are correlated or which are most highly correlated
returns.corr()
AAPL | GOOG | IBM | MSFT | |
---|---|---|---|---|
AAPL | 1.000000 | 0.407919 | 0.386817 | 0.389695 |
GOOG | 0.407919 | 1.000000 | 0.405099 | 0.465919 |
IBM | 0.386817 | 0.405099 | 1.000000 | 0.499764 |
MSFT | 0.389695 | 0.465919 | 0.499764 | 1.000000 |
# We might want to look at a covariance matrix
returns.cov()
AAPL | GOOG | IBM | MSFT | |
---|---|---|---|---|
AAPL | 0.000277 | 0.000107 | 0.000078 | 0.000095 |
GOOG | 0.000107 | 0.000251 | 0.000078 | 0.000108 |
IBM | 0.000078 | 0.000078 | 0.000146 | 0.000089 |
MSFT | 0.000095 | 0.000108 | 0.000089 | 0.000215 |
THERE ARE LOTS OF FUNCTIONS PROVIDED! TRY . [tab]!
DataFrames - Value Counts and Membership
= pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
df "Qu2": [2, 3, 1, 2, 3],
"Qu3": [1, 5, 2, 4, 4]})
df
Qu1 | Qu2 | Qu3 | |
---|---|---|---|
0 | 1 | 2 | 1 |
1 | 3 | 3 | 5 |
2 | 4 | 1 | 2 |
3 | 3 | 2 | 4 |
4 | 4 | 3 | 4 |
# Quickly count observations
df['Qu1'].value_counts().sort_index()
Qu1
1 1
3 2
4 2
Name: count, dtype: int64
# Look for Unique values
df['Qu1'].unique()
array([1, 3, 4])
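The membership half of this section's title refers to isin, which tests whether each value belongs to a given set and is handy as a filter. A small sketch using the same df (the values 1 and 3 are just examples):
# Sketch: membership test with isin, then use the boolean result as a filter
df['Qu1'].isin([1, 3])
df[df['Qu1'].isin([1, 3])]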
Day 2 - Homework - Quick analysis!
Your goal is to do a quick analysis of the Titanic data! You can answer any questions that you find interesting but here are some things to start with:
- How many variables and observations? Which are Numerical/Categorical?
- Do any of the columns have NaNs in them?
- How many passengers survived?
- Is survival correlated with Fare?
- How many passengers were alone vs. traveling with family?
- Were people traveling alone more or less likely to survive?
and so on… see if you can come up with some questions of your own!
# !conda install -y kagglehub
import kagglehub
# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")
print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.8), please consider upgrading to the latest version (0.3.13).
Path to dataset files: /home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1
Variable Notes
- PassengerId: Unique ID of the passenger
- Survived: Survived (1) or died (0)
- Pclass: Passenger's class (1st, 2nd, or 3rd)
- Name: Passenger's name
- Sex: Passenger's sex
- Age: Passenger's age
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Fare paid for the ticket
- Cabin: Cabin number
- Embarked: Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)
file = '/home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1/Titanic-Dataset.csv'
df = pd.read_csv(file)
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns