# Some basic package imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
Intermediate Data Science
Data Cleaning and Preparation
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours take place in Duke 209 – Office Hours Schedule
- Class Website
- Syllabus
Data Cleaning
Often in data science a huge portion of your time will be spent loading, cleaning, transforming, and rearranging data. Sometimes you will have data sets that contain many missing values; others will have bad formatting, such as numbers being read in as strings. Sometimes you will need to create dummy variables or introduce new variables into your data. Often you will need to take sub-samples of your data for training and testing a model, or simply because the data is too big to read in all at once. Fortunately, pandas and Python have lots of tools to help you in this process. Here are the main topics we will cover:
- Handling Missing Data
- Transforming Data
- String Manipulation
- Categorical Data
Handling Missing Data
Missing data can happen for a wide variety of reasons:
- The data really does not exist for a certain observation: In population data babies would not have a date of marriage or a list of children.
- The data was improperly entered: Errors are easy to make.
- The data set was damaged: Reading or writing issues happen.
- The missing data (None, NaN, or NA) means something important: Maybe a student did not take a test and that is important in your analysis.
Pandas uses floating point NaN (Not a Number) to represent missing data.
np.nan
nan
type(np.nan)
float
Pandas also treats the Python None as NA (Not Available). It will be considered NaN when doing analysis.
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data
0 aardvark
1 NaN
2 None
3 avocado
dtype: object
Methods to handle missing data:
Method | Description | Example
---|---|---
isna() / isnull() | Detect missing values. | df.isna() or df['column'].isnull()
notna() / notnull() | Detect non-missing values. | df.notna() or df['column'].notnull()
dropna() | Remove missing values. | df.dropna() drops rows with any NA; df.dropna(axis=1) drops columns with any NA
fillna() | Fill NA values with a specified value or method. | df.fillna(0) replaces NA with 0
replace() | Replace specified values, including NA. | df.replace(to_replace=np.nan, value=0)
string_data.isna()
0 False
1 True
2 True
3 False
dtype: bool
string_data.notna()
0 True
1 False
2 False
3 True
dtype: bool
string_data.fillna(0)
0 aardvark
1 0
2 0
3 avocado
dtype: object
string_data.replace(np.nan, 0)
0 aardvark
1 0
2 0
3 avocado
dtype: object
You will notice that these methods do not write over the data in memory. In other words, if we print what is in string_data we will see all the NaNs are still there:
string_data
0 aardvark
1 NaN
2 None
3 avocado
dtype: object
If you want to make the changes in memory you need to add the flag inplace=True, or reassign the variable:
string_data.fillna(0, inplace=True)
string_data = string_data.fillna(0)
string_data.fillna(0, inplace=True)
string_data
0 aardvark
1 0
2 0
3 avocado
dtype: object
BEWARE OF PYTHON mutable!
When you are saving data from one list to another you should be very careful about how you do it! Python lists are mutable, which means that when you set one list equal to another it does not make a new copy in memory; instead it copies a reference. Here is an example:
list1 = [5,4,3,2,1]
list2 = list1

print('Here is list2, it looks like a copy!')
print(list2)
print('Now we will change something in list2')
list2[0] = 10
print(list2)
print('Now look at list1')
print(list1)
print('BUT WE DIDN"T CHANGE LIST1 !!!!! WHY DID IT CHANGE????')
Here is list2, it looks like a copy!
[5, 4, 3, 2, 1]
Now we will change something in list2
[10, 4, 3, 2, 1]
Now look at list1
[10, 4, 3, 2, 1]
BUT WE DIDN"T CHANGE LIST1 !!!!! WHY DID IT CHANGE????
It changed because list1 and list2 point to the same object in memory. How do we stop this from happening? We need to use .copy() when making a copy of a list so that new memory gets allocated. Let's do the same computation but this time use .copy():
list1 = [5,4,3,2,1]
list2 = list1.copy() ## THIS IS OUR ONLY CHANGE

print('Here is list2, it looks like a copy!')
print(list2)
print('Now we will change something in list2')
list2[0] = 10
print(list2)
print('Now look at list1')
print(list1)
print('List1 did not change')
Here is list2, it looks like a copy!
[5, 4, 3, 2, 1]
Now we will change something in list2
[10, 4, 3, 2, 1]
Now look at list1
[5, 4, 3, 2, 1]
List1 did not change
Moral of the mutable story
If you are creating a new variable by setting it equal to another and you want to make changes to one without changing the other, you should use .copy() (see the DataFrame sketch after this list). This is true of all mutable Python types:
- list
- dict
- set
- pd.DataFrame
- pd.Series
- np.array
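The same pattern applies to pandas objects; here is a minimal sketch with a DataFrame (the names df1 and df2 are just for illustration):
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = df1.copy()       # .copy() allocates new memory
df2.loc[0, 'a'] = 99   # change df2 only
print(df1)             # df1 is unchanged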
Filtering out Missing Data
You want to be careful and intentional when filtering out missing data. We will explore the process with a DataFrame that contains lots of missing data.
data = pd.DataFrame([[1., 6.5, 3.], [np.nan, np.nan, 1.],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | NaN | NaN | 1.0 |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
# We can drop all NaNs
data.dropna()
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
# We can drop only rows that are completely NaNs
data.dropna(how='all')
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | NaN | NaN | 1.0 |
3 | NaN | 6.5 | 3.0 |
# We can keep only rows that have at least two non-NaN values (thresh=2)
data.dropna(thresh=2)
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
3 | NaN | 6.5 | 3.0 |
# We can keep only columns that have at least two non-NaN values
data.dropna(thresh=2, axis=1)
1 | 2 | |
---|---|---|
0 | 6.5 | 3.0 |
1 | NaN | 1.0 |
2 | NaN | NaN |
3 | 6.5 | 3.0 |
Notice that each of these decisions creates a very different result! You should also notice that some optional arguments are pretty common:
- axis=0 : do the calculation along the rows (usually the default)
- axis=1 : do the calculation along the columns
# data.dropna?
Filling in Missing Data
Sometimes you want to replace missing data in a DataFrame in a managed way so that it has predictable effects on the rest of your analysis. Most of the time you will use .fillna(), but there are some nice optional arguments that let you customize the command. Remember, to make changes in memory you need to add inplace=True.
data
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | NaN | NaN | 1.0 |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
# Change NaN to zero
data.fillna(0)
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 0.0 | 0.0 | 1.0 |
2 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 6.5 | 3.0 |
# Change each column differently - use a dictionary!
# Keys are column names
# Values are the fill
data.fillna({0: np.nan, 1: 'Hello', 2: 0})
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | NaN | Hello | 1.0 |
2 | NaN | Hello | 0.0 |
3 | NaN | 6.5 | 3.0 |
# Fill with a calculation - here we will fill with the column means
data.fillna(data.mean())
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.000000 |
1 | 1.0 | 6.5 | 1.000000 |
2 | 1.0 | 6.5 | 2.333333 |
3 | 1.0 | 6.5 | 3.000000 |
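Besides constants and per-column values, you can also fill from neighboring rows. A small sketch using forward fill (.ffill()), which copies the last valid value down each column; rows with no earlier valid value stay NaN:
# Forward fill: propagate the last non-missing value down each column
data.ffill()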
Data Transformation
Next we will talk a bit more about cleaning data:
- Removing Duplicate Data
- Replacing Data
- Renaming
- Discretizing and Binning
- Outliers
- Sampling
- Dummy Variables
Duplicate Data
Sometimes data sets will have duplicate entries; maybe they are not identical, but they represent the same thing. Here column k1 has the words and k2 has the numerical values:
= pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
data "k2": [1, 1, 2, 3, 3, 4, 4]})
data
k1 | k2 | |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
Notice that observations 5 and 6 are identical! We can check for this using the .duplicated() and .drop_duplicates() commands.
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
k1 | k2 | |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
You can filter duplicates on a subset of the data if you specify the columns that you want to filter on. Let's add two columns to our data and then drop duplicates.
"v1"] = range(7)
data['v2'] = [1 for i in range(7)]
data[ data
k1 | k2 | v1 | v2 | |
---|---|---|---|---|
0 | one | 1 | 0 | 1 |
1 | two | 1 | 1 | 1 |
2 | one | 2 | 2 | 1 |
3 | two | 3 | 3 | 1 |
4 | one | 3 | 4 | 1 |
5 | two | 4 | 5 | 1 |
6 | two | 4 | 6 | 1 |
# We can drop any duplicates just in the k1 column - returns just the first two observations
# Notice we still get back all the other data in those two rows
data.drop_duplicates(subset=['k1'])
k1 | k2 | v1 | v2 | |
---|---|---|---|---|
0 | one | 1 | 0 | 1 |
1 | two | 1 | 1 | 1 |
# We can look for duplicates across two different columns
data.drop_duplicates(subset=['k2','v2'])
k1 | k2 | v1 | v2 | |
---|---|---|---|---|
0 | one | 1 | 0 | 1 |
2 | one | 2 | 2 | 1 |
3 | two | 3 | 3 | 1 |
5 | two | 4 | 5 | 1 |
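By default .drop_duplicates() keeps the first occurrence of each duplicate; the keep argument changes that. A small sketch:
# Keep the last occurrence of each duplicated k1 value instead of the first
data.drop_duplicates(subset=['k1'], keep='last')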
Transforming Data and Mapping
We will imagine that you have some data that tells you about the amount of different types of fruit. Say you want to add a new column to this data that says what kind of fruit each one is: citrus, berry, tropical, or pome. You can use a dictionary and the .map()
function to add this information.
# Here is our made up data
data = pd.DataFrame({
    "food": ["orange", "blueberry", "orange",
             "banana", "strawberry", "orange",
             "banana", "apple", "blackberry"],
    "ounces": [4, -999, 12, 6, 7.5, 8, -999, 5, 6]
})
data
food | ounces | |
---|---|---|
0 | orange | 4.0 |
1 | blueberry | -999.0 |
2 | orange | 12.0 |
3 | banana | 6.0 |
4 | strawberry | 7.5 |
5 | orange | 8.0 |
6 | banana | -999.0 |
7 | apple | 5.0 |
8 | blackberry | 6.0 |
# Here is a dictionary mapping fruits to categories
food_to_category = {
    "orange": "citrus",
    "blueberry": "berry",
    "strawberry": "berry",
    "blackberry": "berry",
    "banana": "tropical",
    "apple": "pome"
}
# Now we will add the column
data['category'] = data['food'].map(food_to_category)
data
food | ounces | category | |
---|---|---|---|
0 | orange | 4.0 | citrus |
1 | blueberry | -999.0 | berry |
2 | orange | 12.0 | citrus |
3 | banana | 6.0 | tropical |
4 | strawberry | 7.5 | berry |
5 | orange | 8.0 | citrus |
6 | banana | -999.0 | tropical |
7 | apple | 5.0 | pome |
8 | blackberry | 6.0 | berry |
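.map() also accepts a function instead of a dictionary, which is handy when the new column comes from a rule rather than a lookup table. A minimal sketch:
# A function works too: flag rows where the recorded ounces are clearly impossible
data['ounces'].map(lambda x: x < 0)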
Replacing Values
We have seen above how to replace NaN values, but what if there were other types of things you wanted to replace in the dataset? The .replace()
function can replace any data you want with replacement data.
# Change the category given to banana
data.replace('tropical', 'berry', inplace=True)
data
food | ounces | category | |
---|---|---|---|
0 | orange | 4.0 | citrus |
1 | blueberry | -999.0 | berry |
2 | orange | 12.0 | citrus |
3 | banana | 6.0 | berry |
4 | strawberry | 7.5 | berry |
5 | orange | 8.0 | citrus |
6 | banana | -999.0 | berry |
7 | apple | 5.0 | pome |
8 | blackberry | 6.0 | berry |
# Remove numbers that we know are wrong
# Sometimes NaN are coded with large negative or really unreasonable numbers
data.replace(-999, np.nan)
food | ounces | category | |
---|---|---|---|
0 | orange | 4.0 | citrus |
1 | blueberry | NaN | berry |
2 | orange | 12.0 | citrus |
3 | banana | 6.0 | berry |
4 | strawberry | 7.5 | berry |
5 | orange | 8.0 | citrus |
6 | banana | NaN | berry |
7 | apple | 5.0 | pome |
8 | blackberry | 6.0 | berry |
Renaming
There are lots of ways to rename things on both the column labels and the index labels in a data frame. Here are a few examples of doing this.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
# We could change things directly
# Here we send in a dictionary that gives the label as a key and the new label as the value
# Columns
data.rename(columns={'three':'THREE', 'four':'FOUR'}, inplace=True)
data
one | two | THREE | FOUR | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
# Rows
data.rename(index={'Ohio':'California'}, inplace=True)
data
one | two | THREE | FOUR | |
---|---|---|---|---|
California | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
# We could also define a function that will transform the data through a map
def new_names(x):
    return x[:4].upper()

data.index = data.index.map(new_names)
data
one | two | THREE | FOUR | |
---|---|---|---|---|
CALI | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
# define a dictionary map
text_to_num = {'one':1, 'two':2, 'three':3, 'four':4}
# define a function
def change_to_num(x):
    return text_to_num[x.lower()]

data.columns = data.columns.map(change_to_num)
data
1 | 2 | 3 | 4 | |
---|---|---|---|---|
CALI | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
You Try
Run the cell below to get your data. Then update the column and index names, using the renaming methods above, so that they are consistent and easy to use. Your choice for how you want the final labels to be!
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["red green", "Blue_green", "green "],
                    columns=["ONE", "two", "3", "Four"])
data
ONE | two | 3 | Four | |
---|---|---|---|---|
red green | 0 | 1 | 2 | 3 |
Blue_green | 4 | 5 | 6 | 7 |
green | 8 | 9 | 10 | 11 |
# Your code here
Discretization and Binning
Sometimes you will want to take continuous data and represent it as bins. This is often done in a histogram, but in data science maybe you want to define categories based on a continuous numerical variable. For example, maybe you want high, medium, and low income classes. This is where binning will help.
# Get the list of ages
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# Choose the edges of your bins
# Here we will do over 18 up to 25, over 25 up to 35, over 35 up to 60, and over 60 up to 100
bins = [18, 25, 35, 60, 100]
# Have pandas cut the data into bins
age_categories = pd.cut(ages, bins)
age_categories
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# Now put this data into a DataFrame
data = pd.DataFrame()
data['age'] = ages
data['range'] = age_categories
data
age | range | |
---|---|---|
0 | 20 | (18, 25] |
1 | 22 | (18, 25] |
2 | 25 | (18, 25] |
3 | 27 | (25, 35] |
4 | 21 | (18, 25] |
5 | 23 | (18, 25] |
6 | 37 | (35, 60] |
7 | 31 | (25, 35] |
8 | 61 | (60, 100] |
9 | 45 | (35, 60] |
10 | 41 | (35, 60] |
11 | 32 | (25, 35] |
# The object returned by .cut() has some other features
# You can look at the categories that were in the data set
age_categories.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
# You can also get category codes
age_categories.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
# Let's add the codes to the DataFrame
data['code'] = age_categories.codes
data
age | range | code | |
---|---|---|---|
0 | 20 | (18, 25] | 0 |
1 | 22 | (18, 25] | 0 |
2 | 25 | (18, 25] | 0 |
3 | 27 | (25, 35] | 1 |
4 | 21 | (18, 25] | 0 |
5 | 23 | (18, 25] | 0 |
6 | 37 | (35, 60] | 2 |
7 | 31 | (25, 35] | 1 |
8 | 61 | (60, 100] | 3 |
9 | 45 | (35, 60] | 2 |
10 | 41 | (35, 60] | 2 |
11 | 32 | (25, 35] | 1 |
Notice that now we have both the range for the category and a code that puts them into discrete numerical groups.
In the range column the notation that you see is:
- ( means exclusive (the endpoint is not included)
- ] means inclusive (the endpoint is included)
so you would read the range \((18,25]\) as ages greater than 18, up to and including 25.
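If you would rather see readable names than interval notation, pd.cut() also accepts a labels argument with one label per bin. A small sketch with made-up label names:
# Same bins as above, but with human-readable labels (names are just examples)
pd.cut(ages, bins, labels=['young adult', 'adult', 'middle aged', 'senior'])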
# You can also give pandas a number of bins to use and
# it will compute equal length bins to put your data into
pd.cut(ages, 4)
[(19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], ..., (30.25, 40.5], (50.75, 61.0], (40.5, 50.75], (40.5, 50.75], (30.25, 40.5]]
Length: 12
Categories (4, interval[float64, right]): [(19.959, 30.25] < (30.25, 40.5] < (40.5, 50.75] < (50.75, 61.0]]
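A number of bins gives equal-width bins, so some can end up nearly empty. If you instead want bins that each hold roughly the same number of observations, pandas also provides pd.qcut(), which cuts at sample quantiles:
# Quantile-based bins: roughly the same number of ages lands in each bin
pd.qcut(ages, 4)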
You Try
Run the cell below to create a random list of numbers to represent ages in your population. Then make up your own age range categories (at least 5) and use .cut()
to break the data into discrete categories. Create a data frame that contains the age, the age range, and the age category code.
ages = [np.random.randint(15,100) for i in range(40)]
# Your code here
Detecting Outliers
Sometimes you want to be able to detect outliers in a dataset; however, this process can involve a variety of operations and is highly dependent on how you define outliers in your data. Here is an example data set that just randomly assigns values based on a normal distribution:
data = pd.DataFrame(np.random.standard_normal((1000, 1)))
data.describe()
0 | |
---|---|
count | 1000.000000 |
mean | -0.028999 |
std | 1.022936 |
min | -3.284727 |
25% | -0.706225 |
50% | 0.014594 |
75% | 0.642945 |
max | 3.360357 |
# You could write a function
# These can sometimes get complicated
def check_outlier(x, q1=data.quantile(0.25)[0], q3=data.quantile(0.75)[0]):
    '''
    This function takes the quantiles and then applies the outlier
    check based on the interquartile range.
    It can only accept one column at a time.
    NOTE: When using a function as a map, you can only pass in one variable.
    '''
    IQR = q3 - q1
    # Determine bounds
    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    if x > upper_bound:
        return True
    elif x < lower_bound:
        return True
    else:
        return False
# And then apply a map
data['outlier'] = data.map(check_outlier)
data
0 | outlier | |
---|---|---|
0 | 0.158828 | False |
1 | -0.651900 | False |
2 | 0.663735 | False |
3 | 0.062495 | False |
4 | -0.187718 | False |
... | ... | ... |
995 | 0.655867 | False |
996 | -0.673719 | False |
997 | -0.958994 | False |
998 | 0.407369 | False |
999 | -1.239003 | False |
1000 rows × 2 columns
# Check how many outliers we have
data['outlier'].value_counts()
outlier
False 990
True 10
Name: count, dtype: int64
# Mask out the outliers
data = data[data['outlier'] != True]
data
0 | outlier | |
---|---|---|
0 | 0.158828 | False |
1 | -0.651900 | False |
2 | 0.663735 | False |
3 | 0.062495 | False |
4 | -0.187718 | False |
... | ... | ... |
995 | 0.655867 | False |
996 | -0.673719 | False |
997 | -0.958994 | False |
998 | 0.407369 | False |
999 | -1.239003 | False |
990 rows × 2 columns
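For a single numeric column you can build the same IQR mask with vectorized comparisons instead of mapping a Python function over every element. A minimal sketch, assuming the numeric column is still labeled 0:
q1, q3 = data[0].quantile(0.25), data[0].quantile(0.75)
IQR = q3 - q1
mask = (data[0] < q1 - 1.5 * IQR) | (data[0] > q3 + 1.5 * IQR)
data_clean = data[~mask]  # keep only the non-outliers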
# you could also just choose an upper value
# then correct for anything outside that value
# here is our data
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
display(data)
# here is our update
data[data.abs() > 1] = np.sign(data) * 2
display(data)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -1.168379 | -0.506053 | -1.229291 | -0.482408 |
1 | 1.032789 | 0.366741 | -0.106486 | -0.740361 |
2 | 0.591173 | -0.800954 | 0.240408 | -0.298359 |
3 | 0.577323 | -0.121570 | -0.102918 | 2.115196 |
4 | 0.135363 | -0.331980 | -0.637856 | 0.243791 |
... | ... | ... | ... | ... |
995 | 1.681474 | 0.150533 | -0.484748 | 0.759062 |
996 | 1.019167 | 0.175946 | -0.514699 | 0.038862 |
997 | -0.865868 | -0.412088 | 1.174607 | -0.246017 |
998 | 0.018156 | -0.132022 | -0.874302 | 0.454300 |
999 | 0.388876 | 0.284984 | -1.110518 | -1.463894 |
1000 rows × 4 columns
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -2.000000 | -0.506053 | -2.000000 | -0.482408 |
1 | 2.000000 | 0.366741 | -0.106486 | -0.740361 |
2 | 0.591173 | -0.800954 | 0.240408 | -0.298359 |
3 | 0.577323 | -0.121570 | -0.102918 | 2.000000 |
4 | 0.135363 | -0.331980 | -0.637856 | 0.243791 |
... | ... | ... | ... | ... |
995 | 2.000000 | 0.150533 | -0.484748 | 0.759062 |
996 | 2.000000 | 0.175946 | -0.514699 | 0.038862 |
997 | -0.865868 | -0.412088 | 2.000000 | -0.246017 |
998 | 0.018156 | -0.132022 | -0.874302 | 0.454300 |
999 | 0.388876 | 0.284984 | -2.000000 | -2.000000 |
1000 rows × 4 columns
You Try
Explain in great detail what each of the lines in the above cell did to both create and then update the data frame.
# Your WORDS here - change this cell to markdown
Permutation and Random Sampling
Often when doing a data science project you will want to take random samples of your data. This might be to help you avoid bias in the ordering of your data. It might be to create model training and testing data sets. Or maybe your data set is too big and you want to start with a smaller subset of the data. There are lots of ways to do this:
- NUMPY has random.permutation(), which will give you a list of integers in a range that are permuted (rearranged) randomly.
- PANDAS has a function .sample() that can take a sample from a DataFrame or Series.
- Other packages: later this semester we will see other packages like sklearn that can create test-train splits of your data (see the sketch below).
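As a quick preview of the sklearn approach (assuming scikit-learn is installed; we will cover it properly later in the semester), here is a minimal sketch that splits a DataFrame df into training and testing pieces:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)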
# Here is some fake data
df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
# Notice how this data is really ordered - maybe this is not good for your analysis
# We will grab the rows in a different order randomly
num_rows = 5
sample = np.random.permutation(num_rows)
print(sample)
# Now get the rows in that order
df.take(sample)
[4 0 3 2 1]
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
# We can also grab the columns in a different order
num_cols = df.shape[1]
sample = np.random.permutation(num_cols)
print(sample)
# Now get the columns in that order
df.take(sample, axis=1)
[6 0 4 2 3 1 5]
6 | 0 | 4 | 2 | 3 | 1 | 5 | |
---|---|---|---|---|---|---|---|
0 | 6 | 0 | 4 | 2 | 3 | 1 | 5 |
1 | 13 | 7 | 11 | 9 | 10 | 8 | 12 |
2 | 20 | 14 | 18 | 16 | 17 | 15 | 19 |
3 | 27 | 21 | 25 | 23 | 24 | 22 | 26 |
4 | 34 | 28 | 32 | 30 | 31 | 29 | 33 |
# From within pandas we can get a sub-sample of our data
df.sample(n=3)
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
# Sometimes you don't mind choosing the same row twice
# replace=True lets you sample the same row more than once
df.sample(n=30, replace=True)
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
3 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
2 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
1 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
4 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
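Instead of a fixed count n, .sample() also accepts a frac argument to request a fraction of the rows. A small sketch:
# Randomly keep 60% of the rows (no replacement by default)
df.sample(frac=0.6)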
Dummy Variables
Dummy variables are variables that take the place of something in your data set. They are especially useful for encoding categorical data. In these cases you replace a column of categories with several columns of 0s and 1s that represent whether or not (True/False) the observation belongs to each category. Here is an example:
Let's say we have some categorical data that we want to interact with numerically. Below you will see the key column, which contains a, b, and c. Let's generate dummy variables for this data.
= pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
df "data1": range(6)})
df
key | data1 | |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
= pd.get_dummies(df["key"])
dummies dummies
a | b | c | |
---|---|---|---|
0 | False | True | False |
1 | False | True | False |
2 | True | False | False |
3 | False | False | True |
4 | True | False | False |
5 | False | True | False |
= pd.get_dummies(df["key"],dtype=float)
dummies dummies
a | b | c | |
---|---|---|---|
0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 |
2 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 1.0 |
4 | 1.0 | 0.0 | 0.0 |
5 | 0.0 | 1.0 | 0.0 |
= pd.get_dummies(df["key"],dtype=int)
dummies dummies
a | b | c | |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
# Now add this to the original data frame
df_new = df.join(dummies)
df_new
key | data1 | a | b | c | |
---|---|---|---|---|---|
0 | b | 0 | 0 | 1 | 0 |
1 | b | 1 | 0 | 1 | 0 |
2 | a | 2 | 1 | 0 | 0 |
3 | c | 3 | 0 | 0 | 1 |
4 | a | 4 | 1 | 0 | 0 |
5 | b | 5 | 0 | 1 | 0 |
Now we have three new columns each of which represents a true/false for whether or not the data had that key.
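pandas can also do the join step for you: pass the whole DataFrame and name the columns to encode, and .get_dummies() replaces the original column and adds a prefix automatically. A minimal sketch:
# One-step version: encode the key column in place with an automatic key_ prefix
pd.get_dummies(df, columns=['key'], dtype=int)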
Here is a slightly more complicated example.
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000. Thanks to Shyong Lam and Jon Herlocker for cleaning up and generating the data set. See README for more information
= ["movie_id", "title", "genres"]
mnames = pd.read_table("data/movielens/movies.dat", sep="::",
movies =None, names=mnames, engine="python")
header10] movies[:
movie_id | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
5 | 6 | Heat (1995) | Action|Crime|Thriller |
6 | 7 | Sabrina (1995) | Comedy|Romance |
7 | 8 | Tom and Huck (1995) | Adventure|Children's |
8 | 9 | Sudden Death (1995) | Action |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller |
Notice how we have a bunch of genres. Maybe we want to break this up into categories that can be used mathematically - meaning we need numbers instead of words. The pandas .get_dummies() command can help us here!
First let's get the counts for the different category combinations:
movies['genres'].value_counts()
genres
Drama 843
Comedy 521
Horror 178
Comedy|Drama 162
Comedy|Romance 142
...
Drama|Film-Noir 1
Comedy|Horror|Sci-Fi 1
Adventure|Drama|Romance|Sci-Fi 1
Adventure|Animation|Sci-Fi 1
Adventure|Crime|Sci-Fi|Thriller 1
Name: count, Length: 301, dtype: int64
Notice how there are 301 different categories. Why so many? Well, if we look at the names, some movies are part of multiple categories, and those categories are split up by the | character. How could we find out the number of genres?
# I will do this by writing a quick for loop
genre_set = set()
for g in movies['genres']:
    for n in g.split('|'):
        genre_set.add(n)
print(len(genre_set))
genre_set
18
{'Action',
'Adventure',
'Animation',
"Children's",
'Comedy',
'Crime',
'Documentary',
'Drama',
'Fantasy',
'Film-Noir',
'Horror',
'Musical',
'Mystery',
'Romance',
'Sci-Fi',
'Thriller',
'War',
'Western'}
Okay so really there are 18 total genres. Let’s use pandas to break this into dummy variables.
= movies['genres'].str.get_dummies("|")
dummies dummies
Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3878 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3880 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3881 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3882 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3883 rows × 18 columns
Notice that this looks at the ‘genres’ element, breaks up the string by the | character, and then assigns it a 1 in any category that it belongs to. If we look at the first row, which represents Toy Story, we can see this movie belongs to the categories: Animation, Children’s, Comedy.
Let’s add these dummy variables to our original DataFrame
# The .add_prefix() function adds a name to the beginning of the column
# This helps specify which variable the dummy is representing
movies_new = movies.join(dummies.add_prefix('genre_'))
movies_new
movie_id | title | genres | genre_Action | genre_Adventure | genre_Animation | genre_Children's | genre_Comedy | genre_Crime | genre_Documentary | ... | genre_Fantasy | genre_Film-Noir | genre_Horror | genre_Musical | genre_Mystery | genre_Romance | genre_Sci-Fi | genre_Thriller | genre_War | genre_Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3878 | 3948 | Meet the Parents (2000) | Comedy | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3879 | 3949 | Requiem for a Dream (2000) | Drama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3880 | 3950 | Tigerland (2000) | Drama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3881 | 3951 | Two Family House (2000) | Drama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3882 | 3952 | Contender, The (2000) | Drama|Thriller | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3883 rows × 21 columns
String Manipulation
One of the most common things you will have to do is interact with data that is formatted as strings (words). This can happen because of the way you saved your data (maybe everything got turned into strings) or because of the way the data was originally entered.
- Simple strings to numbers
- Object methods
Anything that is entered into Python with quotes or read in as a string will be assumed to be a string. Even though the number below is clearly, to our human minds, a number, Python sees it as a word.
number = '3'
number
'3'
# We can change it into an integer
int(number)
3
# We can change it into a float
float(number)
3.0
# We can change a number back into a string
str(3.0)
'3.0'
If we start with a more complicated string there are lots of things we can do to alter it. The .split() function can split up a string however you want! It turns the string into a list of substrings.
= "a,b, guido"
val ",") val.split(
['a', 'b', ' guido']
val.split(' ')
['a,b,', 'guido']
# Notice how the extra whitespace is annoying here!
string_list = val.split(',')
print(string_list)
# The .strip() will get rid of any white space
# Here is an example of a for loop inside a list
# This is called list comprehension
new_string_list = [x.strip() for x in string_list]
print(new_string_list)
['a', 'b', ' guido']
['a', 'b', 'guido']
# Maybe we then want to put these pieces together
# the .join() command will cycle through the list and join up the strings
# with the given string
'.'.join(new_string_list)
'a.b.guido'
# You can check for membership
print('guido' in new_string_list)
print('g' in 'guido')
True
True
# You can find things inside a string
my_string = 'Hello World'
print(my_string.find('W'))
my_string[6]
6
'W'
You Try
Break the following string up into a list of strings using string manipulation functions. See if you can create a list like this:
['Joanna','Bieri','Redlands','Keep up the good work!']
Try to get all the capitals and spacing correct!
NOTE - lots of different processes will result in this final list; there is not one right way to do this!
a_string = 'joanna_bieri@redlands.edu says: Keep up the good work!'
# Your code here
There are so many different string methods! Here are some of my favorites:
- .replace(old text, new text) : replace text
- .rstrip() : strip from the right end of the string
- .lstrip() : strip from the left end of the string
- .lower() : make the string all lower case
- .upper() : make the string all upper case
- .title() : capitalize each first letter
For more advanced string manipulation you can use regex (import re); look in the book or online for more information.
Many of the string methods are also implemented directly in pandas and can be applied directly to Series data.
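For example, on a Series the methods hang off the .str accessor, and missing values pass through as NaN. A small sketch:
s = pd.Series(['apple', ' Banana ', None])
print(s.str.strip().str.lower())  # vectorized strip + lower; None stays missing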
Categorical Data
Pandas has a built-in data type called Categorical. This helps encode certain columns or parts of your data as categorical, rather than just an assortment of strings. Categorical has the advantage of being better for memory and for sorting and organizing data. Here is an example:
# Simulate a column with repeated strings
# Do not turn the data into categorical - keep it as strings
n = 1_000_000
df = pd.DataFrame({
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], size=n)
})
df
city | |
---|---|
0 | New York |
1 | Los Angeles |
2 | New York |
3 | New York |
4 | New York |
... | ... |
999995 | New York |
999996 | Chicago |
999997 | New York |
999998 | New York |
999999 | New York |
1000000 rows × 1 columns
# Now create a column with the same data, but tell pandas it is categorical
df['city_cat'] = df['city'].astype('category')
df
city | city_cat | |
---|---|---|
0 | New York | New York |
1 | Los Angeles | Los Angeles |
2 | New York | New York |
3 | New York | New York |
4 | New York | New York |
... | ... | ... |
999995 | New York | New York |
999996 | Chicago | Chicago |
999997 | New York | New York |
999998 | New York | New York |
999999 | New York | New York |
1000000 rows × 2 columns
# Look at the memory usage of each column
print(df['city'].memory_usage(deep=True))
print(df['city_cat'].memory_usage(deep=True))
57667564
1000413
# Let's get the list of cities and order them reverse alphabetically
levels = list(df['city'].value_counts().keys().sort_values(ascending=False))
levels
['New York', 'Los Angeles', 'Chicago']
# Now create a categorical data column that includes the levels as categories
# and tells pandas that the categories are ordered.
df['city_cat_levels'] = pd.Categorical(df['city'], categories=levels, ordered=True)
df
city | city_cat | city_cat_levels | |
---|---|---|---|
0 | New York | New York | New York |
1 | Los Angeles | Los Angeles | Los Angeles |
2 | New York | New York | New York |
3 | New York | New York | New York |
4 | New York | New York | New York |
... | ... | ... | ... |
999995 | New York | New York | New York |
999996 | Chicago | Chicago | Chicago |
999997 | New York | New York | New York |
999998 | New York | New York | New York |
999999 | New York | New York | New York |
1000000 rows × 3 columns
# Because we made our city_cat_levels column categorical and assigned levels,
# we can compare using < and >
# and assign an ordering to our data
print(df['city_cat_levels'] > 'Los Angeles') # True for 'Chicago'
0 False
1 False
2 False
3 False
4 False
...
999995 False
999996 True
999997 False
999998 False
999999 False
Name: city_cat_levels, Length: 1000000, dtype: bool
# The standard comparison only gives you the option to check alphabetically
print(df['city'] > 'Los Angeles')
0 True
1 False
2 True
3 True
4 True
...
999995 True
999996 False
999997 True
999998 True
999999 True
Name: city, Length: 1000000, dtype: bool
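Categorical columns also expose their categories and the integer codes that back them through the .cat accessor; a small sketch:
# Inspect the ordered categories and the integer codes backing the column
print(df['city_cat_levels'].cat.categories)
print(df['city_cat_levels'].cat.codes.head())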