Intermediate Data Science

Data Cleaning and Preparation

Author

Joanna Bieri
DATA201


Data Cleaning

Often in data science a huge portion of your time will be spent loading, cleaning, transforming, and rearranging data. Sometimes you will have data sets that contain many missing values; others will have bad formatting, such as numbers being read in as strings. Sometimes you will need to create dummy variables or introduce new variables into your data. Often you will need to take sub-samples of your data for training and testing a model, or simply because the data is too big to read in all at one time. Fortunately, pandas and Python have lots of tools to help you in this process. Here are the main topics we will cover:

  1. Handling Missing Data
  2. Transforming Data
  3. String Manipulation
  4. Categorical Data
# Some basic package imports
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

Handling Missing Data

Missing data can happen for a wide variety of reasons:

  • The data really does not exist for a certain observation: In population data babies would not have a date of marriage or a list of children.
  • The data was improperly entered: Errors are easy to make.
  • The data set was damaged: Reading or writing issues happen.
  • The missing data (None, NaN, or NA) means something important: Maybe a student did not take a test and that is important in your analysis.

Pandas uses floating point NaN (Not a Number) to represent missing data.

np.nan
nan
type(np.nan)
float

Pandas will treat the Python value None as NA (Not Available). It is considered NaN when doing analysis.

string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data
0    aardvark
1         NaN
2        None
3     avocado
dtype: object

Methods to handle missing data:

  1. isna() / isnull()
  • Description: Detect missing values.

    df.isna()
    df['column'].isnull()
  2. notna() / notnull()
  • Description: Detect non-missing values.

    df.notna()
    df['column'].notnull()
  3. dropna()
  • Description: Remove missing values.

    df.dropna()  # Drop rows with any NA values
    df.dropna(axis=1)  # Drop columns with any NA values
  4. fillna()
  • Description: Fill NA values with a specified value or method.

    df.fillna(0)  # Replace NA with 0
  5. replace()
  • Description: Replace specified values, including NA.

    df.replace(to_replace=np.nan, value=0)
string_data.isna()
0    False
1     True
2     True
3    False
dtype: bool
string_data.notna()
0     True
1    False
2    False
3     True
dtype: bool
string_data.fillna(0)
0    aardvark
1           0
2           0
3     avocado
dtype: object
string_data.replace(np.nan,0)
0    aardvark
1           0
2           0
3     avocado
dtype: object

You will notice that these methods do not write over the data in memory. In other words, if we print string_data we will see all the NaNs are still there:

string_data
0    aardvark
1         NaN
2        None
3     avocado
dtype: object

If you want to make the changes in memory you need to add the flag inplace=True, or reset the variable:

string_data.fillna(0, inplace=True)   # Option 1: change in place
string_data = string_data.fillna(0)   # Option 2: reset the variable
string_data
0    aardvark
1           0
2           0
3     avocado
dtype: object

BEWARE OF PYTHON MUTABLES!

When you are saving data from one variable to another you should be very careful about how you do it! Python lists are mutable: when you set one list equal to another, Python does not make a new copy in memory; instead it copies a reference. Here is an example:

list1 = [5,4,3,2,1]
list2 = list1

print('Here is list2, it looks like a copy!')
print(list2)

print('Now we will change something in list2')
list2[0] = 10
print(list2)

print('Now look at list1')
print(list1)

print("BUT WE DIDN'T CHANGE LIST1 !!!!! WHY DID IT CHANGE????")
Here is list2, it looks like a copy!
[5, 4, 3, 2, 1]
Now we will change something in list2
[10, 4, 3, 2, 1]
Now look at list1
[10, 4, 3, 2, 1]
BUT WE DIDN'T CHANGE LIST1 !!!!! WHY DID IT CHANGE????

It changed because list1 and list2 point to the same object in memory. How do we stop this from happening? We need to use .copy() when making a copy of a list so that new memory gets allocated. Let's do the same computation, but this time use .copy():

list1 = [5,4,3,2,1]
list2 = list1.copy() ## THIS IS OUR ONLY CHANGE

print('Here is list2, it looks like a copy!')
print(list2)

print('Now we will change something in list2')
list2[0] = 10
print(list2)

print('Now look at list1')
print(list1)
print('List1 did not change')
Here is list2, it looks like a copy!
[5, 4, 3, 2, 1]
Now we will change something in list2
[10, 4, 3, 2, 1]
Now look at list1
[5, 4, 3, 2, 1]
List1 did not change

Moral of the mutable story

If you are creating a new variable by setting it equal to another mutable object and you want to make changes to one without changing the other, you should use .copy(). This is true of all mutable Python types (see the DataFrame example after this list):

  • list
  • dict
  • set
  • pd.DataFrame
  • pd.Series
  • np.array
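
Here is the same experiment as above, but with a DataFrame; a minimal sketch showing that .copy() protects the original:

# A quick check that .copy() matters for DataFrames too
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = df1          # reference: both names point to the same object
df3 = df1.copy()   # new memory: an independent copy

df2.loc[0, 'a'] = 100
print(df1.loc[0, 'a'])   # prints 100 - changing df2 also changed df1
df3.loc[0, 'a'] = -5
print(df1.loc[0, 'a'])   # still 100 - changing the copy left df1 alone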

Filtering out Missing Data

You want to be careful and intentional when filtering out missing data. We will explore the process with a DataFrame that contains lots of missing data.

data = pd.DataFrame([[1., 6.5, 3.], [ np.nan, np.nan, 1.],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
0 1 2
0 1.0 6.5 3.0
1 NaN NaN 1.0
2 NaN NaN NaN
3 NaN 6.5 3.0
# We can drop all rows that contain any NaNs
data.dropna()
0 1 2
0 1.0 6.5 3.0
# We can drop only rows that are completely NaNs
data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 NaN NaN 1.0
3 NaN 6.5 3.0
# We can keep only rows that have at least two non-NaN values
data.dropna(thresh=2)
0 1 2
0 1.0 6.5 3.0
3 NaN 6.5 3.0
# We can keep only columns that have at least two non-NaN values
data.dropna(thresh=2,axis=1)
1 2
0 6.5 3.0
1 NaN 1.0
2 NaN NaN
3 6.5 3.0

Notice that each of these decisions creates a very different result! You should also notice that some optional arguments are pretty common:

  • axis=0 do the calculation to the rows - usually default
  • axis=1 do the calculation to the columns
# data.dropna?

Filling in Missing Data

Sometimes you want to replace missing data in a DataFrame in a managed way so that it has predictable effects on the rest of your analysis. Most of the time you will use .fillna() but there are some nice optional arguments that let you customize the command. Remember, to make changes in memory you need to add inplace=True.

data
0 1 2
0 1.0 6.5 3.0
1 NaN NaN 1.0
2 NaN NaN NaN
3 NaN 6.5 3.0
# Change NaN to zero
data.fillna(0)
0 1 2
0 1.0 6.5 3.0
1 0.0 0.0 1.0
2 0.0 0.0 0.0
3 0.0 6.5 3.0
# Change each column differently - use a dictionary!
# Keys are column names
# Values are the fill 
data.fillna({0:np.nan,1:'Hello',2:0})
0 1 2
0 1.0 6.5 3.0
1 NaN Hello 1.0
2 NaN Hello 0.0
3 NaN 6.5 3.0
# Fill with a calculation - here we will fill with the column means
data.fillna(data.mean())
0 1 2
0 1.0 6.5 3.000000
1 1.0 6.5 1.000000
2 1.0 6.5 2.333333
3 1.0 6.5 3.000000
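
Another handy option is filling from neighboring rows: .ffill() propagates the last valid value forward and .bfill() propagates the next valid value backward. A quick sketch using the same data:

# Fill each NaN with the last valid value above it (column by column)
data.ffill()
# Fill each NaN with the next valid value below it
data.bfill()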

Data Transformation

Next we will talk a bit more about cleaning data:

  1. Removing Duplicate Data
  2. Replacing Data
  3. Renaming
  4. Discretizing and Binning
  5. Outliers
  6. Sampling
  7. Dummy Variables

Duplicate Data

Sometimes data sets will have duplicate observations; maybe they are not identical in every column, but they represent the same thing. Here column k1 has the words and k2 has the numerical values:

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4

Notice that observations 5 and 6 are identical! We can check for this using the .duplicated() and .drop_duplicates() commands.

data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4

You can filter duplicates on a subset of the data if you specify the columns that you want to compare. Let's add some columns to our data and then drop duplicates.

data["v1"] = range(7)
data['v2'] = [1 for i in range(7)]
data
k1 k2 v1 v2
0 one 1 0 1
1 two 1 1 1
2 one 2 2 1
3 two 3 3 1
4 one 3 4 1
5 two 4 5 1
6 two 4 6 1
# We can drop any duplicates just in the k1 column - returns just the first two observations
# Notice we still get back all the other data in those two rows
data.drop_duplicates(subset=['k1'])
k1 k2 v1 v2
0 one 1 0 1
1 two 1 1 1
# We can look for duplicates across two different columns
data.drop_duplicates(subset=['k2','v2'])
k1 k2 v1 v2
0 one 1 0 1
2 one 2 2 1
3 two 3 3 1
5 two 4 5 1

Transforming Data and Mapping

We will imagine that you have some data that tells you the amounts of different types of fruit. Say you want to add a new column to this data that says what kind of fruit each one is: citrus, berry, tropical, or pome. You can use a dictionary and the .map() function to add this information.

# Here is our made up data
data = pd.DataFrame({
    "food": ["orange", "blueberry", "orange",
             "banana", "strawberry", "orange",
             "banana", "apple", "blackberry"],
    "ounces": [4, -999, 12, 6, 7.5, 8, -999, 5, 6]
})

data
food ounces
0 orange 4.0
1 blueberry -999.0
2 orange 12.0
3 banana 6.0
4 strawberry 7.5
5 orange 8.0
6 banana -999.0
7 apple 5.0
8 blackberry 6.0
# Here is a dictionary mapping fruits to categories
food_to_category = {
    "orange": "citrus",
    "blueberry": "berry",
    "strawberry": "berry",
    "blackberry": "berry",
    "banana": "tropical",
    "apple": "pome"
}
# Now we will add the column
data['category'] = data['food'].map(food_to_category)
data
food ounces category
0 orange 4.0 citrus
1 blueberry -999.0 berry
2 orange 12.0 citrus
3 banana 6.0 tropical
4 strawberry 7.5 berry
5 orange 8.0 citrus
6 banana -999.0 tropical
7 apple 5.0 pome
8 blackberry 6.0 berry

Replacing Values

We have seen above how to replace NaN values, but what if there were other values you wanted to replace in the dataset? The .replace() function can replace any values you want with replacement values.

# Change the category given to banana
data.replace('tropical','berry',inplace=True)
data
food ounces category
0 orange 4.0 citrus
1 blueberry -999.0 berry
2 orange 12.0 citrus
3 banana 6.0 berry
4 strawberry 7.5 berry
5 orange 8.0 citrus
6 banana -999.0 berry
7 apple 5.0 pome
8 blackberry 6.0 berry
# Remove numbers that we know are wrong
# Sometimes NaN are coded with large negative or really unreasonable numbers
data.replace(-999,np.nan)
food ounces category
0 orange 4.0 citrus
1 blueberry NaN berry
2 orange 12.0 citrus
3 banana 6.0 berry
4 strawberry 7.5 berry
5 orange 8.0 citrus
6 banana NaN berry
7 apple 5.0 pome
8 blackberry 6.0 berry
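
You can also hand .replace() a list of values, or a dictionary mapping each old value to its replacement, to do several replacements in one call; a quick sketch:

# Replace several sentinel values at once
data.replace([-999, -1000], np.nan)
# Or give each value its own replacement via a dictionary
data.replace({-999: np.nan, 'citrus': 'CITRUS'})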

Renaming

There are lots of ways to rename both the column labels and the index labels in a DataFrame. Here are a few examples.

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
# We could change things directly
# Here we send in a dictionary that gives the label as a key and the new label as the value
# Columns
data.rename(columns = {'three':'THREE', 'four':'FOUR'}, inplace=True)
data
one two THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
# Rows
data.rename(index = {'Ohio':'California'}, inplace=True)
data
one two THREE FOUR
California 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
# We could also define a function that will transform the data through a map
def new_names(x):
    return x[:4].upper()

data.index = data.index.map(new_names)
data
one two THREE FOUR
CALI 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
# define a dictionary map
text_to_num = {'one':1,'two':2,'three':3,'four':4}
# define a function
def change_to_num(x):
    return text_to_num[x.lower()]

data.columns = data.columns.map(change_to_num)
data
1 2 3 4
CALI 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

You Try

Run the cell below to get your data. Then update the column and index names, using the renaming methods above, so that they are consistent and easy to use. Your choice for how you want the final labels to be!

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["red green", "Blue_green", "green  "],
                    columns=["ONE", "two", "3", "Four"])
data
ONE two 3 Four
red green 0 1 2 3
Blue_green 4 5 6 7
green 8 9 10 11
# Your code here

Discretization and Binning

Sometimes you will want to take continuous data and represent it as bins. This is often done in a histogram, but in data science maybe you want to define categories based on a continuous numerical variable. For example, maybe you want high, medium, and low income classes. This is where binning will help.

# Get the list of ages
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# Choose the edges of your bins
# Here the bins will be 18-25, 25-35, 35-60, and 60-100
bins = [18, 25, 35, 60, 100]
# Have pandas cut the data into bins
age_categories = pd.cut(ages, bins)
age_categories
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# Now put this data into a DataFrame
data = pd.DataFrame()
data['age'] = ages
data['range'] = age_categories
data
age range
0 20 (18, 25]
1 22 (18, 25]
2 25 (18, 25]
3 27 (25, 35]
4 21 (18, 25]
5 23 (18, 25]
6 37 (35, 60]
7 31 (25, 35]
8 61 (60, 100]
9 45 (35, 60]
10 41 (35, 60]
11 32 (25, 35]
# The object returned by .cut() has some other features
# You can look at the categories that were in the data set
age_categories.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
# You can also get category codes
age_categories.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
# Lets add the codes to the DataFrame
data['code'] = age_categories.codes
data
age range code
0 20 (18, 25] 0
1 22 (18, 25] 0
2 25 (18, 25] 0
3 27 (25, 35] 1
4 21 (18, 25] 0
5 23 (18, 25] 0
6 37 (35, 60] 2
7 31 (25, 35] 1
8 61 (60, 100] 3
9 45 (35, 60] 2
10 41 (35, 60] 2
11 32 (25, 35] 1

Notice that now we have both the range for the category and a code that puts them into discrete numerical groups.

In the range column the notation that you see is:

  • ( means exclusive - the endpoint is not included
  • ] means inclusive - the endpoint is included

so you would read the range \((18,25]\) to be ages greater than 18 and up to and including 25.
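
If you would rather have bins closed on the left, pd.cut() takes right=False; you can also pass labels= to give the bins readable names. A quick sketch:

# Bins closed on the left instead: [18, 25), [25, 35), ...
pd.cut(ages, bins, right=False)
# Name the bins yourself (one label per bin)
pd.cut(ages, bins, labels=['youth', 'young adult', 'middle aged', 'senior'])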

# You can also give pandas a number of bins to use and 
# it will compute equal length bins to put your data into
pd.cut(ages,4)
[(19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], ..., (30.25, 40.5], (50.75, 61.0], (40.5, 50.75], (40.5, 50.75], (30.25, 40.5]]
Length: 12
Categories (4, interval[float64, right]): [(19.959, 30.25] < (30.25, 40.5] < (40.5, 50.75] < (50.75, 61.0]]
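
A related function, pd.qcut(), cuts at sample quantiles instead of equal-width edges, so each bin ends up with roughly the same number of observations; a quick sketch:

# Four quantile-based bins - roughly equal counts in each
pd.qcut(ages, 4)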

You Try

Run the cell below to create a random list of numbers to represent ages in your population. Then make up your own age range categories (at least 5) and use .cut() to break the data into discrete categories. Create a data frame that contains the age, the age range, and the age category code.

ages = [np.random.randint(15,100) for i in range(40)]
# Your code here

Detecting Outliers

Sometimes you want to be able to detect outliers in a dataset. This process can involve a variety of operations and is highly dependent on how you define outliers in your data. Here is an example data set that randomly draws values from a standard normal distribution:

data = pd.DataFrame(np.random.standard_normal((1000, 1)))
data.describe()
0
count 1000.000000
mean -0.028999
std 1.022936
min -3.284727
25% -0.706225
50% 0.014594
75% 0.642945
max 3.360357
# You could write a function
# These can sometimes get complicated
def check_outlier(x,q1 = data.quantile(0.25)[0],q3 = data.quantile(0.75)[0]):
    '''
    Check whether a single value x is an outlier based on the
    interquartile range. The quartiles q1 and q3 are computed from
    the data once, as default arguments.

    NOTE: When using a function as a map, you can only pass in one
    variable, so the quartiles come in through the default arguments.
    '''
    IQR = q3 - q1
    # Determine bounds
    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    if x>upper_bound:
        return True
    elif x<lower_bound:
        return True
    else:
        return False

# And then apply a map
data['outlier']=data.map(check_outlier)
data
0 outlier
0 0.158828 False
1 -0.651900 False
2 0.663735 False
3 0.062495 False
4 -0.187718 False
... ... ...
995 0.655867 False
996 -0.673719 False
997 -0.958994 False
998 0.407369 False
999 -1.239003 False

1000 rows × 2 columns

# Check how many outliers we have
data['outlier'].value_counts()
outlier
False    990
True      10
Name: count, dtype: int64
# Mask out the outliers
data = data[data['outlier'] != True]
data
0 outlier
0 0.158828 False
1 -0.651900 False
2 0.663735 False
3 0.062495 False
4 -0.187718 False
... ... ...
995 0.655867 False
996 -0.673719 False
997 -0.958994 False
998 0.407369 False
999 -1.239003 False

990 rows × 2 columns
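
The elementwise map above works, but for a numeric column a vectorized boolean mask applies the same IQR rule much faster; a minimal sketch:

# Vectorized version of the same IQR outlier rule
s = pd.Series(np.random.standard_normal(1000))
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outlier_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
s_clean = s[~outlier_mask]   # keep only the non-outliers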

# You could also just choose a cutoff value
# and then overwrite anything outside that cutoff
# here is our data
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
display(data)

# here is our update: values with absolute value above 1 become +/- 2
data[data.abs() > 1] = np.sign(data) * 2
display(data)
0 1 2 3
0 -1.168379 -0.506053 -1.229291 -0.482408
1 1.032789 0.366741 -0.106486 -0.740361
2 0.591173 -0.800954 0.240408 -0.298359
3 0.577323 -0.121570 -0.102918 2.115196
4 0.135363 -0.331980 -0.637856 0.243791
... ... ... ... ...
995 1.681474 0.150533 -0.484748 0.759062
996 1.019167 0.175946 -0.514699 0.038862
997 -0.865868 -0.412088 1.174607 -0.246017
998 0.018156 -0.132022 -0.874302 0.454300
999 0.388876 0.284984 -1.110518 -1.463894

1000 rows × 4 columns

0 1 2 3
0 -2.000000 -0.506053 -2.000000 -0.482408
1 2.000000 0.366741 -0.106486 -0.740361
2 0.591173 -0.800954 0.240408 -0.298359
3 0.577323 -0.121570 -0.102918 2.000000
4 0.135363 -0.331980 -0.637856 0.243791
... ... ... ... ...
995 2.000000 0.150533 -0.484748 0.759062
996 2.000000 0.175946 -0.514699 0.038862
997 -0.865868 -0.412088 2.000000 -0.246017
998 0.018156 -0.132022 -0.874302 0.454300
999 0.388876 0.284984 -2.000000 -2.000000

1000 rows × 4 columns

You Try

Explain in great detail what each of the lines in the above cell did to both create and then update the data frame.

# Your WORDS here - change this cell to markdown

Permutation and Random Sampling

Often when doing a data science project you will want to take random samples of your data. This might be to help you avoid bias in the ordering of your data. It might be to create model training and testing data sets. Or maybe your data set is too big and you want to start with a smaller subset of the data. There are lots of ways to do this:

  • NUMPY - has random.permutation() which will give you an array of integers in a range, permuted (rearranged) randomly.
  • PANDAS - has a function .sample() that can take a sample from a DataFrame or Series.
  • Other Packages - later this semester we will see other packages like sklearn that can create test-train splits of your data (a quick pandas-only sketch follows this list).
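
Here is a minimal pandas-only sketch of a train/test split; the 80/20 split, the fixed random_state, and the df_demo name are just choices for illustration:

# Split a DataFrame into random train (80%) and test (20%) pieces
df_demo = pd.DataFrame({'x': range(10)})
train = df_demo.sample(frac=0.8, random_state=42)   # random 80% of the rows
test = df_demo.drop(train.index)                    # everything that is left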
# Here is some fake data
df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))
df
0 1 2 3 4 5 6
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
# Notice how this data is really ordered - maybe this is not good for your analysis
# We will grab the rows in a different order randomly
num_rows = 5
sample = np.random.permutation(num_rows)
print(sample)

# Now get the rows in that order
df.take(sample)
[4 0 3 2 1]
0 1 2 3 4 5 6
4 28 29 30 31 32 33 34
0 0 1 2 3 4 5 6
3 21 22 23 24 25 26 27
2 14 15 16 17 18 19 20
1 7 8 9 10 11 12 13
# We can also grab the columns in a different order
num_cols = df.shape[1] 
sample = np.random.permutation(num_cols)
print(sample)

# Now get the columns in that order
df.take(sample, axis=1)
[6 0 4 2 3 1 5]
6 0 4 2 3 1 5
0 6 0 4 2 3 1 5
1 13 7 11 9 10 8 12
2 20 14 18 16 17 15 19
3 27 21 25 23 24 22 26
4 34 28 32 30 31 29 33
# From within pandas we can get a sub-sample of our data
df.sample(n=3)
0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
# Sometimes you don't mind choosing the same row twice
# replace=True lets you sample the same row more than once
df.sample(n=30, replace=True)
0 1 2 3 4 5 6
3 21 22 23 24 25 26 27
2 14 15 16 17 18 19 20
1 7 8 9 10 11 12 13
0 0 1 2 3 4 5 6
2 14 15 16 17 18 19 20
1 7 8 9 10 11 12 13
3 21 22 23 24 25 26 27
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
2 14 15 16 17 18 19 20
4 28 29 30 31 32 33 34
4 28 29 30 31 32 33 34
2 14 15 16 17 18 19 20
1 7 8 9 10 11 12 13
0 0 1 2 3 4 5 6
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
4 28 29 30 31 32 33 34
4 28 29 30 31 32 33 34
2 14 15 16 17 18 19 20
4 28 29 30 31 32 33 34
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
4 28 29 30 31 32 33 34
0 0 1 2 3 4 5 6

Dummy Variables

Dummy variables are variables that take the place of something in your data set. They are especially useful for encoding categorical data. In these cases you replace a column of categories with several columns of 0s and 1s representing whether or not (True/False) each observation belongs to each category. Here is an example:

Let's say we have some categorical data that we want to interact with numerically. Below you will see the key column that contains a, b, c. Let's generate dummy variables for this data.

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})
df
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
dummies = pd.get_dummies(df["key"])
dummies
a b c
0 False True False
1 False True False
2 True False False
3 False False True
4 True False False
5 False True False
dummies = pd.get_dummies(df["key"],dtype=float)
dummies
a b c
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
5 0.0 1.0 0.0
dummies = pd.get_dummies(df["key"],dtype=int)
dummies
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
# Now add this to the original data frame
df_new = df.join(dummies)
df_new
key data1 a b c
0 b 0 0 1 0
1 b 1 0 1 0
2 a 2 1 0 0
3 c 3 0 0 1
4 a 4 1 0 0
5 b 5 0 1 0

Now we have three new columns each of which represents a true/false for whether or not the data had that key.
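
One more option worth knowing: for modeling, one dummy column is always determined by the others, so pd.get_dummies() offers drop_first=True to drop the first category. A quick sketch:

# Drop the first category (a) - its value is implied by the b and c columns
pd.get_dummies(df['key'], dtype=int, drop_first=True)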

Here is a slightly more complicated example.

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000. Thanks to Shyong Lam and Jon Herlocker for cleaning up and generating the data set. See README for more information

mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("data/movielens/movies.dat", sep="::",
                       header=None, names=mnames, engine="python")
movies[:10]
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller

Notice how we have a bunch of genres. Maybe we want to break this up into categories that can be used mathematically - meaning we need numbers instead of words. The pandas .get_dummies() command can help us here!

First let's get the counts for the different category combinations:

movies['genres'].value_counts()
genres
Drama                              843
Comedy                             521
Horror                             178
Comedy|Drama                       162
Comedy|Romance                     142
                                  ... 
Drama|Film-Noir                      1
Comedy|Horror|Sci-Fi                 1
Adventure|Drama|Romance|Sci-Fi       1
Adventure|Animation|Sci-Fi           1
Adventure|Crime|Sci-Fi|Thriller      1
Name: count, Length: 301, dtype: int64

Notice how there are 301 different category combinations. Why so many? Well, if we look at the names, some movies are part of multiple categories, and those categories are separated by the | character. How could we find out the number of genres?

# I will do this by writing a quick for loop
genre_set = set()
for g in movies['genres']:
    for n in g.split('|'):
        genre_set.add(n)

print(len(genre_set))
genre_set
18
{'Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}

Okay so really there are 18 total genres. Let’s use pandas to break this into dummy variables.

dummies = movies['genres'].str.get_dummies("|")
dummies
Action Adventure Animation Children's Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3878 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3879 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3880 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3881 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3882 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0

3883 rows × 18 columns

Notice that this looks at the ‘genres’ element, breaks up the string by the | character, and then assigns it a 1 in any category that it belongs to. If we look at the first row, which represents Toy Story, we can see this movie belongs to the categories: Animation, Children’s, Comedy.

Let’s add these dummy variables to our original DataFrame

# The .add_prefix() function adds a name to the beginning of the column
# This helps specify which variable the dummy is representing
movies_new = movies.join(dummies.add_prefix('genre_'))
movies_new
movie_id title genres genre_Action genre_Adventure genre_Animation genre_Children's genre_Comedy genre_Crime genre_Documentary ... genre_Fantasy genre_Film-Noir genre_Horror genre_Musical genre_Mystery genre_Romance genre_Sci-Fi genre_Thriller genre_War genre_Western
0 1 Toy Story (1995) Animation|Children's|Comedy 0 0 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 Jumanji (1995) Adventure|Children's|Fantasy 0 1 0 1 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
2 3 Grumpier Old Men (1995) Comedy|Romance 0 0 0 0 1 0 0 ... 0 0 0 0 0 1 0 0 0 0
3 4 Waiting to Exhale (1995) Comedy|Drama 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 5 Father of the Bride Part II (1995) Comedy 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3878 3948 Meet the Parents (2000) Comedy 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3879 3949 Requiem for a Dream (2000) Drama 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3880 3950 Tigerland (2000) Drama 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3881 3951 Two Family House (2000) Drama 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3882 3952 Contender, The (2000) Drama|Thriller 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

3883 rows × 21 columns

String Manipulation

One of the most common things you will have to do is interact with data that is formatted as strings (words). This can happen because of the way you saved your data (maybe everything got turned into strings) or because of the way the data was originally entered.

  1. Simple strings to numbers
  2. Object methods

Anything that is entered into Python with quotes or read in as a string will be assumed to be a string. Even though the number below is clearly, to our human minds, a number, Python sees it as a word.

number = '3'
number
'3'
# We can change it into an integer
int(number)
3
# We can change it into a float
float(number)
3.0
# We can change a number back into a string
str(3.0)
'3.0'

If we start with a more complicated string, there are lots of things we can do to alter it. The .split() function can split up a string however you want! It turns the string into a list of substrings.

val = "a,b,  guido"
val.split(",")
['a', 'b', '  guido']
val.split('  ')
['a,b,', 'guido']
# Notice how the extra whitespace is annoying here!
string_list = val.split(',')
print(string_list)

# The .strip() will get rid of any white space
# Here is an example of a for loop inside a list literal
# This is called a list comprehension
new_string_list = [x.strip() for x in string_list]
print(new_string_list)
['a', 'b', '  guido']
['a', 'b', 'guido']
# Maybe we then want to put these pieces together
# the .join() command will cycle through the list and join up the strings
# with the given string
'.'.join(new_string_list)
'a.b.guido'
# You can check for membership
print('guido' in new_string_list)
print('g' in 'guido')
True
True
# You can find things inside a string
my_string = 'Hello World'
print(my_string.find('W'))
my_string[6]
6
'W'

You Try

Break the following string up into a list of strings using string manipulation functions. See if you can create a list like this:

['Joanna','Bieri','Redlands','Keep up the good work!']

Try to get all the capitals and spacing correct!

NOTE - lots of different processes will result in this final list, there is not one right way to do this!

a_string = 'joanna_bieri@redlands.edu says:   Keep up the good work!'
# Your code here

There are so many different string methods! Here are some of my favorites:

  • .replace(old text, new text) replace text
  • .rstrip() strip from the right end of the string
  • .lstrip() strip from the left end of the string
  • .lower() make the string all lower case
  • .upper() make the string all upper case
  • .title() capitalize each first letter

For more advanced string manipulation you can use regular expressions (regex) through Python's re module:

import re

Look in the book or online for more information.
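
For example, a regex can split on any run of whitespace in a single call; a tiny sketch:

# Split on one or more whitespace characters (spaces, tabs, newlines)
re.split(r"\s+", "foo   bar\t baz")   # ['foo', 'bar', 'baz']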

Many of the string methods are also implemented directly in pandas, through the .str accessor, and can be applied directly to Series data.
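
A quick sketch of the .str accessor (these mirror the built-in string methods but work elementwise and skip over NaN):

s = pd.Series(['Apple Pie', 'banana bread', np.nan])
s.str.lower()             # lower-case each string; NaN stays NaN
s.str.split(' ')          # split each string into a list of words
s.str.contains('bread')   # elementwise True/False membership test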

Categorical Data

Pandas has a built-in data type called Categorical. It lets you encode certain columns or parts of your data as categorical, rather than just an assortment of strings. Categorical data has the advantage of using less memory and being better for sorting and organizing data. Here is an example:

# Simulate a column with repeated strings
# Do not turn the data into categorical - keep it as strings
n = 1_000_000
df = pd.DataFrame({
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], size=n)
})

df
city
0 New York
1 Los Angeles
2 New York
3 New York
4 New York
... ...
999995 New York
999996 Chicago
999997 New York
999998 New York
999999 New York

1000000 rows × 1 columns

# Now create a column with the same data, but tell pandas it is categorical
df['city_cat'] = df['city'].astype('category')
df
city city_cat
0 New York New York
1 Los Angeles Los Angeles
2 New York New York
3 New York New York
4 New York New York
... ... ...
999995 New York New York
999996 Chicago Chicago
999997 New York New York
999998 New York New York
999999 New York New York

1000000 rows × 2 columns

# Look at the memory usage of each column
print(df['city'].memory_usage(deep=True))
print(df['city_cat'].memory_usage(deep=True))
57667564
1000413
# Let's get the list of cities and order them reverse alphabetically
levels = list(df['city'].value_counts().keys().sort_values(ascending=False))
levels
['New York', 'Los Angeles', 'Chicago']
# Now create a categorical data column that includes the levels as categories
# and tells pandas that the categories are ordered.
df['city_cat_levels'] = pd.Categorical(df['city'], categories=levels, ordered=True)
df
city city_cat city_cat_levels
0 New York New York New York
1 Los Angeles Los Angeles Los Angeles
2 New York New York New York
3 New York New York New York
4 New York New York New York
... ... ... ...
999995 New York New York New York
999996 Chicago Chicago Chicago
999997 New York New York New York
999998 New York New York New York
999999 New York New York New York

1000000 rows × 3 columns

# Because we made our city_cat_levels column categorical and assigned levels,
# we can now compare using < and >
# and assign an ordering to our data
print(df['city_cat_levels'] > 'Los Angeles')  # True for 'Chicago'
0         False
1         False
2         False
3         False
4         False
          ...  
999995    False
999996     True
999997    False
999998    False
999999    False
Name: city_cat_levels, Length: 1000000, dtype: bool
# The standard comparison only gives you the option to check alphabetically
print(df['city'] > 'Los Angeles')
0          True
1         False
2          True
3          True
4          True
          ...  
999995     True
999996    False
999997     True
999998     True
999999     True
Name: city, Length: 1000000, dtype: bool
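
Finally, categorical columns expose their codes and categories through the .cat accessor, much like .str for strings; a quick sketch:

# Integer codes for each observation (0 = 'New York' given our levels)
df['city_cat_levels'].cat.codes
# The category labels themselves, in their defined order
df['city_cat_levels'].cat.categories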