Data Wrangling is the art of managing data that might be spread across many files or databases. It also involves organizing data that comes to you in an inconvenient format. We are going to explore some ways that Pandas can help us in organizing our data!
# Some basic package imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
Hierarchical Indexing
Hierarchical indexing is a feature of pandas that allows you to have more than one index on a single axis in a DataFrame. This is like working with data in a table but allowing it to be higher dimensional. Here is the example from our book:
# The index of this series is a list of lists, creating a hierarchical index
data = pd.Series(np.random.uniform(size=9),
                 index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a 1 0.236333
2 0.414026
3 0.846063
b 1 0.032594
3 0.883162
c 1 0.938927
2 0.266740
d 2 0.873309
3 0.767932
dtype: float64
Notice here that the index now has two levels. You can reach in and grab everything inside the 'a' grouping and then, within that, choose 1, 2, or 3. We can use partial indexing to grab different parts of the data.
# To get a specific entry you select both index values
data['a', 1]
np.float64(0.23633307419404725)
# You can grab all of an outer index
data['a']
1 0.236333
2 0.414026
3 0.846063
dtype: float64
# You can grab multiple outer index values
data[['a', 'c']]
a 1 0.236333
2 0.414026
3 0.846063
c 1 0.938927
2 0.266740
dtype: float64
# You can select from an inner level index with a : meaning all of the outer index values
data.loc[:, 1]
a 0.236333
b 0.032594
c 0.938927
dtype: float64
Let’s see what happens when we put this data into a DataFrame. Notice how we have only one column and hierarchical indexing on the left side.
pd.DataFrame(data)
            0
a 1  0.236333
  2  0.414026
  3  0.846063
b 1  0.032594
  3  0.883162
c 1  0.938927
  2  0.266740
d 2  0.873309
  3  0.767932
We can use the .unstack() function to reshape this data into a two-dimensional DataFrame. Unstack takes the inner index and sends it to separate columns, while keeping the outer index as the DataFrame index.
data.unstack()
          1         2         3
a  0.236333  0.414026  0.846063
b  0.032594       NaN  0.883162
c  0.938927  0.266740       NaN
d       NaN  0.873309  0.767932
You can also .stack() the data, sending it back to a Series with a hierarchical index.
data.unstack().stack()
a 1 0.236333
2 0.414026
3 0.846063
b 1 0.032594
3 0.883162
c 1 0.938927
2 0.266740
d 2 0.873309
3 0.767932
dtype: float64
In a DataFrame either the columns or the rows can have hierarchical levels.
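The frame used in the rest of this section is not defined above; here is a sketch of it, reconstructed from the book's example so that the values match the outputs shown below:

# Hierarchical levels on both the rows and the columns
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]
frame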
What does frame.unstack() do in this case? Go ahead and run the command and see if you can understand the results.
# Code and/or words here
Reordering and Sorting Levels
In the example above the column levels were state then color, and the row levels were key1 (a/b) then key2 (number). So it is easy to select by the outer level, and a bit harder to get at the inner levels.
frame['Ohio']
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
# This would give a key error because it expects the outer level first
# frame['Green']
So if we wanted to look at just the green data, we would need to reorder the levels, making color the outer level and state the inner.
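One way to do this, shown here as a minimal sketch, is swaplevel() on the column axis; xs() is an alternative that selects from an inner level directly:

# Make color the outer column level, then select Green
frame.swaplevel("state", "color", axis="columns")["Green"]

# Or select from the inner level without reordering
frame.xs("Green", axis="columns", level="color")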
How would you swap the row index levels? See if you can swap key1 and key2 to create the new_frame.
# Your code here
Summary Statistics by Level
If we run statistics on the DataFrame as a whole, it ignores the levels and applies the operation to the whole thing. This might be okay in some instances, but in the DataFrame above maybe we want to sum the "a" and "b" groupings separately.
# Good old fashioned sum
frame.sum()
state color
Ohio Green 18
Red 22
Colorado Green 26
dtype: int64
# Grouped sum
frame.groupby(level='key1').sum()
state  Ohio     Colorado
color Green Red    Green
key1                    
a         3   5        7
b        15  17       19
The grouped sum lets us look at the sums within specific index levels. We could also group by the column levels! If we just transpose the DataFrame, the column levels become index levels!
frame.T
key1             a       b    
key2             1   2   1   2
state    color                
Ohio     Green   0   3   6   9
         Red     1   4   7  10
Colorado Green   2   5   8  11
frame.T.groupby(level="color").sum()
key1    a       b    
key2    1   2   1   2
color                
Green   2   8  14  20
Red     1   4   7  10
Combining and Merging Datasets
Sometimes in your analysis you will want to grab data from more than one file, or maybe you scrape data from more than one website. In these cases you need to be able to merge the data into a single dataset for analysis. There are a few great Pandas commands for this:
pd.merge() - connects the rows in separate DataFrames based on one or more keys. It implements the database join operations.
pd.concat() - this is short for concatenate. Concatenate stacks objects along an axis. For example, stacking rows to add more observations, or stacking columns to add more variables to the existing observations.
combine_first() - splices together overlapping data to fill in missing values in one object with values from another.
Pandas Merge
Merge connects separate DataFrames based on comparing keys (or column labels). There are different merge types available:
inner is the most restrictive and only includes cases where the keys match across both datasets.
left includes all entries in the left dataset and only those that match from the right dataset.
right includes all entries in the right dataset and only those that match from the left dataset.
outer includes all entries in both datasets.
We will start with two example DataFrames and explore the results for the different merge types.
# Employees DataFrame
employees = pd.DataFrame({'employee_id': [1, 2, 3, 4, 5],
                          'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
                          'department_id': [10, 20, 10, 30, 99]})  # Eve is in unknown dept 99

# Departments DataFrame (has an extra dept with no employees)
departments = pd.DataFrame({'department_id': [10, 20, 30, 40],
                            'department_name': ['Engineering', 'HR', 'Marketing', 'Sales']})  # Dept 40 has no employees

display(employees)
display(departments)
   employee_id     name  department_id
0            1    Alice             10
1            2      Bob             20
2            3  Charlie             10
3            4    Diana             30
4            5      Eve             99

   department_id department_name
0             10     Engineering
1             20              HR
2             30       Marketing
3             40           Sales
Looking at these two datasets we see that there are 5 employees and 4 departments. The two DataFrames share the key department_id. You need a shared key or shared data to merge! You will also notice some mismatch between the datasets. For example, none of our employees are in department_id 40 (Sales), and one of our employees has department_id 99, which does not appear in our departments DataFrame.
Let's look at the merges below. Note: we are using employees as the left and departments as the right dataset. This could be switched. Both datasets have the department_id key!
# Inner - THIS IS THE DEFAULT if you don't choose how
# What data is missing? Do you see any NaN?
pd.merge(employees, departments, on='department_id', how='inner')
   employee_id     name  department_id department_name
0            1    Alice             10     Engineering
1            2      Bob             20              HR
2            3  Charlie             10     Engineering
3            4    Diana             30       Marketing
# Left
# What data is missing? Do you see any NaN?
pd.merge(employees, departments, on='department_id', how='left')
   employee_id     name  department_id department_name
0            1    Alice             10     Engineering
1            2      Bob             20              HR
2            3  Charlie             10     Engineering
3            4    Diana             30       Marketing
4            5      Eve             99             NaN
# Right
# What data is missing? Do you see any NaN?
pd.merge(employees, departments, on='department_id', how='right')
   employee_id     name  department_id department_name
0          1.0    Alice             10     Engineering
1          3.0  Charlie             10     Engineering
2          2.0      Bob             20              HR
3          4.0    Diana             30       Marketing
4          NaN      NaN             40           Sales
# Outer
# What data is missing? Do you see any NaN?
pd.merge(employees, departments, on='department_id', how='outer')
   employee_id     name  department_id department_name
0          1.0    Alice             10     Engineering
1          3.0  Charlie             10     Engineering
2          2.0      Bob             20              HR
3          4.0    Diana             30       Marketing
4          NaN      NaN             40           Sales
5          5.0      Eve             99             NaN
Merge Type | Includes All Employees | Includes All Departments | Notes
Inner      | ❌ Only matched         | ❌ Only matched           | Most restrictive
Left       | ✅ Yes                  | ❌ Only matched           | Focus on employees
Right      | ❌ Only matched         | ✅ Yes                    | Focus on departments
Outer      | ✅ Yes                  | ✅ Yes                    | Full outer view
How would you merge datasets if one had the correct data but not a matching key name? One option would be to rename the column labels to match, but you can also tell pd.merge() to use two differently named keys with left_on and right_on.
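A minimal sketch, using a hypothetical departments_alt frame whose key column has a different name:

# Suppose the departments data called its key 'dept_id' instead
departments_alt = departments.rename(columns={'department_id': 'dept_id'})

# The result keeps both key columns (department_id and dept_id)
pd.merge(employees, departments_alt,
         left_on='department_id', right_on='dept_id', how='inner')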
You will notice that when the merge happens it resets the index. If you want to preserve the index values, you may want to store them as a column in the dataset first. This will allow you to look up the original rows in the separate datasets if needed.
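A one-line sketch of that idea:

# The old index becomes a regular column named 'index'
employees_with_index = employees.reset_index()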
Sometimes you need to use index values to do your merge. Maybe the matching data is in the index instead of a separate column.
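A sketch of an index-based merge, here moving department_id into the index first:

# Match the department_id column on the left against the index on the right
dept_indexed = departments.set_index('department_id')
pd.merge(employees, dept_indexed,
         left_on='department_id', right_index=True, how='left')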
Merge the following data sets using all four ways: inner, left, right, and outer. See if you can predict before running the code what the output will be!
When you need to stack new rows or columns onto an existing data set, pd.concat() is a great way to do that.
Let’s imagine that we are working with the employee and department information above. Now suddenly HR sends us information about two new employees and data that contains all the salaries. They have confirmed that the salaries are in increasing order of the employee id. How do we get all this data into a single dataframe?
Concat the new_hires onto the employees data
Merge the employees and department data - keeping all information
Concat the new columns onto the full data set.
When using concat the dimensions must match!
axis=0 must have the same number of columns - you are adding rows
axis=1 must have the same number of rows - you are adding columns
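The new_hires and salaries frames used below are not defined in this section; here is a sketch of them, reconstructed to be consistent with the outputs that follow:

# Two new employees from HR
new_hires = pd.DataFrame({'employee_id': [6, 7],
                          'name': ['Joanna', 'Bella'],
                          'department_id': [30, 20]})

# Salaries, in increasing order of employee_id
salaries = pd.DataFrame({'emp_num': ['emp_' + str(i) for i in range(1, 8)],
                         'salary': [60000, 55000, 62000, 58000, 500000, 40000, 40000]})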
# Concat the new hires
# Here we ignore the old index values and reindex so the rows are 0-6
all_employees = pd.concat([employees, new_hires], ignore_index=True)
display(all_employees)
   employee_id     name  department_id
0            1    Alice             10
1            2      Bob             20
2            3  Charlie             10
3            4    Diana             30
4            5      Eve             99
5            6   Joanna             30
6            7    Bella             20
# Merge the department data
# right_index=True matches the left department_id column against the right
# frame's index, so department_id must be the index on the right
all_employees = pd.merge(all_employees,
                         departments.set_index('department_id'),
                         left_on='department_id',
                         right_index=True,
                         how='outer')
display(all_employees)
   employee_id     name  department_id department_name
0            1    Alice             10     Engineering
2            3  Charlie             10     Engineering
1            2      Bob             20              HR
6            7    Bella             20              HR
3            4    Diana             30       Marketing
5            6   Joanna             30       Marketing
4            5      Eve             99             NaN
# Concat the salaries
full_data = pd.concat([all_employees, salaries], axis=1)
display(full_data)
   employee_id     name  department_id department_name emp_num  salary
0            1    Alice             10     Engineering   emp_1   60000
2            3  Charlie             10     Engineering   emp_3   62000
1            2      Bob             20              HR   emp_2   55000
6            7    Bella             20              HR   emp_7   40000
3            4    Diana             30       Marketing   emp_4   58000
5            6   Joanna             30       Marketing   emp_6   40000
4            5      Eve             99             NaN   emp_5  500000
Combining Data with Overlap
Sometimes you have two datasets that overlap, but one or both of them are incomplete, and you want to use one to fill in NaNs in the other. You can think of the combine_first() operation as patching up the data. It basically does an if-else that inserts values from the second dataset wherever there are null values in the first.
Here is a scenario where you have employee data from two sources. Each dataset has missing values, and you want one complete dataset.
combine_first() does not overwrite existing data in the first DataFrame. It aligns on both index and column names — mismatches are handled gracefully. You can think of it as a data patching tool.
combine_first() is perfect when:
You have a primary source of data that may be incomplete.
You have a secondary or backup source you want to use to fill in the blanks.
You want row-wise alignment based on index.
But what happens if the indexes don't align?
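Here is a minimal sketch with two hypothetical frames that shows both behaviors: NaNs get patched, and the result index is the union of the two indexes, so rows that exist only in the backup get appended:

# Primary data with a gap at index 2
primary = pd.DataFrame({'salary': [60000, np.nan, 62000]}, index=[1, 2, 3])
# Backup overlaps on indexes 2 and 3 and has an extra row at index 4
backup = pd.DataFrame({'salary': [55000, 99999, 58000]}, index=[2, 3, 4])

# Index 2 is filled from backup, index 3 keeps 62000, and index 4 is added
primary.combine_first(backup)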
Reshaping and Pivoting
The goal with reshaping and pivoting is to rearrange tabular data in a way that is better for your specific analysis. We have already seen some of these commands.
stack() rotates or pivots the columns into the rows, producing a hierarchical index.
unstack() rotates or pivots the rows into the columns, turning an index level into column labels.
.pivot() reshapes the data from long (tall) format to a wide format.
melt() reshapes the data from wide to long (tall) format, melting the column names into the data.
Long (Tall) Format
Also called "tidy" data in some contexts:
One row per observation.
Repeated categories or measurements in a single column.
More rows, fewer columns.
One row per person per test
Wide Format
Unique values in a categorical column become column headers.
More columns, fewer rows.
Easier for humans to read, but not always ideal for statistical analysis.
One row per person, one column per test
Below we will look at the ways we might rearrange or reshape our data!
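The df used in the examples below is not defined in this section; here is a sketch of it, reconstructed from the display of df further down:

# Hypothetical sales data matching the outputs in this section
df = pd.DataFrame({'person': ['Alice', 'Alice', 'Bob', 'Bob'],
                   'month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'sales': [200, 180, 210, 190],
                   'expenses': [150, 120, 160, 140]})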
# You can specify more than one value column to be added as hierarchical columns
pivoted = df.pivot(index='person', columns='month', values=['sales', 'expenses'])
display(pivoted)
        sales      expenses     
month     Feb  Jan      Feb  Jan
person                          
Alice     180  200      120  150
Bob       190  210      140  160
You Try
Do a pivot on your merged animal data. You can decide how to pivot, but try to say before running the code what you expect to happen.
# Your prediction here
# Your code here
Stack
Stack shifts data from wide to tall format by moving the columns down into the row index. You will see one grouping of data for each index value. In many cases you might want to set the index to something more interesting first to get a better breakdown of the data.
We go from wide data to tall data, meaning one observation per index per column.
# Stack without setting the index
stacked = df.stack()
display(stacked)
0 person Alice
month Jan
sales 200
expenses 150
1 person Alice
month Feb
sales 180
expenses 120
2 person Bob
month Jan
sales 210
expenses 160
3 person Bob
month Feb
sales 190
expenses 140
dtype: object
# Stack with index set
stacked = df.set_index(['person', 'month']).stack()
display(stacked)
person month
Alice Jan sales 200
expenses 150
Feb sales 180
expenses 120
Bob Jan sales 210
expenses 160
Feb sales 190
expenses 140
dtype: int64
# Adding .reset_index() to a Series sends it back to a DataFrame
# Notice how the stacking is preserved in the row order
stacked = df.set_index(['person', 'month']).stack().reset_index()
display(stacked)
   person month   level_2    0
0   Alice   Jan     sales  200
1   Alice   Jan  expenses  150
2   Alice   Feb     sales  180
3   Alice   Feb  expenses  120
4     Bob   Jan     sales  210
5     Bob   Jan  expenses  160
6     Bob   Feb     sales  190
7     Bob   Feb  expenses  140
Unstack
Let's say you are given data with one row per observation per person, but what you want is a wide DataFrame, with fewer rows and more categorical columns. This is what unstack can do!
Let’s start with some stacked data - hierarchical indexes
person month
Alice Jan sales 200
expenses 150
Feb sales 180
expenses 120
Bob Jan sales 210
expenses 160
Feb sales 190
expenses 140
dtype: int64
# By default this moves the innermost index level (sales/expenses) into columns
stacked.unstack()
              sales  expenses
person month                 
Alice  Feb      180       120
       Jan      200       150
Bob    Feb      190       140
       Jan      210       160
# We can specify which level to use
# Here use the person level index as the columns
stacked.unstack(level=0)
person         Alice  Bob
month                    
Feb  sales       180  190
     expenses    120  140
Jan  sales       200  210
     expenses    150  160
# Here use the month level index as the columns
stacked.unstack(level=1)
month           Feb  Jan
person                  
Alice sales     180  200
      expenses  120  150
Bob   sales     190  210
      expenses  140  160
# Here use the innermost level values as the columns (same as the default)
stacked.unstack(level=2)
              sales  expenses
person month                 
Alice  Feb      180       120
       Jan      200       150
Bob    Feb      190       140
       Jan      210       160
Melt
The melt command moves column names down into the rows: each melted column becomes a set of rows, with the column name in a 'variable' column paired with its data in a 'value' column. You can choose which columns to melt, or melt them all.
df
  person month  sales  expenses
0  Alice   Jan    200       150
1  Alice   Feb    180       120
2    Bob   Jan    210       160
3    Bob   Feb    190       140
# Use all the columns - not so useful in this case!
pd.melt(df)
    variable     value
0     person     Alice
1     person     Alice
2     person       Bob
3     person       Bob
4      month       Jan
5      month       Feb
6      month       Jan
7      month       Feb
8      sales       200
9      sales       180
10     sales       210
11     sales       190
12  expenses       150
13  expenses       120
14  expenses       160
15  expenses       140
# Melt to create one row per person per month per measure
# You keep the columns 'person' and 'month'; the rest are melted
pd.melt(df, id_vars=['person', 'month'])
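As a closing sketch, pivot can undo the melt and restore the wide format (assuming pandas 1.1+, where pivot accepts a list of index columns):

# Round trip: melt to long format, then pivot back to wide
melted = pd.melt(df, id_vars=['person', 'month'])
melted.pivot(index=['person', 'month'], columns='variable', values='value').reset_index()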