Intermediate Data Science

Data Wrangling: Join, Combine, Reshape

Author

Joanna Bieri
DATA201

Data Wrangling

Data Wrangling is the art of managing data that might be spread across many files or databases. It also involves organizing data that comes to you in an inconvenient format. We are going to explore some ways that Pandas can help us in organizing our data!

# Some basic package imports
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

Hierarchical Indexing

Hierarchical indexing is a feature of pandas that allows you to have more than one index on a single axis in a DataFrame. This is like working with data in a table but allowing it to be higher dimensional. Here is the example from our book:

# The index of this Series is a list of lists
data = pd.Series(np.random.uniform(size=9),
                 index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a  1    0.236333
   2    0.414026
   3    0.846063
b  1    0.032594
   3    0.883162
c  1    0.938927
   2    0.266740
d  2    0.873309
   3    0.767932
dtype: float64
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

Notice here that the index now has two levels. You can reach in and grab everything inside the ‘a’ grouping and then, within that, choose 1, 2, or 3. We can use partial indexing to grab different parts of the data.

# To get a specific entry you select both index values
data['a',1]
np.float64(0.23633307419404725)
# You can grab all of an outer index
data['a']
1    0.236333
2    0.414026
3    0.846063
dtype: float64
# You can grab multiple outer index values
data[['a','c']]
a  1    0.236333
   2    0.414026
   3    0.846063
c  1    0.938927
   2    0.266740
dtype: float64
# You can select from an inner level index with a : meaning all of the outer index values
data.loc[:,1]
a    0.236333
b    0.032594
c    0.938927
dtype: float64
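
You can also slice a range of outer-level labels. A quick sketch (this relies on the outer level being sorted, as it is here):

# You can slice ranges of the outer index when it is sorted
data.loc["b":"c"]
b  1    0.032594
   3    0.883162
c  1    0.938927
   2    0.266740
dtype: float64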

Let’s see what happens when we put this data into a DataFrame. Notice how we have only one column and hierarchical indexing on the left side.

pd.DataFrame(data)
0
a 1 0.236333
2 0.414026
3 0.846063
b 1 0.032594
3 0.883162
c 1 0.938927
2 0.266740
d 2 0.873309
3 0.767932

We can use the .unstack() method to pivot this Series into a DataFrame. Unstack takes the inner index and sends it to separate columns, while keeping the outer index as the DataFrame index.

data.unstack()
1 2 3
a 0.236333 0.414026 0.846063
b 0.032594 NaN 0.883162
c 0.938927 0.266740 NaN
d NaN 0.873309 0.767932

You can also .stack() to go the other way, pivoting the columns back into an inner row index and returning a hierarchically indexed Series.

data.unstack().stack()
a  1    0.236333
   2    0.414026
   3    0.846063
b  1    0.032594
   3    0.883162
c  1    0.938927
   2    0.266740
d  2    0.873309
   3    0.767932
dtype: float64
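
One caveat: in this round trip, stack() dropped the NaN values that unstack() introduced, which is why we got exactly the original nine entries back. In the classic pandas behavior you can keep the missing slots with the dropna flag (newer pandas versions are changing this default, so treat it as version dependent):

# Keep the missing combinations instead of dropping them (classic behavior)
data.unstack().stack(dropna=False)
a  1    0.236333
   2    0.414026
   3    0.846063
b  1    0.032594
   2         NaN
   3    0.883162
c  1    0.938927
   2    0.266740
   3         NaN
d  1         NaN
   2    0.873309
   3    0.767932
dtype: float64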

In a DataFrame either the columns or the rows can have hierarchical levels.

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
# The keys are two dimensional
frame.keys()
MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           )
# The indices are two dimensional
frame.index
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
# How do I get data out of this thing?
frame['Ohio','Green']
a  1    0
   2    3
b  1    6
   2    9
Name: (Ohio, Green), dtype: int64
frame['Ohio','Green'].loc['a',2]
np.int64(3)
# Right now the columns don't have names
frame.columns.names
FrozenList([None, None])
# And the index labels don't have names
frame.index.names
FrozenList([None, None])
# Let's add some names
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]
frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

You try

What does frame.unstack() do in this case? Go ahead and run the command and see if you can understand the results.

# Code and/or words here

Reordering and Sorting Levels

In the example above the levels were state then color for the columns, and a/b then number for the rows. So it is easy to select by the outer level, and a bit harder to get at the inner levels.

frame['Ohio']
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
# This would give a key error because it expects outer level first
# frame['Green']

So if we wanted to look at just the green data, we would need to reorder the levels, making color outer and state inner.

new_frame = frame.swaplevel('state','color', axis=1)
new_frame
color Green Red Green
state Ohio Ohio Colorado
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
new_frame['Green']
state Ohio Colorado
key1 key2
a 1 0 2
2 3 5
b 1 6 8
2 9 11
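
The sorting half of this section's title is handled by sort_index. After the swap the two Green columns are not adjacent; sorting on the color level groups them together:

# Sort the columns by the outer (color) level
new_frame.sort_index(level='color', axis=1)
color Green Red
state Colorado Ohio Ohio
key1 key2
a 1 2 0 1
2 5 3 4
b 1 8 6 7
2 11 9 10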

You Try

How would you swap the index keys? See if you can swap key1 and key2 in the new_frame.

# Your code here

Summary Statistics by Level

If we run statistics on the DataFrame as a whole, it ignores the levels and applies the operation to the whole thing. This might be okay in some instances, but in the DataFrame above maybe we want to sum the “a” and “b” groupings separately.

# Good old fashioned sum
frame.sum()
state     color
Ohio      Green    18
          Red      22
Colorado  Green    26
dtype: int64
# Grouped sum
frame.groupby(level='key1').sum()
state Ohio Colorado
color Green Red Green
key1
a 3 5 7
b 15 17 19

The grouped sum lets us look at sums within a specific index level. We could also group by the columns! If we just transpose the DataFrame, then the column names become the index names!

frame.T
key1 a b
key2 1 2 1 2
state color
Ohio Green 0 3 6 9
Red 1 4 7 10
Colorado Green 2 5 8 11
frame.T.groupby(level="color").sum()
key1 a b
key2 1 2 1 2
color
Green 2 8 14 20
Red 1 4 7 10

Combining and Merging Datasets

Sometimes in your analysis you will want to grab data from more than one file, or maybe you scrape data from more than one website. In these cases you need to be able to merge the data into a single dataset for analysis. There are a few great Pandas commands for this:

  • pd.merge() - connects the rows in separate DataFrames based on one or more keys. It implements the database join operations.
  • pd.concat() - this is short for concatenate. Concatenate stacks objects along an axis. For example, stacking rows to add more observations or stacking columns to add more variables to the existing observations.
  • combine_first() - splices together overlapping data to fill in missing values in one object with values from another.

Pandas Merge

Merge connects separate DataFrames based on comparing keys (or column labels). There are different merge types available:

  • inner is the most restrictive and only includes cases where the keys match across both datasets.
  • left includes all entries in the left dataset and only those that match from the right dataset.
  • right includes all entries in the right dataset and only those that match from the left dataset.
  • outer includes all entries in both datasets.

We will start with two example DataFrames and explore the results for the different merge types.

employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department_id': [10, 20, 10, 30, 99]  # Eve is in unknown dept 99
})

# Departments DataFrame (has an extra dept with no employees)
departments = pd.DataFrame({
    'department_id': [10, 20, 30, 40],
    'department_name': ['Engineering', 'HR', 'Marketing', 'Sales']  # Dept 40 has no employees
})

display(employees)
display(departments)
employee_id name department_id
0 1 Alice 10
1 2 Bob 20
2 3 Charlie 10
3 4 Diana 30
4 5 Eve 99
department_id department_name
0 10 Engineering
1 20 HR
2 30 Marketing
3 40 Sales

Looking at these two datasets we see that there are 5 employees and 4 departments. The two DataFrames share the key department_id. You need a shared key or shared data to merge! You will also notice some mismatch between the datasets: none of our employees has department_id 40 (Sales), and one of our employees has department_id 99, which does not appear in our departments DataFrame.

Let's look at the merges below. Note: we are using employees as the left and departments as the right dataset. This could be switched. Both datasets have the department_id key!

# Inner - THIS IS DEFAULT if you don't choose how
# What data is missing? Do you see any Nan?
pd.merge(employees, departments, on='department_id', how='inner')
employee_id name department_id department_name
0 1 Alice 10 Engineering
1 2 Bob 20 HR
2 3 Charlie 10 Engineering
3 4 Diana 30 Marketing
# Left
# What data is missing? Do you see any Nan?
pd.merge(employees, departments, on='department_id', how='left')
employee_id name department_id department_name
0 1 Alice 10 Engineering
1 2 Bob 20 HR
2 3 Charlie 10 Engineering
3 4 Diana 30 Marketing
4 5 Eve 99 NaN
# Right
# What data is missing? Do you see any Nan?
pd.merge(employees, departments, on='department_id', how='right')
employee_id name department_id department_name
0 1.0 Alice 10 Engineering
1 3.0 Charlie 10 Engineering
2 2.0 Bob 20 HR
3 4.0 Diana 30 Marketing
4 NaN NaN 40 Sales
# Outer
# What data is missing? Do you see any Nan?
pd.merge(employees, departments, on='department_id', how='outer')
employee_id name department_id department_name
0 1.0 Alice 10 Engineering
1 3.0 Charlie 10 Engineering
2 2.0 Bob 20 HR
3 4.0 Diana 30 Marketing
4 NaN NaN 40 Sales
5 5.0 Eve 99 NaN
Merge Type | Includes All Employees | Includes All Departments | Notes
Inner      | ❌ Only matched        | ❌ Only matched          | Most restrictive
Left       | ✅ Yes                 | ❌ Only matched          | Focus on employees
Right      | ❌ Only matched        | ✅ Yes                   | Focus on departments
Outer      | ✅ Yes                 | ✅ Yes                   | Full outer view
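
If you ever need to audit which rows matched, pd.merge() also accepts indicator=True, which adds a _merge column labeling each row as both, left_only, or right_only:

# Outer merge with an audit column
pd.merge(employees, departments, on='department_id', how='outer', indicator=True)
employee_id name department_id department_name _merge
0 1.0 Alice 10 Engineering both
1 3.0 Charlie 10 Engineering both
2 2.0 Bob 20 HR both
3 4.0 Diana 30 Marketing both
4 NaN NaN 40 Sales right_only
5 5.0 Eve 99 NaN left_only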

How would you merge datasets if one had the correct data but not a matching key name? One option would be to change the column labels to match, but you can also give pd.merge() two different keys with left_on and right_on.

departments.rename(columns={'department_id':'dept_code'}, inplace=True)
display(departments)
dept_code department_name
0 10 Engineering
1 20 HR
2 30 Marketing
3 40 Sales
pd.merge(employees, departments, left_on='department_id', right_on='dept_code')
employee_id name department_id dept_code department_name
0 1 Alice 10 10 Engineering
1 2 Bob 20 20 HR
2 3 Charlie 10 10 Engineering
3 4 Diana 30 30 Marketing

You will notice that the merge resets the index. If you want to preserve the original index values, you may want to store them as a column in each dataset first. This will allow you to look up the original rows in the separate datasets if needed.
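
A minimal sketch of that idea (the column name orig_index is just our choice):

# Copy the current row labels into an ordinary column before merging
employees_with_idx = employees.reset_index().rename(columns={'index': 'orig_index'})
employees_with_idx.head(2)
orig_index employee_id name department_id
0 0 1 Alice 10
1 1 2 Bob 20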

Sometimes you need to use index values to do your merge. Maybe the matching data is in the index instead of a separate column.

departments = pd.DataFrame({
    'department_name': ['Engineering', 'HR', 'Marketing']
}, index=[10, 20, 30])

display(departments)
department_name
10 Engineering
20 HR
30 Marketing
pd.merge(employees,departments, left_on='department_id', right_index=True)
employee_id name department_id department_name
0 1 Alice 10 Engineering
1 2 Bob 20 HR
2 3 Charlie 10 Engineering
3 4 Diana 30 Marketing
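
For index-based matches like this, there is also the DataFrame.join() shortcut, which joins the caller's column(s) against the other frame's index and defaults to a left join (so Eve is kept, unlike the inner merge above):

# join matches employees.department_id against departments' index
employees.join(departments, on='department_id')
employee_id name department_id department_name
0 1 Alice 10 Engineering
1 2 Bob 20 HR
2 3 Charlie 10 Engineering
3 4 Diana 30 Marketing
4 5 Eve 99 NaN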

You Try

Merge the following data sets using all four ways: inner, left, right, and outer. See if you can predict before running the code what the output will be!

df_animals = pd.DataFrame({
    'animal_id': [1, 2, 3, 4],
    'name': ['Leo', 'Stripes', 'Spot', 'Fluffy'],
    'type': ['Lion', 'Tiger', 'Cheetah', 'Cat']
})

df_habitats = pd.DataFrame({
    'animal_id': [1, 2, 5, 4],
    'habitat': ['Savannah', 'Jungle', 'Mountains', 'Domestic'],
    'population_estimate': [25000, 3200, 120, 50000000]
})

display(df_animals)
display(df_habitats)
animal_id name type
0 1 Leo Lion
1 2 Stripes Tiger
2 3 Spot Cheetah
3 4 Fluffy Cat
animal_id habitat population_estimate
0 1 Savannah 25000
1 2 Jungle 3200
2 5 Mountains 120
3 4 Domestic 50000000
# Your prediction here
# Your code here
# Repeat

Hierarchical index values

When you have hierarchical index values or columns, things get more confusing, but merges are still possible!

left_data = pd.DataFrame({"key1": ["Ohio", "Ohio", "Ohio",
                               "Nevada", "Nevada"],
                      "key2": [2000, 2001, 2002, 2001, 2002],
                      "data": pd.Series(range(5), dtype="Int64")})
right_data_index = pd.MultiIndex.from_arrays(
    [
        ["Nevada", "Nevada", "Ohio", "Ohio", "Ohio", "Ohio"],
        [2001, 2000, 2000, 2000, 2001, 2002]
    ]
)

right_data = pd.DataFrame({"event1": pd.Series([0, 2, 4, 6, 8, 10], dtype="Int64",
                                           index=right_data_index),
                       "event2": pd.Series([1, 3, 5, 7, 9, 11], dtype="Int64",
                                           index=right_data_index)})

display(right_data)
display(left_data)
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
key1 key2 data
0 Ohio 2000 0
1 Ohio 2001 1
2 Ohio 2002 2
3 Nevada 2001 3
4 Nevada 2002 4

Now we need to indicate multiple columns to merge on as a list.

pd.merge(left_data, right_data, left_on=["key1", "key2"], right_index=True)
key1 key2 data event1 event2
0 Ohio 2000 0 4 5
0 Ohio 2000 0 6 7
1 Ohio 2001 1 8 9
2 Ohio 2002 2 10 11
3 Nevada 2001 3 0 1

Concatenate

When you need to stack new rows or columns onto an existing data set pd.concat() is a great way to do that.

Let’s imagine that we are working with the employee and department information above. Now suddenly HR sends us information about two new employees and data that contains all the salaries. They have confirmed that the salaries are in increasing order of the employee id. How do we get all this data into a single dataframe?

  1. Concat the new_hires onto the employees data
  2. Merge the employees and department data - keeping all information
  3. Concat the new columns onto the full data set.

When using concat you generally want the dimensions to line up (mismatches are filled with NaN):

  • axis=0 should have the same columns - you are adding rows
  • axis=1 should have the same number of rows - you are adding columns
new_hires = pd.DataFrame({
    'employee_id': [6, 7],
    'name': ['Joanna', 'Bella'],
    'department_id': [30, 20]
})

salaries = pd.DataFrame({
    'emp_num': ['emp_'+str(i+1) for i in range(7)],
    'salary': [60_000, 55_000, 62_000, 58_000, 500_000, 40_000, 40_000]
})

display(new_hires)
display(salaries)
employee_id name department_id
0 6 Joanna 30
1 7 Bella 20
emp_num salary
0 emp_1 60000
1 emp_2 55000
2 emp_3 62000
3 emp_4 58000
4 emp_5 500000
5 emp_6 40000
6 emp_7 40000
# Concat the new hires
# Here we ignore the old index values and reindex so the rows are 0-6
all_employees = pd.concat([employees, new_hires],ignore_index=True)
display(all_employees)
employee_id name department_id
0 1 Alice 10
1 2 Bob 20
2 3 Charlie 10
3 4 Diana 30
4 5 Eve 99
5 6 Joanna 30
6 7 Bella 20
# Merge the department data
all_employees = pd.merge(all_employees,departments,left_on='department_id',right_index=True,how='outer')
display(all_employees)
employee_id name department_id department_name
0 1 Alice 10 Engineering
2 3 Charlie 10 Engineering
1 2 Bob 20 HR
6 7 Bella 20 HR
3 4 Diana 30 Marketing
5 6 Joanna 30 Marketing
4 5 Eve 99 NaN
# Concat the salaries
full_data = pd.concat([all_employees, salaries], axis=1)
display(full_data)
employee_id name department_id department_name emp_num salary
0 1 Alice 10 Engineering emp_1 60000
2 3 Charlie 10 Engineering emp_3 62000
1 2 Bob 20 HR emp_2 55000
6 7 Bella 20 HR emp_7 40000
3 4 Diana 30 Marketing emp_4 58000
5 6 Joanna 30 Marketing emp_6 40000
4 5 Eve 99 NaN emp_5 500000
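
One thing to appreciate here: pd.concat() with axis=1 aligns rows by index label, not by position. The salary concat only lined up because all_employees kept its original 0 through 6 labels after the merge. A tiny sketch to convince yourself:

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [10, 20]}, index=[1, 0])
# Rows pair by label, so x=1 (label 0) lands next to y=20 (label 0)
pd.concat([a, b], axis=1)
x y
0 1 20
1 2 10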

Combining Data with Overlap

Sometimes you have two datasets that overlap, but one or both of them are incomplete, and you want to use one to fill in NaNs in the other. You can think of the combine_first() operation as patching up the data. It basically performs an if-else that inserts values from the other object wherever there are null values in the original dataset.

Here is a scenario where you have incomplete employee data: each of the datasets has missing values, and you want one complete dataset.

employee_profiles = pd.DataFrame({
    'name': ['Alice', None, 'Charlie'],
    'dept_code': [10, 20, None],
    'email': [None, 'bob@example.com', None]
}, index=[1, 2, 3])  

backup_profiles = pd.DataFrame({
    'name': ['Alice A.', 'Bob B.', 'Charlie C.', 'Diana D.'],
    'dept_code': [10, 20, 10, 30],
    'email': ['alice@example.com', None, 'charlie@example.com', 'diana@example.com'],
    'phone': ['111-1111', '222-2222', '333-3333', '444-4444']  
}, index=[1, 2, 3, 4]) 

display(employee_profiles)
display(backup_profiles)
name dept_code email
1 Alice 10.0 None
2 None 20.0 bob@example.com
3 Charlie NaN None
name dept_code email phone
1 Alice A. 10 alice@example.com 111-1111
2 Bob B. 20 None 222-2222
3 Charlie C. 10 charlie@example.com 333-3333
4 Diana D. 30 diana@example.com 444-4444
combined_profiles = employee_profiles.combine_first(backup_profiles)
display(combined_profiles)
dept_code email name phone
1 10.0 alice@example.com Alice 111-1111
2 20.0 bob@example.com Bob B. 222-2222
3 10.0 charlie@example.com Charlie 333-3333
4 30.0 diana@example.com Diana D. 444-4444

combine_first() does not overwrite existing data in the first DataFrame. It aligns on both index and column names, and mismatches are handled gracefully. You can think of it as a data patching tool.

combine_first() is perfect when:

  • You have a primary source of data that may be incomplete.
  • You have a secondary or backup source you want to use to fill in the blanks.
  • You want row-wise alignment based on index.

But what happens if the indexes don't align?
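
A small sketch of the answer: the result covers the union of the two indexes, the calling object wins wherever it has data, and labels that exist only in the backup are brought in whole:

s1 = pd.Series([1.0, np.nan], index=['x', 'y'])
s2 = pd.Series([10.0, 20.0], index=['y', 'z'])
s1.combine_first(s2)
x     1.0
y    10.0
z    20.0
dtype: float64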

Reshaping and Pivoting

The goal with reshaping and pivoting is to rearrange tabular data in a way that is better for your specific analysis. We have already seen some of these commands.

  • stack() rotates or pivots the columns in the data into the rows, producing hierarchical row indexes.
  • unstack() rotates or pivots the rows into the columns, producing hierarchical column indexes.
  • pivot() reshapes the data from long (tall) format to a wide format.
  • melt() reshapes the data from wide to long (tall), melting the column names into the data.

Long (Tall) Format

Also called “tidy” data in some contexts:

  • One row per observation.
  • Repeated categories or measurements in a single column.
  • More rows, fewer columns.
  • One row per person per test

Wide Format

  • Unique values in a categorical column become column headers.
  • More columns, fewer rows.
  • Easier for humans to read, but not always ideal for statistical analysis.
  • One row per person, one column per test

Below we will look at the ways we might rearrange or reshape our data!

df = pd.DataFrame({
    'person': ['Alice', 'Alice', 'Bob', 'Bob'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'sales': [200, 180, 210, 190],
    'expenses': [150, 120, 160, 140]
})

display(df)
person month sales expenses
0 Alice Jan 200 150
1 Alice Feb 180 120
2 Bob Jan 210 160
3 Bob Feb 190 140

Pivot

When you pivot a DataFrame you tell it which columns to use for the new dataset:

  • index - which column should be used as the new row (index) labels.
  • columns - which column should be used as the new column labels.
  • values - which column should be used to fill in the values of the new dataset.
pivoted = df.pivot(index='person',columns='month', values='sales')
display(pivoted)
month Feb Jan
person
Alice 180 200
Bob 190 210
pivoted = df.pivot(index='month',columns='person', values='expenses')
display(pivoted)
person Alice Bob
month
Feb 120 140
Jan 150 160
# You can specify more than one value to be added as hierarchical columns
pivoted = df.pivot(index='person',columns='month', values=['sales','expenses'])
display(pivoted)
sales expenses
month Feb Jan Feb Jan
person
Alice 180 200 120 150
Bob 190 210 140 160
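
One caveat: .pivot() raises a ValueError if any index/column pair appears more than once, since it would not know which value to keep. In that situation pd.pivot_table(), which aggregates duplicates (mean by default), is the usual alternative. A minimal sketch with made-up duplicate rows:

df_dup = pd.DataFrame({
    'person': ['Alice', 'Alice'],
    'month': ['Jan', 'Jan'],
    'sales': [200, 300]
})
# df_dup.pivot(...) would raise ValueError; pivot_table averages the duplicates
df_dup.pivot_table(index='person', columns='month', values='sales', aggfunc='mean')
month Jan
person
Alice 250.0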

You Try

Do a pivot on your merged animal data. You can decide how to pivot, but try to say before running the code what you expect to happen.

# Your prediction here
# Your code here

Stack

Stack takes the column labels and pivots them into the row index, shifting from wide to tall format. You will see one grouping of data for each index value. In many cases you will want to set a more meaningful index first to get a better breakdown of the data.

We go from wide data to tall data, meaning one observation per index per column.

# Stack without setting the index
stacked = df.stack()
display(stacked)
0  person      Alice
   month         Jan
   sales         200
   expenses      150
1  person      Alice
   month         Feb
   sales         180
   expenses      120
2  person        Bob
   month         Jan
   sales         210
   expenses      160
3  person        Bob
   month         Feb
   sales         190
   expenses      140
dtype: object
# Stack with index set
stacked = df.set_index(['person', 'month']).stack()
display(stacked)
person  month          
Alice   Jan    sales       200
               expenses    150
        Feb    sales       180
               expenses    120
Bob     Jan    sales       210
               expenses    160
        Feb    sales       190
               expenses    140
dtype: int64
# Adding the .reset_index() command to a series will send it to a dataframe again
# notice how the stacking is preserved in the row order.
stacked = df.set_index(['person', 'month']).stack().reset_index()
display(stacked)
person month level_2 0
0 Alice Jan sales 200
1 Alice Jan expenses 150
2 Alice Feb sales 180
3 Alice Feb expenses 120
4 Bob Jan sales 210
5 Bob Jan expenses 160
6 Bob Feb sales 190
7 Bob Feb expenses 140
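
The autogenerated labels level_2 and 0 are not very descriptive. One way to clean them up (the names metric and value here are our choice) is to name the index levels before resetting:

# Name all three index levels, then name the value column while resetting
tidy = (df.set_index(['person', 'month'])
          .stack()
          .rename_axis(['person', 'month', 'metric'])
          .reset_index(name='value'))
tidy.head(4)
person month metric value
0 Alice Jan sales 200
1 Alice Jan expenses 150
2 Alice Feb sales 180
3 Alice Feb expenses 120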

Unstack

Let's say you are given data with observations for each person, but what you want is a wide DataFrame with fewer rows and more categorical columns. This is what unstack can do!

Let’s start with some stacked data - hierarchical indexes

stacked = df.set_index(['person','month']).stack()
display(stacked)
person  month          
Alice   Jan    sales       200
               expenses    150
        Feb    sales       180
               expenses    120
Bob     Jan    sales       210
               expenses    160
        Feb    sales       190
               expenses    140
dtype: int64
# By default unstack moves the innermost index level
# (here the sales/expenses labels) into the columns.
stacked.unstack()
sales expenses
person month
Alice Feb 180 120
Jan 200 150
Bob Feb 190 140
Jan 210 160
# We can specify which level to use
# Here use the Person level index as the columns
stacked.unstack(level=0)
person Alice Bob
month
Feb sales 180 190
expenses 120 140
Jan sales 200 210
expenses 150 160
# Here use the month level index as the columns
stacked.unstack(level=1)
month Feb Jan
person
Alice sales 180 200
expenses 120 150
Bob sales 190 210
expenses 140 160
# Here use the innermost values as the columns
stacked.unstack(level=2)
sales expenses
person month
Alice Feb 180 120
Jan 200 150
Bob Feb 190 140
Jan 210 160
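
Since our index levels have names, you can also refer to a level by name instead of position, which reads more clearly:

# Same result as level=1 above
stacked.unstack(level='month')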

Melt

The melt command turns column names into values in the rows of the data. You choose identifier columns to keep fixed with id_vars (or use none); the remaining columns are melted into variable/value pairs.

df
person month sales expenses
0 Alice Jan 200 150
1 Alice Feb 180 120
2 Bob Jan 210 160
3 Bob Feb 190 140
# Use all the columns - not so useful in this case!
pd.melt(df)
variable value
0 person Alice
1 person Alice
2 person Bob
3 person Bob
4 month Jan
5 month Feb
6 month Jan
7 month Feb
8 sales 200
9 sales 180
10 sales 210
11 sales 190
12 expenses 150
13 expenses 120
14 expenses 160
15 expenses 140
# Melt to create new rows for each observation for each person each month
# You keep the columns 'person' and 'month' the rest are melted
pd.melt(df,id_vars=['person','month'])
person month variable value
0 Alice Jan sales 200
1 Alice Feb sales 180
2 Bob Jan sales 210
3 Bob Feb sales 190
4 Alice Jan expenses 150
5 Alice Feb expenses 120
6 Bob Jan expenses 160
7 Bob Feb expenses 140
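
melt also takes value_vars to melt only a subset of the columns, plus var_name and value_name to replace the generic variable and value labels:

# Melt just the sales column, with friendlier column names
pd.melt(df, id_vars=['person', 'month'], value_vars=['sales'],
        var_name='metric', value_name='amount')
person month metric amount
0 Alice Jan sales 200
1 Alice Feb sales 180
2 Bob Jan sales 210
3 Bob Feb sales 190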