Introduction to Data Science

Exam1 Python Commands Cheat Sheet

Author

Joanna Bieri
DATA101

Basic Imports

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.defaule = 'colab'

from itables import show

Loading Data using Pandas:

pd.read_csv()
pd.read_excell()
pd.read_html()
pd.read_json()
pd.read_table()

Inside the parenthesis you put the exact name/location of the data.

Code Examples

Loading Data from a Website with Tables

my_website = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
DF = pd.read_html(my_website)

Loading Data from a File on My computer

DF = pd.read_csv('mapdataall.csv')

Loading Data from a File on Joanna’s Website - this code is usually given to you

file_location = 'https://joannabieri.com/introdatascience/data/starwars.csv'
DF = pd.read_csv(file_location)

Data Frame Basics

.shape - find the rows and columns
.columns - get the column names
.describe - do some basic statistics
show() - print the data frame in a fancy way

Assume our data is stored in the variable DF.

Code Examples

Getting Rows = Observations and Columns = Variables

DF.shape

Get the Column Names

DF.columns

Use describe for basic statistics

DF.describe()

Show the Data in Pretty Format

show(DF)

Selecting Columns

Choose the columns you care about
Select those columns using DF[ ]
The list of columns goes between the square brackets

Code Examples

Select Just One Column

DF['name']

Select More than one Column

my_columns = ['name','hair_color', 'skin_color', 'eye_color']
DF[my_columns]

Plots - Plotly Express

px.histogram()
px.scatterplot()
px.box()
px.density_heatmap()
px.bar()

All plots have three basic parts:

Define the Basic Data and Plot Type

fig = px.histogram(DF,
                    x='column name')

Add some fancy stuff

fig.update_layout(title=' ',
                  title_x=0.5,
                  xaxis_title=" ",
                  yaxis_title=" ",
                  autosize=False,
                  width=800,
                  height=500)

Show the figure

fig.show()

For the examples below we are using the Starwars Data

show(DF)

name	height	mass	hair_color	skin_color	eye_color	birth_year	sex	gender	homeworld	species	films	vehicles	starships
Loading ITables v2.1.4 from the internet... (need help?)

Code Examples

Scatter Plot

fig = px.scatter(DF,
                 x='height',
                 y='mass',
                 hover_data='name',
                 color = 'sex')

fig.update_layout(title='Mass vs. Height of Starwars Characters',
                  title_x=0.5,
                  xaxis_title="Mass",
                  yaxis_title="Height",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

Histogram - colored by column

# Scatter Plot
fig = px.histogram(DF,
                 x='height',
                 nbins=10,
                 color = 'sex')

fig.update_layout(bargap=0.02,
                  title='Mass of Starwars Characters',
                  title_x=0.5,
                  yaxis_title="Frequency",
                  xaxis_title="Height",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

Histogram - separated by facet column

# Scatter Plot
fig = px.histogram(DF,
                 x='height',
                 nbins=10,
                 facet_col= 'sex',
                 facet_col_wrap=2,
                  color='eye_color')

fig.update_layout(bargap=0.02,
                  title='Mass of Starwars Characters',
                  title_x=0.5,
                  xaxis_title="Frequency",
                  yaxis_title="Height",
                  autosize=False,
                  width=800,
                  height=800)

fig.show()

Categorical Data - value_counts and unique entries

DF[[columns of interest]].value_counts()
DF[[columns of interest]].unique()
DF[[columns of interest]].drop_duplicates()

All of these commands will give you information about how many categories are in a column (or columns).

Code Examples

Get the unique categories and count frequency

show(DF['sex'].value_counts())

Just get the unique values - remove duplicates

show(DF['sex'].drop_duplicates())

More than one column - this will return the unique combinations

my_columns = ['sex','eye_color']
DF[my_columns].drop_duplicates()

Sorting

DF.sort_values(by=‘sorting column name’)
DF.sort_values(by=‘sorting column name’,ascending=False)

You can technically leave out the by= command, but this sometimes helps me to understand the command. If you wanted to send in a list of columns, it can sort more than one column at a time. It will sort in the order the columns are given.

Code Examples

Sorting by height descending

DF.sort_values(by='height',ascending=False)

Boolean Masks

Basic Operators

Operator	Definition
<	less than
>	greater than
<=	less than or equal to
>=	greater than or equal to
==	exactly equal to
!=	not equal to

Advanced Operators

Operator	Definition
and	check if two things are both true
or	check if one of two things is true
in	checks if something is in another thing
!	not checks if something is false

Operators that Combine Masks

Operator	Definition
&	checks that two masks are both true

when masking more than one condition, separate the conditions with parenthesis:

mask = (condition 1) & (condition 2) & (condition 3)
DF[mask]

Code Examples

Mask for a value not equal

mask = DF['homeworld'] != 'Tatooine'
DF[mask]

Mask for a value equal

mask = DF['homeworld'] == 'Tatooine'
DF[mask]

Mask a data frame and save the information in a new variable

mask = DF['height'] > 100
DF_tall = DF[mask].copy()

Masking more than one thing

mask = (DF['homeworld'] == 'Tatooine') & (DF['homeworld'] == 'Naboo') & (DF['homeworld'] == 'Alderaan')
DF_homeworld = DF[mask].copy()

Adding a Column to a Data Frame

To add a column you just enter a new column name into DF[] followed by a calculation

**DF[‘new_column_name’] = enter your calculation here

Code Examples

New column that adds two other columns

DF['height_and_mass'] = DF['height'] + DF['mass']

New column that keeps track of boolean mask

DF['is_human_female'] = (DF['species']=='Human') & (DF['sex']=='female')

Grouping Data by a Categorical Column

.groupby(‘column name’).function()

The groupby command just separates the data in Pythons memory. Then you apply a function to see the results. In this class we have been doing

my_columns = [list of columns] # must include the column you are grouping by
DF[my_columns).groupby('grouping column').describe()

You can use all sorts of functions at the end:

.min()
.max()
.mean()
.median()
.sum()
.prod()
.count()
.describe()

if you just want to see one group you can use

.get_group()

Code Examples

Group by one column do statistics on one other column

my_columns = ['sex','height']
DF[my_columns].groupby('sex').describe()

Group by one column and then look at one of the groups.

my_columns = ['sex','height']
DF[my_columns].groupby('sex').get_group('female')

Group by one column and then find the average of two other columns.

my_columns = ['sex','height','mass']
DF[my_columns].groupby('sex').mean()

Joining Data Frames

If you ahave two or more data frames that share a column you can join them together

pd.merge(DataFrame1, DataFrame2, on=‘shared column’, how=???)
- how = “left” - DataFrame1 observations are used to set the shared column in the new DataFrame
- how = “right” - DataFrame2 observations are used to set the shared column in the new DataFrame
- how = “outer” - All possible observations are used to set the shared column in the new DataFrame
- how = “inner” - Only observations that are in BOTH DataFrames are used to set the shared column in the new DataFrame

Code Examples - using fake data

Make the data

DF_fake1 = pd.DataFrame({'id':[1,2,3],'data1':['x1','x2','x3']})
DF_fake2 = pd.DataFrame({'id':[1,2,4],'data1':['y1','y2','y4']})

show(DF_fake1)
show(DF_fake2)

id	data1
Loading ITables v2.1.4 from the internet... (need help?)

id	data1
Loading ITables v2.1.4 from the internet... (need help?)

Merge using how=left - results will have ids = 1,2,3

pd.merge(DF_fake1, DF_fake2, on='id',how='left')

	id	data1_x	data1_y
0	1	x1	y1
1	2	x2	y2
2	3	x3	NaN

Merge using how=right - results will have ids = 1,2,4

pd.merge(DF_fake1, DF_fake2, on='id',how='right')

	id	data1_x	data1_y
0	1	x1	y1
1	2	x2	y2
2	4	NaN	y4

Merge using how=outer - results will have ids = 1,2,3,4

pd.merge(DF_fake1, DF_fake2, on='id',how='outer')

	id	data1_x	data1_y
0	1	x1	y1
1	2	x2	y2
2	3	x3	NaN
3	4	NaN	y4

Merge using how=inner - results will have ids = 1,2

pd.merge(DF_fake1, DF_fake2, on='id',how='inner')

	id	data1_x	data1_y
0	1	x1	y1
1	2	x2	y2

Using .appy and lambda x: to fix your data

DF[‘column name’].apply(lambda x: operation on x)
- The .apply() command lets you apply a function to every row along your data frame.
- lambda is a special function you first write the word lambda and choose a variable followed by a colon.

In the example below we are going to create a column in our data frame that counts the total number of films that a Starwars character has been in

Code Examples

Use apply and lambda to break up string data

DF['number_films'] = DF['films'].apply(lambda x: len(x.split(',')))
''' 
Why does this work???

Here we are looking at each piece of data in the flims column.
The data looks like this: 
"A New Hope, The Empire Strikes Back, Return of the Jedi, Revenge of the Sith"

When we do x.split(',') this returns a list: 
["A New Hope", "The Empire Strikes Back", "Return of the Jedi", "Revenge of the Sith"]

We find the length of this list to count up the movies!

'''
DF[['name','number_films']]

Use apply and lambda to convert weight from kg to lbs

DF['weight'] = DF['mass'].apply(lambda x: x*2.20462)