Introduction to Data Science

Exam1 Python Commands Cheat Sheet

Author

Joanna Bieri
DATA101

Basic Imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.defaule = 'colab'

from itables import show

Loading Data using Pandas:

  • pd.read_csv()
  • pd.read_excell()
  • pd.read_html()
  • pd.read_json()
  • pd.read_table()

Inside the parenthesis you put the exact name/location of the data.

Code Examples

Loading Data from a Website with Tables
my_website = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
DF = pd.read_html(my_website)
Loading Data from a File on My computer
DF = pd.read_csv('mapdataall.csv')
Loading Data from a File on Joanna’s Website - this code is usually given to you
file_location = 'https://joannabieri.com/introdatascience/data/starwars.csv'
DF = pd.read_csv(file_location)

Data Frame Basics

  • .shape - find the rows and columns
  • .columns - get the column names
  • .describe - do some basic statistics
  • show() - print the data frame in a fancy way

Assume our data is stored in the variable DF.

Code Examples

Getting Rows = Observations and Columns = Variables
DF.shape
Get the Column Names
DF.columns
Use describe for basic statistics
DF.describe()
Show the Data in Pretty Format
show(DF)

Selecting Columns

  • Choose the columns you care about
  • Select those columns using DF[ ]
  • The list of columns goes between the square brackets

Code Examples

Select Just One Column
DF['name']
Select More than one Column
my_columns = ['name','hair_color', 'skin_color', 'eye_color']
DF[my_columns]

Plots - Plotly Express

  • px.histogram()
  • px.scatterplot()
  • px.box()
  • px.density_heatmap()
  • px.bar()

All plots have three basic parts:


  • Define the Basic Data and Plot Type
fig = px.histogram(DF,
                    x='column name')

  • Add some fancy stuff
fig.update_layout(title=' ',
                  title_x=0.5,
                  xaxis_title=" ",
                  yaxis_title=" ",
                  autosize=False,
                  width=800,
                  height=500)

  • Show the figure
fig.show()

For the examples below we are using the Starwars Data

show(DF)
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
Loading ITables v2.1.4 from the internet... (need help?)

Code Examples

Scatter Plot
fig = px.scatter(DF,
                 x='height',
                 y='mass',
                 hover_data='name',
                 color = 'sex')

fig.update_layout(title='Mass vs. Height of Starwars Characters',
                  title_x=0.5,
                  xaxis_title="Mass",
                  yaxis_title="Height",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
Histogram - colored by column
# Scatter Plot
fig = px.histogram(DF,
                 x='height',
                 nbins=10,
                 color = 'sex')

fig.update_layout(bargap=0.02,
                  title='Mass of Starwars Characters',
                  title_x=0.5,
                  yaxis_title="Frequency",
                  xaxis_title="Height",
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
Histogram - separated by facet column
# Scatter Plot
fig = px.histogram(DF,
                 x='height',
                 nbins=10,
                 facet_col= 'sex',
                 facet_col_wrap=2,
                  color='eye_color')

fig.update_layout(bargap=0.02,
                  title='Mass of Starwars Characters',
                  title_x=0.5,
                  xaxis_title="Frequency",
                  yaxis_title="Height",
                  autosize=False,
                  width=800,
                  height=800)

fig.show()

Categorical Data - value_counts and unique entries

  • DF[[columns of interest]].value_counts()
  • DF[[columns of interest]].unique()
  • DF[[columns of interest]].drop_duplicates()

All of these commands will give you information about how many categories are in a column (or columns).

Code Examples

Get the unique categories and count frequency
show(DF['sex'].value_counts())
Just get the unique values - remove duplicates
show(DF['sex'].drop_duplicates())
More than one column - this will return the unique combinations
my_columns = ['sex','eye_color']
DF[my_columns].drop_duplicates()

Sorting

  • DF.sort_values(by=‘sorting column name’)
  • DF.sort_values(by=‘sorting column name’,ascending=False)

You can technically leave out the by= command, but this sometimes helps me to understand the command. If you wanted to send in a list of columns, it can sort more than one column at a time. It will sort in the order the columns are given.

Code Examples

Sorting by height descending
DF.sort_values(by='height',ascending=False)

Boolean Masks

Basic Operators

Operator Definition
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== exactly equal to
!= not equal to

Advanced Operators

Operator Definition
and check if two things are both true
or check if one of two things is true
in checks if something is in another thing
! not checks if something is false

Operators that Combine Masks

Operator Definition
& checks that two masks are both true

when masking more than one condition, separate the conditions with parenthesis:

  • mask = (condition 1) & (condition 2) & (condition 3)
  • DF[mask]

Code Examples

Mask for a value not equal
mask = DF['homeworld'] != 'Tatooine'
DF[mask]
Mask for a value equal
mask = DF['homeworld'] == 'Tatooine'
DF[mask]
Mask a data frame and save the information in a new variable
mask = DF['height'] > 100
DF_tall = DF[mask].copy()
Masking more than one thing
mask = (DF['homeworld'] == 'Tatooine') & (DF['homeworld'] == 'Naboo') & (DF['homeworld'] == 'Alderaan')
DF_homeworld = DF[mask].copy()

Adding a Column to a Data Frame

To add a column you just enter a new column name into DF[] followed by a calculation

  • **DF[‘new_column_name’] = enter your calculation here

Code Examples

New column that adds two other columns
DF['height_and_mass'] = DF['height'] + DF['mass']
New column that keeps track of boolean mask
DF['is_human_female'] = (DF['species']=='Human') & (DF['sex']=='female')

Grouping Data by a Categorical Column

  • .groupby(‘column name’).function()

The groupby command just separates the data in Pythons memory. Then you apply a function to see the results. In this class we have been doing

my_columns = [list of columns] # must include the column you are grouping by
DF[my_columns).groupby('grouping column').describe()

You can use all sorts of functions at the end:

  • .min()
  • .max()
  • .mean()
  • .median()
  • .sum()
  • .prod()
  • .count()
  • .describe()

if you just want to see one group you can use

  • .get_group()

Code Examples

Group by one column do statistics on one other column
my_columns = ['sex','height']
DF[my_columns].groupby('sex').describe()
Group by one column and then look at one of the groups.
my_columns = ['sex','height']
DF[my_columns].groupby('sex').get_group('female')
Group by one column and then find the average of two other columns.
my_columns = ['sex','height','mass']
DF[my_columns].groupby('sex').mean()

Joining Data Frames

If you ahave two or more data frames that share a column you can join them together

  • pd.merge(DataFrame1, DataFrame2, on=‘shared column’, how=???)
    • how = “left” - DataFrame1 observations are used to set the shared column in the new DataFrame
    • how = “right” - DataFrame2 observations are used to set the shared column in the new DataFrame
    • how = “outer” - All possible observations are used to set the shared column in the new DataFrame
    • how = “inner” - Only observations that are in BOTH DataFrames are used to set the shared column in the new DataFrame

Code Examples - using fake data

Make the data
DF_fake1 = pd.DataFrame({'id':[1,2,3],'data1':['x1','x2','x3']})
DF_fake2 = pd.DataFrame({'id':[1,2,4],'data1':['y1','y2','y4']})

show(DF_fake1)
show(DF_fake2)
id data1
Loading ITables v2.1.4 from the internet... (need help?)
id data1
Loading ITables v2.1.4 from the internet... (need help?)
Merge using how=left - results will have ids = 1,2,3
pd.merge(DF_fake1, DF_fake2, on='id',how='left')
id data1_x data1_y
0 1 x1 y1
1 2 x2 y2
2 3 x3 NaN
Merge using how=right - results will have ids = 1,2,4
pd.merge(DF_fake1, DF_fake2, on='id',how='right')
id data1_x data1_y
0 1 x1 y1
1 2 x2 y2
2 4 NaN y4
Merge using how=outer - results will have ids = 1,2,3,4
pd.merge(DF_fake1, DF_fake2, on='id',how='outer')
id data1_x data1_y
0 1 x1 y1
1 2 x2 y2
2 3 x3 NaN
3 4 NaN y4
Merge using how=inner - results will have ids = 1,2
pd.merge(DF_fake1, DF_fake2, on='id',how='inner')
id data1_x data1_y
0 1 x1 y1
1 2 x2 y2

Using .appy and lambda x: to fix your data

  • DF[‘column name’].apply(lambda x: operation on x)
    • The .apply() command lets you apply a function to every row along your data frame.
    • lambda is a special function you first write the word lambda and choose a variable followed by a colon.

In the example below we are going to create a column in our data frame that counts the total number of films that a Starwars character has been in

Code Examples

Use apply and lambda to break up string data
DF['number_films'] = DF['films'].apply(lambda x: len(x.split(',')))
''' 
Why does this work???

Here we are looking at each piece of data in the flims column.
The data looks like this: 
"A New Hope, The Empire Strikes Back, Return of the Jedi, Revenge of the Sith"

When we do x.split(',') this returns a list: 
["A New Hope", "The Empire Strikes Back", "Return of the Jedi", "Revenge of the Sith"]

We find the length of this list to count up the movies!

'''
DF[['name','number_films']]
Use apply and lambda to convert weight from kg to lbs
DF['weight'] = DF['mass'].apply(lambda x: x*2.20462)