Hello World!
Please come to office hours to get help!
Be patient and kind with yourself and others. Sometimes getting started can be overwhelming, DON’T GIVE UP, you can do this and I can help!
I can be REALLY FLEXIBLE with deadlines in the first few weeks of class, so if you run into technology issues please don’t panic.
OLD 1. Make sure you can Fork and Clone the Day2 repo from Redlands-DATA101
NEW 1. Make sure you Pull any new content from the class repo, then Copy it over into your working directory.
The first program you write in almost any language is called Hello World.
You can type this code into the notebook or copy and paste it and then run the cell (press play or use SHIFT-ENTER).
print('Hello World')
Hello World
Q. How would you make Python print your name? Try making changes!
Now that you have officially programmed something in Python, let’s start doing Data Science!
Python code is organized into modules, which are bundled together into packages. When you want to use certain programs (functions), you first need to install the package and then import it.
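As a small illustration of what importing looks like, the sketch below uses `math`, which ships with Python and needs no installation. It is only an example and is not one of the packages used in this class.

```python
# math is part of Python's standard library, so no install step is needed
import math

# Call a function from the module using the module name as a prefix
root = math.sqrt(16)
print(root)

# An alias shortens the prefix, the same pattern as `import numpy as np`
import math as m
print(m.floor(3.7))
```

Packages that do not ship with Python follow the same import pattern, but they have to be installed first.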
You will need to install packages. Here is our first set of packages. Copy and paste this into your notebook and run the cell (press play or use SHIFT-ENTER).
### This will take a while to run - just let it go.
!conda install -y numpy
!conda install -y pandas
!conda install -y matplotlib
!conda install -y plotly
!conda install -y itables
!conda install -y statsmodels
!conda install -y -c conda-forge python-kaleido
At the top of every JupyterLab Notebook, you will see a bunch of package imports. You are basically telling Python what extra functions you will need. You can just run this cell (press play or use SHIFT-ENTER).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show
Below is code to explore our first data set. This exploration is from https://datasciencebox.org/ and the original author of this content (written in R) is Mine Çetinkaya-Rundel. I have updated it for our class and translated the code to Python.
Introduction
How do various countries vote in the United Nations General Assembly, how have their voting patterns evolved throughout time, and how similarly or differently do they view certain issues? Answering these questions (at a high level) is the focus of this analysis.
Data
The data we’re using originally come from the unvotes R package. This package provides the voting history of countries in the United Nations General Assembly, and the original data can be found HERE.
The data in the .csv (comma-separated values) file has been joined in R to help with the analysis.
The code below grabs the data from the internet and saves it as a pandas DataFrame named DF.
# Note this takes about a minute to run
file_location = 'https://joannabieri.com/introdatascience/data/unvotes.csv'
DF = pd.read_csv(file_location)
years = [int(d.split('-')[0]) for d in DF['date']]
DF['year'] = years
DF = DF.drop('Unnamed: 0',axis=1)
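To see what the year-extraction line does, here is a minimal sketch on a tiny made-up table. The three dates are invented for illustration, and the sketch assumes dates are stored as `YYYY-MM-DD` text, as the `split('-')` call above suggests.

```python
import pandas as pd

# Hypothetical sample data, not taken from unvotes.csv
toy = pd.DataFrame({'date': ['1946-01-01', '1987-11-30', '2019-12-18']})

# Same idea as the class code: split on '-' and keep the first piece
years = [int(d.split('-')[0]) for d in toy['date']]
toy['year'] = years
print(toy)

# An alternative that lets pandas parse the dates for us
years_alt = pd.to_datetime(toy['date']).dt.year.tolist()
print(years_alt)
```

Both approaches give the same list of years here. The list-comprehension version is easier to read step by step, while the pandas version would also cope with other date formats.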
::: {#cell-Show data .cell execution_count=4}
show(DF)
rcid | country | country_code | vote | session | importantvote | date | unres | amend | para | short | descr | short_name | issue | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
:::
In the table above, see if you can look up information from a UN Country - just type it into the search bar. See if you can answer the following questions:
Q. How many columns are there?
Q. Can you guess what each column data represents? Try to figure it out, but it’s okay if right now your answer is “No Idea!”
Harder Q. How many different countries are there in the data set?
Harder Q. How many rows are there in the data set?
*Note: questions marked Harder Q. would be hard to answer without more help from Python.
::: {#cell-How many countries? Part 1 .cell execution_count=5}
# Python can list all the different countries:
country_list = list(DF['country'].unique())

# Show the data in a nice way
show(pd.DataFrame(country_list,columns=['country']))
country |
---|
:::
# Python can count up the number of countries.
# Find the length of the list
print(len(country_list))
200
# Optional - Make a nice print statement
print(f'There are {len(country_list)} countries in our data set!')
There are 200 countries in our data set!
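pandas also has shortcuts that help with both Harder Q. questions. The sketch below uses a small made-up table rather than the real data, so the numbers it prints are not the answers for the UN data set.

```python
import pandas as pd

# Hypothetical sample data for illustration only
toy = pd.DataFrame({
    'country': ['Turkey', 'Turkey', 'France', 'Peru'],
    'vote': ['yes', 'no', 'yes', 'abstain'],
})

# .shape is a pair: (number of rows, number of columns)
n_rows, n_cols = toy.shape

# .nunique() counts the distinct values in a column
n_countries = toy['country'].nunique()

print(f'{n_rows} rows, {n_cols} columns, {n_countries} countries')
```

Swapping `toy` for `DF` would apply the same two tools to the class data.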
You Try
See if you can figure out how to change the code below to look at the column labeled “issue”. The goal is to see a list of all the issues. Change the ??? part of the code.
issues_list = list(DF[???].unique())
issues_list
Let’s create a data visualisation that displays how the voting record of the US changed over time on a variety of issues, and compares it to two other countries: UK and Turkey.
We can easily change which countries are being plotted by changing which countries the code below filters for. Note that the country name should be spelled and capitalized exactly the same way as it appears in the data.
The code below sets the list of countries to plot, collects the list of unique issues, and groups the data by country and issue:
countries = ['Turkey', 'United States', 'United Kingdom']
issues = list(DF['issue'].unique())
c_groups = DF.groupby(['country','issue'])
print(issues)
['Human rights', 'Economic development', 'Colonialism', 'Palestinian conflict', 'Arms control and disarmament', 'Nuclear weapons and nuclear material']
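Grouping by two columns can feel abstract, so here is a minimal sketch of the same `groupby` and `get_group` pattern on a tiny made-up table. The rows are invented for illustration and do not come from the UN data.

```python
import pandas as pd

# Hypothetical sample data for illustration only
toy = pd.DataFrame({
    'country': ['Turkey', 'Turkey', 'France', 'France'],
    'issue': ['Colonialism', 'Human rights', 'Colonialism', 'Colonialism'],
    'vote': ['yes', 'no', 'yes', 'no'],
})

groups = toy.groupby(['country', 'issue'])

# Each group is identified by a (country, issue) pair
print(list(groups.groups.keys()))

# get_group pulls out the rows for one pair as a smaller DataFrame
france_col = groups.get_group(('France', 'Colonialism'))
print(france_col)
```

Because the grouping uses two columns, the key passed to `get_group` is a pair in the same order: country first, then issue.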
Now make a pretty picture
There is some more complicated code here to create a beautiful picture, but for now all you need to do is run the code. As the semester goes on we will learn how to make our own beautiful pictures!
def make_plot(countries,issue):
    '''
    A Python function that takes in a list of countries and a single issue and makes
    a scatter plot for that issue with a trendline for each country.
    '''
    x_data = []
    y_data = []
    c_data = []
    for cntry in countries:
        # Rows for this country on this issue
        my_group = c_groups.get_group((cntry,issue))
        for y in my_group['year'].unique():
            x_data.append(y)
            # Percent of this year's votes that were 'yes'
            tot_yes = sum(my_group[my_group['year']==y]['vote']=='yes')
            percent_yes = tot_yes/len(my_group[my_group['year']==y])*100

            y_data.append(percent_yes)
            c_data.append(cntry)
    fig = px.scatter(x=x_data, y=y_data,color=c_data,trendline="lowess",labels={"color": "Country"})

    fig.update_layout(
        title={
            'text': issue + '<br>',
            'y':0.9,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'})
    fig.update_yaxes(title_text="% Yes")
    fig.update_xaxes(title_text="Year")
    # Save a copy of the figure as a .png named after the issue
    f_name = issue
    fig.write_image(f_name+'.png')

    fig.show()

# Make one plot per issue
for iss in issues:
    make_plot(countries,iss)
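The percent-yes calculation inside make_plot can be tried on its own, without any plotting. The sketch below uses a tiny made-up table of votes (not the UN data) and shows the loop version next to a shorter pandas version that is offered only as one possible alternative.

```python
import pandas as pd

# Hypothetical sample data for illustration only
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001],
    'vote': ['yes', 'no', 'yes', 'no', 'no'],
})

# Loop version, mirroring the steps in make_plot
percent_by_year = {}
for y in toy['year'].unique():
    this_year = toy[toy['year'] == y]
    tot_yes = sum(this_year['vote'] == 'yes')
    percent_by_year[int(y)] = tot_yes / len(this_year) * 100

print(percent_by_year)

# Shorter version: the mean of True/False values is the fraction of True
percent_alt = (toy['vote'] == 'yes').groupby(toy['year']).mean() * 100
print(percent_alt)
```

In the made-up data, two of the three votes in 2000 are yes and none of the votes in 2001 are, so both versions report about 66.7% and 0%.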