Introduction to Data Science

Hello World!

Author

Joanna Bieri
DATA101

Important Information

Announcements

Please come to office hours to get help!

Be patient and kind with yourself and others. Getting started can be overwhelming, but DON’T GIVE UP: you can do this, and I can help!

I can be REALLY FLEXIBLE with deadlines in the first few weeks of class, so if you run into technology issues please don’t panic.

Day 2 Assignment

OLD 1. Make sure you can Fork and Clone the Day2 repo from Redlands-DATA101

NEW 1. Make sure you Pull any new content from the class repo, then Copy it over into your working directory.

  1. Open the file Day2-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  2. Get as far as you can before class.
  3. Submit what you have so far: Commit and Push to Git.
  4. Take the daily check in quiz on Canvas.
  5. Come to class with lots of questions!

Hello World:

The first program you write in almost any language is called Hello World.

You can type this code into the notebook or copy and paste it and then run the cell (press play or use SHIFT-ENTER).

print('Hello World')
Hello World

Q. How would you make Python print your name? Try making changes!


Installing modules:

Now that you have officially programmed something in Python, let’s start doing Data Science!

Much of what Python can do lives in add-on packages (also called modules). To use the functions in a package, you first install it (once) and then import it (in every notebook that needs it).

You will need to install packages. Here is our first set of packages. Copy and paste this into your notebook and run the cell (press play or use SHIFT-ENTER).

### This will take a while to run - just let it go.
!conda install -y numpy
!conda install -y pandas
!conda install -y matplotlib
!conda install -y plotly
!conda install -y itables
!conda install -y statsmodels
!conda install -y -c conda-forge python-kaleido
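
If you are not sure whether the installs worked, you can ask Python which packages it can find. Here is a minimal sketch using the standard library's importlib.util.find_spec. The helper name is_installed is made up for this example, and it assumes that the python-kaleido package is imported under the name kaleido.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a module with this import name."""
    return importlib.util.find_spec(module_name) is not None

# Import names for the packages installed above
# (assumption: python-kaleido is imported as 'kaleido').
packages = ['numpy', 'pandas', 'matplotlib', 'plotly',
            'itables', 'statsmodels', 'kaleido']

for name in packages:
    status = 'installed' if is_installed(name) else 'MISSING'
    print(f'{name}: {status}')
```

Anything listed as MISSING probably needs its install line run again.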

Importing Packages:

At the top of every JupyterLab Notebook, you will see a bunch of package imports. You are basically telling Python what extra functions you will need. You can just run this cell (press play or use SHIFT-ENTER).

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default='colab'

from itables import show
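
The word "as" in an import gives the package a short nickname, and "from ... import ..." pulls in a single function by name. Here is a small sketch of both patterns using Python's built-in math and statistics modules. They are used here only because they need no installation; they are not part of the class imports.

```python
import math as m              # 'm' is now a nickname for the math module
from statistics import mean   # import just one function by name

root = m.sqrt(16)             # call sqrt through the nickname
average = mean([1, 2, 3, 4])  # call mean directly, no prefix needed

print(root)
print(average)
```

This is the same pattern as np for numpy and pd for pandas: after "import pandas as pd", every Pandas function is written with the prefix pd.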

Let’s do some Data Science!

Below is code to explore our first data set. This exploration is from https://datasciencebox.org/ and the original author of this content (written in R) is Mine Çetinkaya-Rundel. I have updated it for our class and translated the code to Python.

Introduction

How do various countries vote in the United Nations General Assembly, how have their voting patterns evolved throughout time, and how similarly or differently do they view certain issues? Answering these questions (at a high level) is the focus of this analysis.

Data

The data we’re using originally come from the unvotes R package. This package provides the voting history of countries in the United Nations General Assembly and the original data can be found HERE.

The data in the .csv (comma separated value) file has been joined in R to help with the analysis.

The code below grabs the data from the internet and saves it as a Pandas DataFrame named DF.

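# Note this takes about a minute to run
file_location = 'https://joannabieri.com/introdatascience/data/unvotes.csv'
DF = pd.read_csv(file_location)
# Pull the year out of each date string (the part before the first '-')
years = [int(d.split('-')[0]) for d in DF['date']]
DF['year'] = years
# Drop the unnamed first column (typically a leftover saved index)
DF = DF.drop('Unnamed: 0',axis=1)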
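
The line that builds the years list packs several steps into one. Here is a minimal sketch of those steps on a few made-up date strings (the dates are invented for illustration and assume a year-month-day layout): split each string at the dashes, take the first piece, and turn it into an integer.

```python
# Made-up example dates in year-month-day form
dates = ['1946-01-01', '1987-12-07', '2019-11-15']

# Step by step for the first date
pieces = dates[0].split('-')   # split the text at each dash
first_year = int(pieces[0])    # first piece, converted from text to a number

# The same thing for every date at once (a list comprehension)
years = [int(d.split('-')[0]) for d in dates]

print(pieces)
print(first_year)
print(years)
```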

show(DF)
rcid country country_code vote session importantvote date unres amend para short descr short_name issue year

Initial Data Exploration

In the table above, see if you can look up information about a UN country: just type its name into the search bar. See if you can answer the following questions:

Q. How many columns are there?

Q. Can you guess what the data in each column represents? Try to figure it out, but it’s okay if right now your answer is “No Idea!”

Harder Q. How many different countries are there in the data set?

Harder Q. How many rows are there in the data set?

*Note - things listed as Harder Q. are questions that would be hard to answer without more help from Python.

# Python can list all the different countries:
country_list = list(DF['country'].unique())

# Show the data in a nice way
show(pd.DataFrame(country_list,columns=['country']))

# Python can count up the number of countries.
# Find the length of the list
print(len(country_list))
200
# Optional - Make a nice print statement
print(f'There are {len(country_list)} countries in our data set!')
There are 200 countries in our data set!
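
The other Harder Q. (how many rows?) can be answered in a similar way. Here is a minimal sketch on a tiny made-up DataFrame named toy (the values are invented for illustration, not taken from the UN data). len() counts rows, .shape gives (rows, columns), and .nunique() counts the distinct values in a column.

```python
import pandas as pd

# A tiny made-up data set, just for practice
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'C'],
    'vote': ['yes', 'no', 'yes', 'abstain'],
})

n_rows = len(toy)                       # number of rows
shape = toy.shape                       # (rows, columns)
n_countries = toy['country'].nunique()  # number of distinct countries

print(n_rows)
print(shape)
print(n_countries)
```

The same three commands should work on DF once it is loaded.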

You Try

See if you can figure out how to change the code below to look at the column labeled “issue”. The goal is to see a list of all the issues. Change the ??? part of the code.

issues_list = list(DF[???].unique())
issues_list

Data Visualization

Let’s create a data visualization that displays how the voting record of the US changed over time on a variety of issues, and compares it to two other countries: UK and Turkey.

We can easily change which countries are being plotted by changing the names in the countries list in the code below. Note that the country name should be spelled and capitalized exactly the same way as it appears in the data.

The code below does the following:

  • Selects the three countries we are focusing on.
  • Gets the list of all the issues in the data set.
  • Groups the data set into subsections for each country and issue.
countries = ['Turkey', 'United States', 'United Kingdom']
issues = list(DF['issue'].unique())
c_groups = DF.groupby(['country','issue'])
print(issues)
['Human rights', 'Economic development', 'Colonialism', 'Palestinian conflict', 'Arms control and disarmament', 'Nuclear weapons and nuclear material']
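
To get a feel for what grouping by country and issue does, here is a plain-Python sketch of the same idea on a few made-up rows. This is only a conceptual illustration, not how Pandas does it internally: each row is dropped into a bucket labeled by its (country, issue) pair, and looking up one bucket plays the role that get_group plays later on.

```python
from collections import defaultdict

# Made-up rows: (country, issue, vote)
rows = [
    ('Turkey', 'Colonialism', 'yes'),
    ('United States', 'Colonialism', 'no'),
    ('Turkey', 'Human rights', 'yes'),
    ('Turkey', 'Colonialism', 'no'),
]

# One bucket (a list of votes) for each (country, issue) pair
groups = defaultdict(list)
for country, issue, vote in rows:
    groups[(country, issue)].append(vote)

# Look up a single bucket by its label
turkey_colonialism = groups[('Turkey', 'Colonialism')]

print(turkey_colonialism)
print(len(groups))
```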

Now make a pretty picture

There is some more complicated code here to create a beautiful picture, but for now all you need to do is run the code. As the semester goes on we will learn how to make our own beautiful pictures!

def make_plot(countries,issue):
    '''
    A Python function that takes a list of countries and a single issue and
    makes a scatter plot of the yearly % yes votes, with a trendline for each
    country. The figure is also saved as a .png file named after the issue.
    '''
    x_data = []
    y_data = []
    c_data = []
    for cntry in countries:
        my_group = c_groups.get_group((cntry,issue))
        for y in my_group['year'].unique():
            x_data.append(y)
            tot_yes = sum(my_group[my_group['year']==y]['vote']=='yes')
            percent_yes = tot_yes/len(my_group[my_group['year']==y])*100
            y_data.append(percent_yes)
            c_data.append(cntry)

    fig = px.scatter(x=x_data, y=y_data,color=c_data,trendline="lowess",labels={"color": "Country"})

    fig.update_layout(
        title={
            'text': issue + '<br>',
            'y':0.9,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'})
    fig.update_yaxes(title_text="% Yes")
    fig.update_xaxes(title_text="Year")
    f_name = issue
    fig.write_image(f_name+'.png')
    fig.show()
    
for iss in issues:
    make_plot(countries,iss)
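
The key calculation inside make_plot is the percentage of yes votes in each year: the number of 'yes' votes divided by the total number of votes that year, times 100. Here is a minimal sketch of that arithmetic on made-up (year, vote) pairs, written in plain Python rather than Pandas. The helper name percent_yes and the votes are invented for this example.

```python
# Made-up votes for one country on one issue: (year, vote)
votes = [
    (1990, 'yes'), (1990, 'no'), (1990, 'yes'), (1990, 'yes'),
    (1991, 'no'), (1991, 'yes'),
]

def percent_yes(votes, year):
    """Percent of votes in the given year that were 'yes'."""
    this_year = [v for (y, v) in votes if y == year]
    tot_yes = sum(v == 'yes' for v in this_year)  # True counts as 1
    return tot_yes / len(this_year) * 100

# One percentage per year, in order
for year in sorted({y for (y, _) in votes}):
    print(year, percent_yes(votes, year))
```

Each (year, percentage) pair corresponds to one dot on the scatter plot for that country.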