import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show
# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Introduction to Data Science
Web Scraping in Python
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
NEXT WEEK - Data Ethics. This week you should be reading your book or articles.
Day 12 Assignment - same drill.
- Make sure you pull any new content from the class repo - then copy it over into your working directory.
- Open the file Day##-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far - Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
——————————
Web Scraping Ethical Issues
There are some things to be aware of before you start scraping data from the web.
Some data is private or protected. Just because you have access to a website's data doesn't mean you are allowed to scrape it. For example, when you log into Facebook or another social media site, you are granted special access to data about the people you are connected to. It is unethical to use that access to scrape their private data!
Some websites have rules against scraping and will cut off service to users who are clearly scraping data. How do they know? Web scrapers access the website very differently than regular users. If the site has a policy about scraping data then you should follow it and/or contact them about getting the data if you have a true academic interest in it.
The line between web scraping and plagiarism can be very blurry. Make sure that you are citing where your data comes from AND not just reproducing the data exactly. Always cite the source of your data and make sure you are doing something new with it.
Ethics are different depending on whether you are using the data for a personal project (e.g. you just want to check scores for your favorite team daily and print the stuff you care about) vs. using the project for your business or website (e.g. publishing information to drive clicks to your site/video/account or making money from the data you collect). In the latter case it is EXTRA important to respect the original owner of the data. Drive web traffic back to their site, check with them about using their data, etc.
The Ethical Scraper (from https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):
I, the web scraper will live by the following principles:
- If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping altogether.
- I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
- I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
- I will only save the data I absolutely need from your page. If all I need is OpenGraph meta-data, that’s all I’ll keep.
- I will respect any content I do keep. I’ll never pass it off as my own.
- I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
- I will respond in a timely fashion to your outreach and work with you towards a resolution.
- I will scrape for the purpose of creating new value from the data, not to duplicate it.
I am assuming all the examples below are for personal use.
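To make the "reasonable rate" and "User Agent" principles concrete, here is a minimal sketch of a polite request loop. The header text, the one-second pause, and the list of pages are example choices of mine, not rules from the article:

import time
import requests

# An identifying User-Agent string with a way to contact you (example text)
headers = {'User-Agent': 'Student data science project - contact joanna_bieri@redlands.edu'}

pages = ['https://www.scrapethissite.com/pages/simple/']   # example list of pages to visit
for url in pages:
    response = requests.get(url, headers=headers)   # identify yourself on every request
    time.sleep(1)                                   # wait between requests so you never look like a DDoS attack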
—————–
Using pandas to get table data.
We have already briefly seen this in action!
If the data on the website you are interested in is already written in a table then Pandas can grab that data and save it to a data frame.
Here is an example of how you could get data about a sports team.
I am a fan of the Las Vegas Aces Basketball team and I might want to know more about the stats of their players. Check out the website:
basketball-reference.com/wnba/teams/SAS/players.html
You can see that the data here is already in a table!
= "https://www.basketball-reference.com/wnba/teams/SAS/players.html"
my_website = pd.read_html(my_website) DF
# Let's see what Pandas got for us:
ACES = DF[0].copy()
ACES
Rk | Player | From | To | Yrs | Unnamed: 5 | G | MP | FG | FGA | ... | PTS | Unnamed: 22 | FG% | 3P% | FT% | Unnamed: 26 | MP.1 | PTS.1 | TRB.1 | AST.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Danielle Adams | 2011 | 2015 | 5 | NaN | 155 | 3247 | 624 | 1472 | ... | 1771 | NaN | 0.424 | 0.328 | 0.754 | NaN | 20.9 | 11.4 | 4.3 | 0.9 |
1 | 2 | Elisa Aguilar | 2002 | 2002 | 1 | NaN | 28 | 141 | 14 | 33 | ... | 43 | NaN | 0.424 | 0.524 | 0.571 | NaN | 5.0 | 1.5 | 0.4 | 0.6 |
2 | 3 | Kayla Alexander | 2013 | 2017 | 5 | NaN | 154 | 2038 | 278 | 555 | ... | 692 | NaN | 0.501 | NaN | 0.764 | NaN | 13.2 | 4.5 | 3.1 | 0.3 |
3 | 4 | Lindsay Allen | 2018 | 2020 | 2 | NaN | 45 | 642 | 56 | 139 | ... | 144 | NaN | 0.403 | 0.212 | 0.735 | NaN | 14.3 | 3.2 | 1.2 | 2.7 |
4 | 5 | Chantelle Anderson | 2005 | 2007 | 3 | NaN | 68 | 1168 | 152 | 310 | ... | 384 | NaN | 0.490 | NaN | 0.777 | NaN | 17.2 | 5.6 | 2.9 | 0.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
146 | 147 | Nevriye Yilmaz | 2004 | 2004 | 1 | NaN | 7 | 77 | 6 | 24 | ... | 19 | NaN | 0.250 | 0.143 | 1.000 | NaN | 11.0 | 2.7 | 1.4 | 0.3 |
147 | 148 | Jackie Young | 2019 | 2024 | 6 | NaN | 199 | 5944 | 965 | 2075 | ... | 2685 | NaN | 0.465 | 0.386 | 0.852 | NaN | 29.9 | 13.5 | 4.0 | 4.0 |
148 | 149 | Sophia Young-Malcolm | 2006 | 2015 | 9 | NaN | 301 | 9258 | 1659 | 3545 | ... | 4300 | NaN | 0.468 | 0.223 | 0.718 | NaN | 30.8 | 14.3 | 6.0 | 1.8 |
149 | 150 | Tamera Young | 2018 | 2019 | 2 | NaN | 67 | 1509 | 197 | 495 | ... | 507 | NaN | 0.398 | 0.310 | 0.680 | NaN | 22.5 | 7.6 | 4.4 | 2.4 |
150 | 151 | Shanna Zolman | 2006 | 2009 | 3 | NaN | 87 | 1273 | 225 | 567 | ... | 635 | NaN | 0.397 | 0.402 | 0.811 | NaN | 14.6 | 7.3 | 1.1 | 0.8 |
151 rows × 31 columns
Ask some questions:
Do I know what every column means and what kind of data is in that column?
Are there certain columns I am interested in?
What are all these “Unnamed” columns and do I need them?
Can I come up with some Data focused questions?
ACES.shape
(151, 31)
ACES.columns
Index(['Rk', 'Player', 'From', 'To', 'Yrs', 'Unnamed: 5', 'G', 'MP', 'FG',
'FGA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'TRB', 'AST', 'STL', 'BLK',
'TOV', 'PF', 'PTS', 'Unnamed: 22', 'FG%', '3P%', 'FT%', 'Unnamed: 26',
'MP.1', 'PTS.1', 'TRB.1', 'AST.1'],
dtype='object')
columns = ['Unnamed: 5', 'Unnamed: 22', 'Unnamed: 26']
ACES[columns]
Unnamed: 5 | Unnamed: 22 | Unnamed: 26 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | NaN |
4 | NaN | NaN | NaN |
... | ... | ... | ... |
146 | NaN | NaN | NaN |
147 | NaN | NaN | NaN |
148 | NaN | NaN | NaN |
149 | NaN | NaN | NaN |
150 | NaN | NaN | NaN |
151 rows × 3 columns
Dealing with NaNs
So there is nothing in the unnamed columns! Are all of the rows NaN?
There are some really nice commands for dealing with NaN’s in your data:
First, NaNs are a strange data type (np.nan): they are considered a float - like a decimal. In most raw data sets NaN means no data was given for that observation and variable, but be careful - NaN can also show up if a calculation goes wrong (for example 0/0).
.isna() creates a mask showing whether or not there is a NaN in each entry of the data.
.fillna() will replace NaN in your data set with whatever you put inside the parentheses.
.dropna() will drop all rows that contain NaN - be careful with this command. You want to keep as much data as possible and .dropna() might delete too much!
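Here is a minimal sketch of these commands on a tiny made-up data frame (the name toy and its values are just an example, not the ACES data):

# A tiny example data frame with some NaNs
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
toy.isna()        # True/False mask showing exactly where the NaNs are
toy.isna().sum()  # count of NaNs per column: one in 'a', one in 'b'
toy.fillna(0)     # every NaN replaced with 0
toy.dropna()      # only the first row survives - every other row has a NaN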
ACES.dropna(inplace=True)
ACES
Rk | Player | From | To | Yrs | Unnamed: 5 | G | MP | FG | FGA | ... | PTS | Unnamed: 22 | FG% | 3P% | FT% | Unnamed: 26 | MP.1 | PTS.1 | TRB.1 | AST.1 |
---|
0 rows × 31 columns
Oh No! We deleted our data!!!!
We got rid of any row that contains NaN, but if we look at our unnamed columns, they are all NaNs!
Let’s reload the data and try again
# Get the data again
ACES = DF[0].copy()
ACES[columns]
Unnamed: 5 | Unnamed: 22 | Unnamed: 26 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | NaN |
4 | NaN | NaN | NaN |
... | ... | ... | ... |
146 | NaN | NaN | NaN |
147 | NaN | NaN | NaN |
148 | NaN | NaN | NaN |
149 | NaN | NaN | NaN |
150 | NaN | NaN | NaN |
151 rows × 3 columns
ACES[columns].isna().sum()
Unnamed: 5 151
Unnamed: 22 151
Unnamed: 26 151
dtype: int64
Counting the NaNs shows 151 in each of these columns - every single row - so the columns contain no information and we can just ignore them in our future data analysis.
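If you would rather remove them completely, one option (my addition, not a required step) is a quick drop:

ACES = ACES.drop(columns=columns)   # drop the three all-NaN 'Unnamed' columns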
This was an example of when Pandas worked well!
Sometimes you get errors!
- You can try installing the packages that python asks for, but some websites use pretty advanced coding.
- When you see a 403 Forbidden error, this means that the website is trying to stop you from scraping and you would need to use even more advanced techniques!
website = 'https://www.scrapethissite.com/pages/simple/'
df = pd.read_html(website)
HTTPError: HTTP Error 403: Forbidden
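One thing that is sometimes worth trying (assuming the site's policy allows scraping at all): make the request yourself with an identifying User-Agent and check whether the server accepts it. This is a sketch, not a guaranteed fix:

import requests

website = 'https://www.scrapethissite.com/pages/simple/'
headers = {'User-Agent': 'Data science class example - student project'}   # example identifying string
raw = requests.get(website, headers=headers)
print(raw.status_code)   # 200 means the request was accepted, 403 means it was refused
# Note: even if the request succeeds, pd.read_html will still fail if the page has no <table> tags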
Using Beautiful Soup to get HTML code
The code below will work well for static websites - aka sites that don’t use fancy things to actively load content.
Websites are built using HTML code. That code tells the web browser (Firefox, Chrome, etc.) what to display. Websites can range from very simple (just HTML) to much more complicated (JavaScript and more). When you load a website you can always see the source code:
- Right Click - view page source
This is what beautiful soup downloads. For static (simple) sites this code is immediately available. More complicated sites might require Python to open the webpage, let the content render, and then download the code.
How to get data from static sites:
You should already have the packages bs4 and requests but if you get an error try running:
!conda install -y bs4
!conda install -y requests
import requests
from bs4 import BeautifulSoup
website = 'https://www.scrapethissite.com/pages/simple/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
Let's see what is in soup
Uncomment this line and run the cell. You will see a TON of text!
#soup
This is HTML code
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robots"/>
<link href="https://lipis.github.io/flag-icon-css/css/flag-icon.css" rel="stylesheet"/>
</head>
<body>
<nav id="site-nav">
<div class="container">
<div class="col-md-12">
<ul class="nav nav-tabs">
<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
Scrape This Site
</a>
</li>
<li id="nav-sandbox">
<a class="nav-link" href="/pages/">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
Sandbox
</a>
WHAT A MESS!
The information in soup is ALL of the code and unless you are awesome at reading HTML, this is indecipherable. We need to be able to find specific parts of this to extract the data.
Extracting data from HTML:
We will use the soup.find_all() function.
Here is the simplified function signature:
soup.find_all(name=None,attrs={})
You can type soup.find_all? and run it to see all the information about additional arguments and advanced processes.
Here is how we will mostly use it, but there are much more advanced things you can do:
soup.find_all( <type of section>, <info> )
The .find_all() function searches through the information in soup and returns only the sections that match. Here are some important types you might search for (a short example follows this list):
‘h2’ - this is a heading
‘div’ - this divides a block of information
‘span’ - this divides inline information
‘a’ - this specifies a hyperlink
‘li’ - this is a list item
class_= - many things have the class label (notice the underscore!)
string= - you can also search by strings.
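For example, with the soup we just built, searches like these (my own illustrative selectors - the exact results depend on the page) would look like:

soup.find_all('h2')                               # every level-2 heading on the page
soup.find_all('div', class_='country-info')       # every div whose class is "country-info"
soup.find_all('a', href=True)                     # every hyperlink that actually has an href
soup.find_all('span', string='Andorra la Vella')  # spans whose text is exactly this string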
Using Developer tools:
To figure out what data to extract I suggest you use developer tools on the website to find what you need. Navigate to the website: https://www.scrapethissite.com/pages/simple/
I really like Brave Browser or Google Chrome for this, but most browsers will have More Tools/Developer Tools where you can see the code.
Using developer tools you can decide what you are looking for. Let's say I want the country information, here is an example of the code that contains the country information:
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
<strong>Population:</strong> <span class="country-population">84000</span><br>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
</div>
</div>
So I can use soup.find_all() to find pieces of this information!
Search for all the country names
The names of the countries are inside
<h3 class="country-name">
So let's search for this:
result = soup.find_all('h3', class_="country-name")
#print(result)
This is still a mess
[<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
United Arab Emirates
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
Afghanistan
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ag"></i>
Antigua and Barbuda
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ai"></i>
Anguilla
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-al"></i>
Albania
We can see the country names but they are surrounded by other junk. Here is how we will handle this.
We will start an EMPTY data frame using
DF = pd.DataFrame()
We will add our soup.find_all results as a column of the data frame
We will fix the data to strip off all of the unneeded text.
How to get the text
What this returns is a list of all the blocks of code that matched our search. If we just want to look at the text inside each block, we can!
result[0]
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
result[0].text
'\n\n Andorra\n '
result[0].text.lstrip().rstrip()
'Andorra'
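(Side note: result[0].text.strip() does the same thing as chaining .lstrip().rstrip().)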
Put this into a data frame
Now that we see what we need to do to get the data in the form we want, we can use a data frame and a lambda to get the data in a nice format.
DF = pd.DataFrame()
DF['country'] = result
DF['country'] = DF['country'].apply(lambda x: x.text.rstrip().lstrip())
DF
country | |
---|---|
0 | Andorra |
1 | United Arab Emirates |
2 | Afghanistan |
3 | Antigua and Barbuda |
4 | Anguilla |
... | ... |
245 | Yemen |
246 | Mayotte |
247 | South Africa |
248 | Zambia |
249 | Zimbabwe |
250 rows × 1 columns
Let's try again and this time add the country capital
If we look at the code we see the country capital is inside code that looks like this:
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
so let's search for a span where the class is "country-capital"
result = soup.find_all('span', class_="country-capital")
result[0]
<span class="country-capital">Andorra la Vella</span>
result[0].text
'Andorra la Vella'
DF['capital'] = result
DF['capital'] = DF['capital'].apply(lambda x: x.text)
DF
country | capital | |
---|---|---|
0 | Andorra | Andorra la Vella |
1 | United Arab Emirates | Abu Dhabi |
2 | Afghanistan | Kabul |
3 | Antigua and Barbuda | St. John's |
4 | Anguilla | The Valley |
... | ... | ... |
245 | Yemen | Sanaa |
246 | Mayotte | Mamoudzou |
247 | South Africa | Pretoria |
248 | Zambia | Lusaka |
249 | Zimbabwe | Harare |
250 rows × 2 columns
Your turn to try this!
Q Now it’s your turn. See if you can write code that gets the population and area information into the data frame. See if you can make your example match what I get below, including having the correct data types. Population should be an int and area should be a float.
Your goal is to get a data frame that looks like this:
country | capital | population | area | |
---|---|---|---|---|
0 | Andorra | Andorra la Vella | 84000 | 468.0 |
1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
2 | Afghanistan | Kabul | 29121286 | 647500.0 |
3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
4 | Anguilla | The Valley | 13254 | 102.0 |
... | ... | ... | ... | ... |
245 | Yemen | Sanaa | 23495361 | 527970.0 |
246 | Mayotte | Mamoudzou | 159042 | 374.0 |
247 | South Africa | Pretoria | 49000000 | 1219912.0 |
248 | Zambia | Lusaka | 13460305 | 752614.0 |
249 | Zimbabwe | Harare | 11651858 | 390580.0 |
250 rows × 4 columns
DF.dtypes
country object
capital object
population int64
area float64
dtype: object
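One hint for matching the data types (a sketch of just the conversion step, assuming you have already pulled the raw text into population and area columns): pandas' astype can convert number-like strings.

DF['population'] = DF['population'].astype(int)   # whole numbers
DF['area'] = DF['area'].astype(float)             # decimals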
Where can you get more practice
Here is a website dedicated to allowing students to practice web scraping: https://www.scrapethissite.com
There are "sandbox" pages that are intended for scraping. The only thing the site asks is:
Please be Well-Behaved
Just like any site you’d scrape out in the wild wild west (www), please be mindful of other users trying to access the site. From a technical standpoint, you must observe the following rules:
- Clients may only make a maximum of one request per second
- Clients must send an identifying user agent
- Clients must respect this site’s robots.txt file
Any client that violates the rules above or otherwise tries to interfere with the site’s operation will be subject to a temporary or permanent ban.
Be a good web scraping citizen.
You try scraping (with help)
We are going to scrape the site https://www.scrapethissite.com/pages/ to get the list of all links. Open the website and look at the developer tools. You should see something like:
<div class="page">
<h3 class="page-title">
<a href="/pages/simple/">Countries of the World: A Simple Example</a>
</h3>
<p class="lead session-desc">
A single page that lists information about all the countries in the world. Good for those just get started with web scraping.
</p>
<hr>
</div>
for each link on the website.
Here is our goal: Make a pandas data frame that contains three columns:
- “site_name” - which contains just the words of the link
- “link” - which contains just the website part of the link
- “description” - which contains the words below the link
When looking for links you can use
result = soup.find_all('a')
result[0].text # Get the text associated with the link
result[0].get('href') # Get the link location
Exercise 1:
See if you can do this without looking at my code in the notes! What would your plan have to be?
- Get the soup and find_all to get links (notice the first few entries are not helpful)
- make an empty data frame
- add a new column to contain the link text
- fix that text so it looks nice
- add a new column to contain the actual link
- fix that text so it looks nice
- do another find all to find the words
- add a new column to contain the words
- fix the text so it looks nice
The answer (below)
but seriously try this on your own
before
you
look at
the
ANSWER
# Get the soup
website = 'https://www.scrapethissite.com/pages/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
# Search the soup for links
result = soup.find_all('a')
# Look at what we end up with and fix it till it looks good
result[0].text.lstrip().rstrip()
'Scrape This Site'
# The first few are from the top of the page not the main content
result[5].text.lstrip().rstrip()
'Countries of the World: A Simple Example'
Notice:
I don't get the links I want until I get to result[5]. I might have to remove these first few rows from my data eventually!
# Create a data frame and add the site names
DF = pd.DataFrame()
DF['site_name'] = result
DF['site_name'] = DF['site_name'].apply(lambda x: x.text.lstrip().rstrip())
DF
site_name | |
---|---|
0 | Scrape This Site |
1 | Sandbox |
2 | Lessons |
3 | FAQ |
4 | Login |
5 | Countries of the World: A Simple Example |
6 | Hockey Teams: Forms, Searching and Pagination |
7 | Oscar Winning Films: AJAX and Javascript |
8 | Turtles All the Way Down: Frames & iFrames |
9 | Advanced Topics: Real World Challenges You'll ... |
# Look at what we ended up with but this time get the links
result[0].get('href')
'/'
result[5].get('href')
'/pages/simple/'
# Add these to the data frame
DF['link'] = result
DF['link'] = DF['link'].apply(lambda x: x.get('href'))
DF
site_name | link | |
---|---|---|
0 | Scrape This Site | / |
1 | Sandbox | /pages/ |
2 | Lessons | /lessons/ |
3 | FAQ | /faq/ |
4 | Login | /login/ |
5 | Countries of the World: A Simple Example | /pages/simple/ |
6 | Hockey Teams: Forms, Searching and Pagination | /pages/forms/ |
7 | Oscar Winning Films: AJAX and Javascript | /pages/ajax-javascript/ |
8 | Turtles All the Way Down: Frames & iFrames | /pages/frames/ |
9 | Advanced Topics: Real World Challenges You'll ... | /pages/advanced/ |
# EXTRA - let's add the base website info
base_website = 'https://www.scrapethissite.com'
DF['link'] = DF['link'].apply(lambda x: base_website + x)
DF
site_name | link | |
---|---|---|
0 | Scrape This Site | https://www.scrapethissite.com/ |
1 | Sandbox | https://www.scrapethissite.com/pages/ |
2 | Lessons | https://www.scrapethissite.com/lessons/ |
3 | FAQ | https://www.scrapethissite.com/faq/ |
4 | Login | https://www.scrapethissite.com/login/ |
5 | Countries of the World: A Simple Example | https://www.scrapethissite.com/pages/simple/ |
6 | Hockey Teams: Forms, Searching and Pagination | https://www.scrapethissite.com/pages/forms/ |
7 | Oscar Winning Films: AJAX and Javascript | https://www.scrapethissite.com/pages/ajax-java... |
8 | Turtles All the Way Down: Frames & iFrames | https://www.scrapethissite.com/pages/frames/ |
9 | Advanced Topics: Real World Challenges You'll ... | https://www.scrapethissite.com/pages/advanced/ |
# Remove the top few links
DF = DF.drop([0,1,2,3,4])
DF
site_name | link | |
---|---|---|
5 | Countries of the World: A Simple Example | https://www.scrapethissite.com/pages/simple/ |
6 | Hockey Teams: Forms, Searching and Pagination | https://www.scrapethissite.com/pages/forms/ |
7 | Oscar Winning Films: AJAX and Javascript | https://www.scrapethissite.com/pages/ajax-java... |
8 | Turtles All the Way Down: Frames & iFrames | https://www.scrapethissite.com/pages/frames/ |
9 | Advanced Topics: Real World Challenges You'll ... | https://www.scrapethissite.com/pages/advanced/ |
# Now search the soup for the words
# They are inside p class="lead session-desc"
result = soup.find_all('p', class_="lead session-desc")
result[0].text.rstrip().lstrip()
'A single page that lists information about all the countries in the world. Good for those just get started with web scraping.'
DF['description'] = result
DF['description'] = DF['description'].apply(lambda x: x.text.rstrip().lstrip())
DF
site_name | link | description | |
---|---|---|---|
5 | Countries of the World: A Simple Example | https://www.scrapethissite.com/pages/simple/ | A single page that lists information about all... |
6 | Hockey Teams: Forms, Searching and Pagination | https://www.scrapethissite.com/pages/forms/ | Browse through a database of NHL team stats si... |
7 | Oscar Winning Films: AJAX and Javascript | https://www.scrapethissite.com/pages/ajax-java... | Click through a bunch of great films. Learn ho... |
8 | Turtles All the Way Down: Frames & iFrames | https://www.scrapethissite.com/pages/frames/ | Some older sites might still use frames to bre... |
9 | Advanced Topics: Real World Challenges You'll ... | https://www.scrapethissite.com/pages/advanced/ | Scraping real websites, you're likely run into... |
Challenge Problem
Here is another website to scrape. See if you can create a data frame that looks like the one below. Notice that you can only scrape the first page.
If you want to try scraping the other pages you have to notice how the website updates its address for each page. Then write a for loop to loop through however many pages you want to scrape. Do the same set of operations for each page and keep adding the data to your data frame.
Make a histogram of your final data.
website = 'https://books.toscrape.com/index.html'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
Try to scrape the name, the link to the book, and the prices! I decided to put the name and link information into a single column and then break that apart.
names_links | links | names | price | |
---|---|---|---|---|
0 | [[A Light in the ...]] | catalogue/a-light-in-the-attic_1000/index.html | A Light in the ... | 51.77 |
1 | [[Tipping the Velvet]] | catalogue/tipping-the-velvet_999/index.html | Tipping the Velvet | 53.74 |
2 | [[Soumission]] | catalogue/soumission_998/index.html | Soumission | 50.10 |
3 | [[Sharp Objects]] | catalogue/sharp-objects_997/index.html | Sharp Objects | 47.82 |
4 | [[Sapiens: A Brief History ...]] | catalogue/sapiens-a-brief-history-of-humankind... | Sapiens: A Brief History ... | 54.23 |
5 | [[The Requiem Red]] | catalogue/the-requiem-red_995/index.html | The Requiem Red | 22.65 |
6 | [[The Dirty Little Secrets ...]] | catalogue/the-dirty-little-secrets-of-getting-... | The Dirty Little Secrets ... | 33.34 |
7 | [[The Coming Woman: A ...]] | catalogue/the-coming-woman-a-novel-based-on-th... | The Coming Woman: A ... | 17.93 |
8 | [[The Boys in the ...]] | catalogue/the-boys-in-the-boat-nine-americans-... | The Boys in the ... | 22.60 |
9 | [[The Black Maria]] | catalogue/the-black-maria_991/index.html | The Black Maria | 52.15 |
10 | [[Starving Hearts (Triangular Trade ...]] | catalogue/starving-hearts-triangular-trade-tri... | Starving Hearts (Triangular Trade ... | 13.99 |
11 | [[Shakespeare's Sonnets]] | catalogue/shakespeares-sonnets_989/index.html | Shakespeare's Sonnets | 20.66 |
12 | [[Set Me Free]] | catalogue/set-me-free_988/index.html | Set Me Free | 17.46 |
13 | [[Scott Pilgrim's Precious Little ...]] | catalogue/scott-pilgrims-precious-little-life-... | Scott Pilgrim's Precious Little ... | 52.29 |
14 | [[Rip it Up and ...]] | catalogue/rip-it-up-and-start-again_986/index.... | Rip it Up and ... | 35.02 |
15 | [[Our Band Could Be ...]] | catalogue/our-band-could-be-your-life-scenes-f... | Our Band Could Be ... | 57.25 |
16 | [[Olio]] | catalogue/olio_984/index.html | Olio | 23.88 |
17 | [[Mesaerion: The Best Science ...]] | catalogue/mesaerion-the-best-science-fiction-s... | Mesaerion: The Best Science ... | 37.59 |
18 | [[Libertarianism for Beginners]] | catalogue/libertarianism-for-beginners_982/ind... | Libertarianism for Beginners | 51.33 |
19 | [[It's Only the Himalayas]] | catalogue/its-only-the-himalayas_981/index.html | It's Only the Himalayas | 45.17 |
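If you do try the multi-page version, the loop could look roughly like this. The page-number URL pattern is my reading of the site (double check it in your browser), and the extraction step is left for you to fill in:

# Rough sketch: collect several pages into one data frame for plotting
page_frames = []
for page_num in range(1, 4):                      # first three pages as an example
    page_url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
    page_soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    page_df = pd.DataFrame()
    # ... the same find_all / apply steps you used for the first page go here ...
    page_frames.append(page_df)
DF_plot = pd.concat(page_frames, ignore_index=True)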
fig = px.histogram(DF_plot, x='price', color='names')
fig.show()