import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show
# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Introduction to Data Science
Web Scraping in Python
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
NEXT WEEK - Data Ethics. This week you should be reading your book or articles.
Day 12 Assignment - same drill.
- Make sure you pull any new content from the class repo - then copy it over into your working directory.
- Open the file Day##-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far - Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
——————————
Web Scraping Ethical Issues
There are some things to be aware of before you start scraping data from the web.
Some data is private or protected. Just because you have access to a website's data doesn't mean you are allowed to scrape it. For example, when you log into Facebook or another social media site, you are granted special access to data about the people you are connected to. It is unethical to use that access to scrape their private data!
Some websites have rules against scraping and will cut off service to users who are clearly scraping data. How do they know? Web scrapers access the website very differently than regular users. If the site has a policy about scraping data then you should follow it and/or contact them about getting the data if you have a true academic interest in it.
The line between web scraping and plagiarism can be very blurry. Make sure that you are citing where your data comes from AND not just reproducing the data exactly. Always cite the source of your data and make sure you are doing something new with it.
Ethics are different depending on whether you are using the data for a personal project (e.g. you just want to check scores for your favorite team daily and print the stuff you care about) vs. using the project for your business or website (e.g. publishing information to drive clicks to your site/video/account or making money from the data you collect). In the latter case it is EXTRA important to respect the original owner of the data. Drive web traffic back to their site, check with them about using their data, etc.
The Ethical Scraper (from https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):
I, the web scraper will live by the following principles:
- If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping altogether.
- I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
- I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
- I will only save the data I absolutely need from your page. If all I need is OpenGraph meta-data, that’s all I’ll keep.
- I will respect any content I do keep. I’ll never pass it off as my own.
- I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
- I will respond in a timely fashion to your outreach and work with you towards a resolution.
- I will scrape for the purpose of creating new value from the data, not to duplicate it.
I am assuming all the examples below are for personal use.
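To make the "reasonable rate" and "User Agent" principles concrete, here is a minimal sketch of a polite request loop. The header text, the one-second pause, and the list of pages are example choices of mine, not rules from the article:

import time
import requests

# An identifying User-Agent string with a way to contact you (example text)
headers = {'User-Agent': 'Student data science project - contact joanna_bieri@redlands.edu'}

pages = ['https://www.scrapethissite.com/pages/simple/']   # example list of pages to visit
for url in pages:
    response = requests.get(url, headers=headers)   # identify yourself on every request
    time.sleep(1)                                   # wait between requests so you never look like a DDoS attack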
—————–
Using pandas to get table data.
We have already briefly seen this in action!
If the data on the website you are interested in is already written in a table then Pandas can grab that data and save it to a data frame.
Here is an example of how you could get data about a sports team.
I am a fan of the Las Vegas Aces Basketball team and I might want to know more about the stats of their players. Check out the website:
basketball-reference.com/wnba/teams/SAS/players.html
You can see that the data here is already in a table!
= "https://www.basketball-reference.com/wnba/teams/SAS/players.html"
my_website = pd.read_html(my_website) DF
# Let's see what Pandas got for us:
ACES = DF[0].copy()
ACES
Rk | Player | From | To | Yrs | Unnamed: 5 | G | MP | FG | FGA | ... | PTS | Unnamed: 22 | FG% | 3P% | FT% | Unnamed: 26 | MP.1 | PTS.1 | TRB.1 | AST.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Danielle Adams | 2011 | 2015 | 5 | NaN | 155 | 3247 | 624 | 1472 | ... | 1771 | NaN | 0.424 | 0.328 | 0.754 | NaN | 20.9 | 11.4 | 4.3 | 0.9 |
1 | 2 | Elisa Aguilar | 2002 | 2002 | 1 | NaN | 28 | 141 | 14 | 33 | ... | 43 | NaN | 0.424 | 0.524 | 0.571 | NaN | 5.0 | 1.5 | 0.4 | 0.6 |
2 | 3 | Kayla Alexander | 2013 | 2017 | 5 | NaN | 154 | 2038 | 278 | 555 | ... | 692 | NaN | 0.501 | NaN | 0.764 | NaN | 13.2 | 4.5 | 3.1 | 0.3 |
3 | 4 | Lindsay Allen | 2018 | 2020 | 2 | NaN | 45 | 642 | 56 | 139 | ... | 144 | NaN | 0.403 | 0.212 | 0.735 | NaN | 14.3 | 3.2 | 1.2 | 2.7 |
4 | 5 | Chantelle Anderson | 2005 | 2007 | 3 | NaN | 68 | 1168 | 152 | 310 | ... | 384 | NaN | 0.490 | NaN | 0.777 | NaN | 17.2 | 5.6 | 2.9 | 0.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
146 | 147 | Nevriye Yilmaz | 2004 | 2004 | 1 | NaN | 7 | 77 | 6 | 24 | ... | 19 | NaN | 0.250 | 0.143 | 1.000 | NaN | 11.0 | 2.7 | 1.4 | 0.3 |
147 | 148 | Jackie Young | 2019 | 2024 | 6 | NaN | 199 | 5944 | 965 | 2075 | ... | 2685 | NaN | 0.465 | 0.386 | 0.852 | NaN | 29.9 | 13.5 | 4.0 | 4.0 |
148 | 149 | Sophia Young-Malcolm | 2006 | 2015 | 9 | NaN | 301 | 9258 | 1659 | 3545 | ... | 4300 | NaN | 0.468 | 0.223 | 0.718 | NaN | 30.8 | 14.3 | 6.0 | 1.8 |
149 | 150 | Tamera Young | 2018 | 2019 | 2 | NaN | 67 | 1509 | 197 | 495 | ... | 507 | NaN | 0.398 | 0.310 | 0.680 | NaN | 22.5 | 7.6 | 4.4 | 2.4 |
150 | 151 | Shanna Zolman | 2006 | 2009 | 3 | NaN | 87 | 1273 | 225 | 567 | ... | 635 | NaN | 0.397 | 0.402 | 0.811 | NaN | 14.6 | 7.3 | 1.1 | 0.8 |
151 rows × 31 columns
Ask some questions:
Do I know what every column means and what kind of data is in that column?
Are there certain columns I am interested in?
What are all these “Unnamed” columns and do I need them?
Can I come up with some Data focused questions?
ACES.shape
(151, 31)
ACES.columns
Index(['Rk', 'Player', 'From', 'To', 'Yrs', 'Unnamed: 5', 'G', 'MP', 'FG',
'FGA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'TRB', 'AST', 'STL', 'BLK',
'TOV', 'PF', 'PTS', 'Unnamed: 22', 'FG%', '3P%', 'FT%', 'Unnamed: 26',
'MP.1', 'PTS.1', 'TRB.1', 'AST.1'],
dtype='object')
columns = ['Unnamed: 5', 'Unnamed: 22', 'Unnamed: 26']
ACES[columns]
Unnamed: 5 | Unnamed: 22 | Unnamed: 26 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | NaN |
4 | NaN | NaN | NaN |
... | ... | ... | ... |
146 | NaN | NaN | NaN |
147 | NaN | NaN | NaN |
148 | NaN | NaN | NaN |
149 | NaN | NaN | NaN |
150 | NaN | NaN | NaN |
151 rows × 3 columns
Dealing with NaNs
So there is nothing in the unnamed columns! Are all of the rows NaN?
There are some really nice commands for dealing with NaN’s in your data:
First, NaNs are a strange data type (np.nan): they are considered a float - like a decimal. In most raw data sets NaN means no data was given for that observation and variable, but be careful - NaN can also show up if a calculation goes wrong (for example 0/0).
.isna() creates a mask showing whether or not there is a NaN in each entry of the data.
.fillna() will replace NaN in your data set with whatever you put inside the parentheses.
.dropna() will drop all rows that contain NaN - be careful with this command. You want to keep as much data as possible and .dropna() might delete too much!
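Here is a minimal sketch of these commands on a tiny made-up data frame (the name toy and its values are just an example, not the ACES data):

# A tiny example data frame with some NaNs
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
toy.isna()        # True/False mask showing exactly where the NaNs are
toy.isna().sum()  # count of NaNs per column: one in 'a', one in 'b'
toy.fillna(0)     # every NaN replaced with 0
toy.dropna()      # only the first row survives - every other row has a NaN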
ACES.dropna(inplace=True)
ACES
Rk | Player | From | To | Yrs | Unnamed: 5 | G | MP | FG | FGA | ... | PTS | Unnamed: 22 | FG% | 3P% | FT% | Unnamed: 26 | MP.1 | PTS.1 | TRB.1 | AST.1 |
---|
0 rows × 31 columns
Oh No! We deleted our data!!!!
We got rid of any row that contains NaN, but if we look at our unnamed columns, they are all NaNs!
Let’s reload the data and try again
# Get the data again
ACES = DF[0].copy()
ACES[columns]
Unnamed: 5 | Unnamed: 22 | Unnamed: 26 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | NaN |
4 | NaN | NaN | NaN |
... | ... | ... | ... |
146 | NaN | NaN | NaN |
147 | NaN | NaN | NaN |
148 | NaN | NaN | NaN |
149 | NaN | NaN | NaN |
150 | NaN | NaN | NaN |
151 rows × 3 columns
ACES[columns].isna().sum()
Unnamed: 5 151
Unnamed: 22 151
Unnamed: 26 151
dtype: int64
Counting the NaNs shows 151 in each of these columns - every single row - so the columns contain no information and we can just ignore them in our future data analysis.
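If you would rather remove them completely, one option (my addition, not a required step) is a quick drop:

ACES = ACES.drop(columns=columns)   # drop the three all-NaN 'Unnamed' columns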
This was an example of when Pandas worked well!
Sometimes you get errors!
- You can try installing the packages that python asks for, but some websites use pretty advanced coding.
- When you see a 403 Forbidden error, this means that the website is trying to stop you from scraping and you would need to use even more advanced techniques!
website = 'https://www.scrapethissite.com/pages/simple/'
df = pd.read_html(website)
HTTPError: HTTP Error 403: Forbidden
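One thing that is sometimes worth trying (assuming the site's policy allows scraping at all): make the request yourself with an identifying User-Agent and check whether the server accepts it. This is a sketch, not a guaranteed fix:

import requests

website = 'https://www.scrapethissite.com/pages/simple/'
headers = {'User-Agent': 'Data science class example - student project'}   # example identifying string
raw = requests.get(website, headers=headers)
print(raw.status_code)   # 200 means the request was accepted, 403 means it was refused
# Note: even if the request succeeds, pd.read_html will still fail if the page has no <table> tags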
Using Beautiful Soup to get HTML code
The code below will work well for static websites - aka sites that don’t use fancy things to actively load content.
Websites are built using HTML code. That code tells the web browser (Firefox, Chrome, etc.) what to display. Websites can range from very simple (just HTML) to much more complicated (JavaScript and more). When you load a website you can always see the source code:
- Right Click - view page source
This is what beautiful soup downloads. For static (simple) sites this code is immediately available. More complicated sites might require Python to open the webpage, let the content render, and then download the code.
How to get data from static sites:
You should already have the packages bs4 and requests but if you get an error try running:
!conda install -y bs4
!conda install -y requests
import requests
from bs4 import BeautifulSoup
website = 'https://www.scrapethissite.com/pages/simple/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
Let's see what is in soup
Uncomment this line and run the cell. You will see a TON of text!
#soup
This is HTML code
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robots"/>
<link href="https://lipis.github.io/flag-icon-css/css/flag-icon.css" rel="stylesheet"/>
</head>
<body>
<nav id="site-nav">
<div class="container">
<div class="col-md-12">
<ul class="nav nav-tabs">
<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
Scrape This Site
</a>
</li>
<li id="nav-sandbox">
<a class="nav-link" href="/pages/">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
Sandbox
</a>
WHAT A MESS!
The information in soup is ALL of the code and unless you are awesome at reading HTML, this is indecipherable. We need to be able to find specific parts of this to extract the data.
Extracting data from HTML:
We will use the soup.find_all() function.
Here is the simplified function signature:
soup.find_all(name=None,attrs={})
You can type soup.find_all? and run it to see all the information about additional arguments and advanced processes.
Here is how we will mostly use it, but there are much more advanced things you can do:
soup.find_all( <type of section>, <info> )
The .find_all() function searches through the information in soup and returns only the sections that match. Here are some important types you might search for (a short example follows this list):
‘h2’ - this is a heading
‘div’ - this divides a block of information
‘span’ - this divides inline information
‘a’ - this specifies a hyperlink
‘li’ - this is a list item
class_= - many things have the class label (notice the underscore!)
string= - you can also search by strings.
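For example, with the soup we just built, searches like these (my own illustrative selectors - the exact results depend on the page) would look like:

soup.find_all('h2')                               # every level-2 heading on the page
soup.find_all('div', class_='country-info')       # every div whose class is "country-info"
soup.find_all('a', href=True)                     # every hyperlink that actually has an href
soup.find_all('span', string='Andorra la Vella')  # spans whose text is exactly this string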
Using Developer tools:
To figure out what data to extract I suggest you use developer tools on the website to find what you need. Navigate to the website: https://www.scrapethissite.com/pages/simple/
I really like Brave Browser or Google Chrome for this, but most browsers will have More Tools/Developer Tools where you can see the code.
Using developer tools you can decide what you are looking for. Let's say I want the country information, here is an example of the code that contains the country information:
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
<strong>Population:</strong> <span class="country-population">84000</span><br>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
</div>
</div>
So I can use soup.find_all() to find pieces of this information!
Search for all the country names
The names of the countries are inside
<h3 class="country-name">
So let's search for this:
result = soup.find_all('h3', class_="country-name")
#print(result)
This is still a mess
[<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
United Arab Emirates
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
Afghanistan
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ag"></i>
Antigua and Barbuda
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ai"></i>
Anguilla
</h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-al"></i>
Albania
We can see the country names but they are surrounded by other junk. Here is how we will handle this.
We will start an EMPTY data frame using
DF = pd.DataFrame()
We will add our soup.find_all results as a column of the data frame
We will fix the data to strip off all of the unneeded text.
How to get the text
What this returns is a list of all the blocks of code that matched our search. If we just want to look at the text inside each block, we can!
result[0]
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
result[0].text
'\n\n Andorra\n '
result[0].text.lstrip().rstrip()
'Andorra'
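(Side note: result[0].text.strip() does the same thing as chaining .lstrip().rstrip().)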
Put this into a data frame
Now that we see what we need to do to get the data in the form we want, we can use a data frame and a lambda to get the data in a nice format.
DF = pd.DataFrame()
DF['country'] = result
DF['country'] = DF['country'].apply(lambda x: x.text.rstrip().lstrip())
DF
country | |
---|---|
0 | Andorra |
1 | United Arab Emirates |
2 | Afghanistan |
3 | Antigua and Barbuda |
4 | Anguilla |
... | ... |
245 | Yemen |
246 | Mayotte |
247 | South Africa |
248 | Zambia |
249 | Zimbabwe |
250 rows × 1 columns
Let's try again and this time add the country capital
If we look at the code we see the country capital is inside code that looks like this:
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
so let's search for a span where the class is "country-capital"
result = soup.find_all('span', class_="country-capital")
result[0]
<span class="country-capital">Andorra la Vella</span>
result[0].text
'Andorra la Vella'
DF['capital'] = result
DF['capital'] = DF['capital'].apply(lambda x: x.text)
DF
country | capital | |
---|---|---|
0 | Andorra | Andorra la Vella |
1 | United Arab Emirates | Abu Dhabi |
2 | Afghanistan | Kabul |
3 | Antigua and Barbuda | St. John's |
4 | Anguilla | The Valley |
... | ... | ... |
245 | Yemen | Sanaa |
246 | Mayotte | Mamoudzou |
247 | South Africa | Pretoria |
248 | Zambia | Lusaka |
249 | Zimbabwe | Harare |
250 rows × 2 columns
Your turn to try this!
Q Now it’s your turn. See if you can write code that gets the population and area information into the data frame. See if you can make your example match what I get below, including having the correct data types. Population should be an int and area should be a float.
Your goal is to get a data frame that looks like this:
country | capital | population | area | |
---|---|---|---|---|
0 | Andorra | Andorra la Vella | 84000 | 468.0 |
1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
2 | Afghanistan | Kabul | 29121286 | 647500.0 |
3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
4 | Anguilla | The Valley | 13254 | 102.0 |
... | ... | ... | ... | ... |
245 | Yemen | Sanaa | 23495361 | 527970.0 |
246 | Mayotte | Mamoudzou | 159042 | 374.0 |
247 | South Africa | Pretoria | 49000000 | 1219912.0 |
248 | Zambia | Lusaka | 13460305 | 752614.0 |
249 | Zimbabwe | Harare | 11651858 | 390580.0 |
250 rows × 4 columns
DF.dtypes
country object
capital object
population int64
area float64
dtype: object
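One hint for matching the data types (a sketch of just the conversion step, assuming you have already pulled the raw text into population and area columns): pandas' astype can convert number-like strings.

DF['population'] = DF['population'].astype(int)   # whole numbers
DF['area'] = DF['area'].astype(float)             # decimals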
Where can you get more practice
Here is a website dedicated to allowing students to practice web scraping: https://www.scrapethissite.com
There are "sandbox" pages that are intended for scraping. The only thing the site asks is:
Please be Well-Behaved
Just like any site you’d scrape out in the wild wild west (www), please be mindful of other users trying to access the site. From a technical standpoint, you must observe the following rules:
- Clients may only make a maximum of one request per second
- Clients must send an identifying user agent
- Clients must respect this site’s robots.txt file
Any client that violates the rules above or otherwise tries to interfere with the site’s operation will be subject to a temporary or permanent ban.
Be a good web scraping citizen.
You try scraping (with help)
We are going to scrape the site https://www.scrapethissite.com/pages/ to get the list of all links. Open the website and look at the developer tools. You should see something like:
<div class="page">
<h3 class="page-title">
<a href="/pages/simple/">Countries of the World: A Simple Example</a>
</h3>
<p class="lead session-desc">
A single page that lists information about all the countries in the world. Good for those just get started with web scraping.
</p>
<hr>
</div>
for each link on the website.
Here is our goal: Make a pandas data frame that contains three columns:
- “site_name” - which contains just the words of the link
- “link” - which contains just the website part of the link
- “description” - which contains the words below the link
When looking for links you can use
result = soup.find_all('a')
result[0].text # Get the text associated with the link
result[0].get('href') # Get the link location
Exercise 1:
See if you can do this without looking at my code in the notes! What would your plan have to be?
- Get the soup and find_all to get links (notice the first few entries are not helpful)
- make an empty data frame
- add a new column to contain the link text
- fix that text so it looks nice
- add a new column to contain the actual link
- fix that text so it looks nice
- do another find all to find the words
- add a new column to contain the words
- fix the text so it looks nice
The answer (below)
but seriously try this on your own
before
you
look at
the
ANSWER
# Get the soup
website = 'https://www.scrapethissite.com/pages/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
# Search the soup for links
result = soup.find_all('a')
# Look at what we end up with and fix it till it looks good
result[0].text.lstrip().rstrip()
'Scrape This Site'
# The first few are from the top of the page not the main content
result[5].text.lstrip().rstrip()
'Countries of the World: A Simple Example'
Notice:
I don't get the links I want until I get to result[5]. I might have to remove these first few rows from my data eventually!
# Create a data frame and add the site names
DF = pd.DataFrame()
DF['site_name'] = result
DF['site_name'] = DF['site_name'].apply(lambda x: x.text.lstrip().rstrip())
DF
site_name | |
---|---|
0 | Scrape This Site |
1 | Sandbox |
2 | Lessons |
3 | FAQ |
4 | Login |
5 | Countries of the World: A Simple Example |
6 | Hockey Teams: Forms, Searching and Pagination |
7 | Oscar Winning Films: AJAX and Javascript |
8 | Turtles All the Way Down: Frames & iFrames |
9 | Advanced Topics: Real World Challenges You'll ... |
# Look at what we ended up with but this time get the links
result[0].get('href')
'/'
result[5].get('href')
'/pages/simple/'
# Add these to the data frame
DF['link'] = result
DF['link'] = DF['link'].apply(lambda x: x.get('href'))
DF
site_name | link | |
---|---|---|
0 | Scrape This Site | / |
1 | Sandbox | /pages/ |
2 | Lessons | /lessons/ |
3 | FAQ | /faq/ |
4 | Login | /login/ |
5 | Countries of the World: A Simple Example | /pages/simple/ |
6 | Hockey Teams: Forms, Searching and Pagination | /pages/forms/ |
7 | Oscar Winning Films: AJAX and Javascript | /pages/ajax-javascript/ |
8 | Turtles All the Way Down: Frames & iFrames | /pages/frames/ |
9 | Advanced Topics: Real World Challenges You'll ... | /pages/advanced/ |
# EXTRA - let's add the base website info
base_website = 'https://www.scrapethissite.com'
DF['link'] = DF['link'].apply(lambda x: base_website + x)
DF
site_name | link | |
---|---|---|
0 | Scrape This Site | https://www.scrapethissite.com/ |
1 | Sandbox | https://www.scrapethissite.com/pages/ |
2 | Lessons | https://www.scrapethissite.com/lessons/ |
3 | FAQ | https://www.scrapethissite.com/faq/ |
4 | Login | https://www.scrapethissite.com/login/ |
5 | Countries of the World: A Simple Example | https://www.scrapethissite.com/pages/simple/ |
6 | Hockey Teams: Forms, Searching and Pagination | https://www.scrapethissite.com/pages/forms/ |
7 | Oscar Winning Films: AJAX and Javascript | https://www.scrapethissite.com/pages/ajax-java... |
8 | Turtles All the Way Down: Frames & iFrames | https://www.scrapethissite.com/pages/frames/ |
9 | Advanced Topics: Real World Challenges You'll ... | https://www.scrapethissite.com/pages/advanced/ |
# Remove the top few links
DF = DF.drop([0,1,2,3,4])
DF
site_name | link | |
---|---|---|
5 | Countries of the World: A Simple Example | https://www.scrapethissite.com/pages/simple/ |
6 | Hockey Teams: Forms, Searching and Pagination | https://www.scrapethissite.com/pages/forms/ |
7 | Oscar Winning Films: AJAX and Javascript | https://www.scrapethissite.com/pages/ajax-java... |
8 | Turtles All the Way Down: Frames & iFrames | https://www.scrapethissite.com/pages/frames/ |
9 | Advanced Topics: Real World Challenges You'll ... | https://www.scrapethissite.com/pages/advanced/ |
# Now search the soup for the words
# They are inside p class="lead session-desc"
result = soup.find_all('p', class_="lead session-desc")
result[0].text.rstrip().lstrip()
'A single page that lists information about all the countries in the world. Good for those just get started with web scraping.'
DF['description'] = result
DF['description'] = DF['description'].apply(lambda x: x.text.rstrip().lstrip())
DF
site_name | link | description | |
---|---|---|---|
5 | Countries of the World: A Simple Example | https://www.scrapethissite.com/pages/simple/ | A single page that lists information about all... |
6 | Hockey Teams: Forms, Searching and Pagination | https://www.scrapethissite.com/pages/forms/ | Browse through a database of NHL team stats si... |
7 | Oscar Winning Films: AJAX and Javascript | https://www.scrapethissite.com/pages/ajax-java... | Click through a bunch of great films. Learn ho... |
8 | Turtles All the Way Down: Frames & iFrames | https://www.scrapethissite.com/pages/frames/ | Some older sites might still use frames to bre... |
9 | Advanced Topics: Real World Challenges You'll ... | https://www.scrapethissite.com/pages/advanced/ | Scraping real websites, you're likely run into... |
Challenge Problem
Here is another website to scrape. See if you can create a data frame that looks like the one below. Notice that you can only scrape the first page.
If you want to try scraping the other pages you have to notice how the website updates its address for each page. Then write a for loop to loop through however many pages you want to scrape. Do the same set of operations for each page and keep adding the data to your data frame.
Make a histogram of your final data.
website = 'https://books.toscrape.com/index.html'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
Try to scrape the name, the link to the book, and the prices! I decided to put the name and link information into a single column and then break that apart.
names_links | links | names | price | |
---|---|---|---|---|
0 | [[A Light in the ...]] | catalogue/a-light-in-the-attic_1000/index.html | A Light in the ... | 51.77 |
1 | [[Tipping the Velvet]] | catalogue/tipping-the-velvet_999/index.html | Tipping the Velvet | 53.74 |
2 | [[Soumission]] | catalogue/soumission_998/index.html | Soumission | 50.10 |
3 | [[Sharp Objects]] | catalogue/sharp-objects_997/index.html | Sharp Objects | 47.82 |
4 | [[Sapiens: A Brief History ...]] | catalogue/sapiens-a-brief-history-of-humankind... | Sapiens: A Brief History ... | 54.23 |
5 | [[The Requiem Red]] | catalogue/the-requiem-red_995/index.html | The Requiem Red | 22.65 |
6 | [[The Dirty Little Secrets ...]] | catalogue/the-dirty-little-secrets-of-getting-... | The Dirty Little Secrets ... | 33.34 |
7 | [[The Coming Woman: A ...]] | catalogue/the-coming-woman-a-novel-based-on-th... | The Coming Woman: A ... | 17.93 |
8 | [[The Boys in the ...]] | catalogue/the-boys-in-the-boat-nine-americans-... | The Boys in the ... | 22.60 |
9 | [[The Black Maria]] | catalogue/the-black-maria_991/index.html | The Black Maria | 52.15 |
10 | [[Starving Hearts (Triangular Trade ...]] | catalogue/starving-hearts-triangular-trade-tri... | Starving Hearts (Triangular Trade ... | 13.99 |
11 | [[Shakespeare's Sonnets]] | catalogue/shakespeares-sonnets_989/index.html | Shakespeare's Sonnets | 20.66 |
12 | [[Set Me Free]] | catalogue/set-me-free_988/index.html | Set Me Free | 17.46 |
13 | [[Scott Pilgrim's Precious Little ...]] | catalogue/scott-pilgrims-precious-little-life-... | Scott Pilgrim's Precious Little ... | 52.29 |
14 | [[Rip it Up and ...]] | catalogue/rip-it-up-and-start-again_986/index.... | Rip it Up and ... | 35.02 |
15 | [[Our Band Could Be ...]] | catalogue/our-band-could-be-your-life-scenes-f... | Our Band Could Be ... | 57.25 |
16 | [[Olio]] | catalogue/olio_984/index.html | Olio | 23.88 |
17 | [[Mesaerion: The Best Science ...]] | catalogue/mesaerion-the-best-science-fiction-s... | Mesaerion: The Best Science ... | 37.59 |
18 | [[Libertarianism for Beginners]] | catalogue/libertarianism-for-beginners_982/ind... | Libertarianism for Beginners | 51.33 |
19 | [[It's Only the Himalayas]] | catalogue/its-only-the-himalayas_981/index.html | It's Only the Himalayas | 45.17 |
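If you do try the multi-page version, the loop could look roughly like this. The page-number URL pattern is my reading of the site (double check it in your browser), and the extraction step is left for you to fill in:

# Rough sketch: collect several pages into one data frame for plotting
page_frames = []
for page_num in range(1, 4):                      # first three pages as an example
    page_url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
    page_soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    page_df = pd.DataFrame()
    # ... the same find_all / apply steps you used for the first page go here ...
    page_frames.append(page_df)
DF_plot = pd.concat(page_frames, ignore_index=True)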
fig = px.histogram(DF_plot, x='price', color='names')
fig.show()