Data scraping is the process of collecting and parsing raw data from the web. It is a very useful skill for data analysts and one of the most efficient ways to get data from the web, and in some cases to feed that data into another website or application. In this notebook we are going to scrape data with different methods and for different types of web pages. For more detailed documentation, check the articles I wrote on Medium explaining my process in depth:
Scraping Data From Static Website Using Beautiful Soup
Data Scraping in a Dynamic Web Page with Python and Selenium
API-Based Web Scraping Using Python
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
There are many types of data scraping, and the right one depends on the type of website you want to scrape and on the type of data you want to collect (images, tables, etc.). In this notebook we will use 3 different data scraping methods.
1. Static Web Scraping : used on static websites. A static website has fixed HTML code and its content does not change.
2. Dynamic Web Scraping : used on dynamic websites. A dynamic website uses JavaScript to update content in real time (see the short sketch after this list for why this matters).
3. API-Based Web Scraping : a data extraction approach designed for specific websites, databases or programs.
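As a quick illustration of the static/dynamic distinction (a minimal sketch using the angular.dev site that we revisit later with Selenium): requests only returns the server's initial HTML, so any content rendered by JavaScript will not appear in the response.

import requests

# Fetch the raw HTML returned by the server.
# For a dynamic site, JavaScript-rendered content is missing from this string,
# which is why we will need a browser driver (Selenium) later.
html = requests.get("https://angular.dev").text
print(len(html))
print(html[:200])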
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("https://www.worldometers.info/population/countries-in-europe-by-population/").text, "html.parser")
table = soup.find("table", {"id": "example2"})
header = []
rows = []
for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        header = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])
Now, let's check that our data looks right in a DataFrame, and then we will save it to a CSV file.
#convert our list to dataframe
df = pd.DataFrame(rows, columns=header)
df.head()
df.info()
#save dataframe to csv file
df.to_csv("population_in_Europe.csv", index=False)
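As a side note, simple HTML tables can often be grabbed in one step with pandas. This is a minimal sketch, not part of the workflow above, and it assumes the page still exposes the table with id="example2" and that an HTML parser such as lxml is installed.

import pandas as pd

# Let pandas locate and parse the table directly from the page
tables = pd.read_html(
    "https://www.worldometers.info/population/countries-in-europe-by-population/",
    attrs={"id": "example2"},
)
df_alt = tables[0]
print(df_alt.head())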
The library that we are going to use to scrape our page is Selenium. Selenium is a browser automation tool with Python bindings that can load and render websites in a browser like Chrome or Firefox.
Tip: Before you start using the code below, keep in mind that you should run it locally (I used Jupyter Notebook from Anaconda). That's why I am writing it here as comments and not as code.
In our first attempt we are going to get the entire website content.
# install required libraries
# %pip install selenium
# %pip install webdriver-manager
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service as ChromeService
# from webdriver_manager.chrome import ChromeDriverManager
# load website
# url = 'https://angular.dev'
# instantiate driver
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
# get the entire website content
# driver.get(url)
# print(driver.page_source)
Now let's attempt something different. We are going to save into a dictionary all the links the web page has.
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.chrome.service import Service as ChromeService
# from webdriver_manager.chrome import ChromeDriverManager
# instantiate options
# options = webdriver.ChromeOptions()
# run browser in headless mode
# options.add_argument("--headless")
# instantiate driver
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
# load website
# url = 'https://angular.dev'
# get the entire website content
# driver.get(url)
# create an empty dictionary
# link={}
# create a parameter for the dictionary
# j=0
# select elements by tag name
# a_elements = driver.find_elements(By.TAG_NAME, "a")
# for i in a_elements:
#     # select link, within element
#     link[j] = i.get_attribute("href")
#     # print links
#     print(link[j])
#     j = j + 1
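If you run this locally, here is a slightly tidier sketch of the same idea (my variant, not required): it collects the hrefs into a plain list and closes the browser when done.

# Run locally, not on Kaggle; assumes Chrome is available on the machine.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run the browser without a visible window

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()), options=options
)
try:
    driver.get("https://angular.dev")
    # collect every href on the page into a list
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    print(links)
finally:
    driver.quit()  # always release the browser when finished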
In this example we will gather some data about different countries from Wikipedia, as it has a lot of relevant information. We will use the MediaWiki API to retrieve that data.
# install required tools
%pip install wptools
%pip install wikipedia
%pip install wordcloud
%pip install chardet
# import required libraries
import json
import wptools
import wikipedia
import pandas as pd
import chardet
# checking the installed version
print('wptools version : {}'.format(wptools.__version__))
I am going to use a list with all the country names. I found it here. You can easily find datasets like this in many repositories for free. I then uploaded it to my notebook, but I couldn't read it with pandas; perhaps the encoding was different. So, I used the chardet library in order to read this CSV file.
The chardet library reads the file in binary mode and tries to detect the encoding based on the byte sequences in the file. Once the encoding is detected, we pass it to the encoding parameter of the pd.read_csv() function.
# read csv file
with open('/kaggle/input/list-of-countries/list-of-countries.csv', 'rb') as f:
    result = chardet.detect(f.read())
df = pd.read_csv('/kaggle/input/list-of-countries/list-of-countries.csv', encoding=result['encoding'])
df.head()
This list is only for learning purposes, so I am going to use only the first 10 countries from my list. But go ahead and use all of it if you want.
# number of countries we are interested in
No_countries = 10
# only selecting the first 10 countries of the list
df_10 = df.iloc[:No_countries, :].copy()
# converting the column to a list
countries = df_10['Common Name'].tolist()
# looping through the list of 10 countries
for i, j in enumerate(countries):
    print('{}. {}'.format(i+1, j))
One issue with matching the 10 countries from our list to their Wikipedia article names is that the two may not be exactly the same, i.e. they may not match character for character. There can be slight variations in the names.
To overcome this problem and ensure that every country name is matched to its corresponding Wikipedia article, we will use the wikipedia package to get suggestions for the country names and their equivalents on Wikipedia.
# searching for suggestions in wikipedia
wiki_search = [{country : wikipedia.search(country)} for country in countries]
for idx, country in enumerate(wiki_search):
    for i, j in country.items():
        print('{}. {} :\n{}'.format(idx+1, i, ', '.join(j)))
        print('\n')
Now let's get the most probable ones (the first suggestion) for each of the first 10 countries.
most_probable = [(country, wiki_search[i][country][0]) for i, country in enumerate(countries)]
countries = [x[1] for x in most_probable]
print('Most Probable: \n', most_probable,'\n')
# print final list
print('Final List: \n', countries)
Now that we have mapped the country names to their corresponding Wikipedia articles, let's retrieve the infobox data from those pages.
wptools provides easy-to-use methods that call the MediaWiki API on our behalf and get us all the Wikipedia data. Let's try retrieving data for Afghanistan, the first name on the list.
# parses the wikipedia article
page = wptools.page('Afghanistan')
page.get_parse()
page.data.keys()
As we can see from the output above, wptools successfully retrieved the Wikipedia and Wikidata content corresponding to the query Afghanistan. We only want data from the infobox, so let's retrieve it and see what we are going to keep.
page.data['infobox']
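Before choosing which fields to keep, it can help to list the available infobox keys. This is a minimal sketch, assuming the page object from the cell above; the infobox is returned as a dictionary.

# Inspect the available infobox keys to decide which features to extract
infobox = page.data['infobox']
print(sorted(infobox.keys()))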
We will define a list of features that we want from the infoboxes as follows:
# create an empty list
wiki_data = []
# attributes of interest contained within the wiki infoboxes
features = ['population_estimate', 'population_estimate_year', 'GDP_PPP', 'GDP_PPP_year', 'GDP_PPP_rank', 'GDP_PPP_per_capita',
'GDP_PPP_per_capita_rank']
# fetching the data for all 10 countries
for country in countries:
    page = wptools.page(country)  # create a page object
    try:
        page.get_parse()  # call the API and parse the data
        if page.data['infobox'] is not None:
            # if infobox is present
            infobox = page.data['infobox']
            # get data for the features/attributes of interest
            data = {feature: infobox[feature] if feature in infobox else ''
                    for feature in features}
        else:
            data = {feature: '' for feature in features}
        data['country_name'] = country
        wiki_data.append(data)
    except KeyError:
        pass
# checking the first entity of our list
wiki_data[0]
And for the last part, save it into a JSON file for later use. The next step, of course, is wrangling and cleaning our dataset. Enjoy!
with open('infoboxes.json', 'w') as file:
    json.dump(wiki_data, file)
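To sanity-check the saved file, you can read it straight back into pandas. A minimal sketch, assuming the file name used above:

import json
import pandas as pd

# Load the saved infobox data back into a DataFrame for a quick look
with open('infoboxes.json') as file:
    loaded = json.load(file)

print(pd.DataFrame(loaded).head())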
There is a lot of misinformation about web scraping. Many people believe it is illegal, but that is generally only the case when you try to scrape personal information, such as a person's private messages. There are some rules to follow when you are scraping data.
Consult Legal Advice: When in doubt, make sure you consult legal counsel who specializes in technology and internet law. They can provide advice specific to your situation and jurisdiction.
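One practical habit (my own suggestion, not a legal guarantee) is to check a site's robots.txt before scraping it. A minimal sketch using the standard library's robotparser, with the Worldometers page from earlier as the example URL:

from urllib import robotparser

# Check whether scraping a given URL is allowed by the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://www.worldometers.info/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.worldometers.info/population/countries-in-europe-by-population/"))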
https://blog.gopenai.com/web-scraping-with-python-part-1-simple-static-website-84e0bddd5acd
https://blog.gopenai.com/web-scraping-with-python-part-2-dynamic-website-e0364a89e058
https://saturncloud.io/blog/how-to-fix-the-pandas-unicodedecodeerror-utf8-codec-cant-decode-bytes-in-position-01-invalid-continuation-byte-error/
https://monashdatafluency.github.io/python-web-scraping/section-3-API-based-scraping/