Data Scraping Methods with Examples

Data scraping is the process of collecting and parsing raw data from the web. It is a very useful skill for data analysts and one of the most efficient ways to get data from the web, and in some cases to channel that data to another website. In this notebook we are going to scrape data with different methods and for different types of web pages. For more detailed documentation, check the articles I wrote on Medium explaining my process in depth:

Scraping Data From Static Website Using Beautiful Soup
Data Scraping in a Dynamic Web Page with Python and Selenium
API-Based Web Scraping Using Python

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/list-of-countries/list-of-countries.csv

There are many types of data scraping, and the right one depends on the type of website you want to scrape and the type of data you want to collect (images, tables, etc.). In this notebook we will use 3 different data scraping methods.
1. Static Web Scraping : Used on static websites. A static website has fixed HTML code that does not change.
2. Dynamic Web Scraping : Used on dynamic websites. A dynamic website uses JavaScript to update its content in real time.
3. API-Based Web Scraping : Retrieves data through an API designed for a specific website, database, or program.

1. Scraping Data From Static Website Using Beautiful Soup

We are going to scrape data from worldometers.info; the exact page is the list of European countries by population (the URL appears in the code below). It will be easier to follow this section with some basic knowledge of HTML structure.

What are we going to do with the code below?

  • Make a soup object that contains all the HTML code of our page
  • Find the table in the soup and save it in a variable named "table"
  • Create 2 empty lists: one for the header and one for all the rows of the table
  • Use a for loop to save our data into the empty lists.
In [2]:
from bs4 import BeautifulSoup
import requests 

soup = BeautifulSoup(requests.get("https://www.worldometers.info/population/countries-in-europe-by-population/").text, "html.parser")
table = soup.find("table", {"id": "example2"})
header = []
rows = []
for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        header = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])

Now, let's check that our data looks right in a dataframe, and then we will save it to a csv file.

In [3]:
# convert our lists to a dataframe
df = pd.DataFrame(rows, columns=header)
df.head()
Out[3]:
# Country (or dependency) Population (2024) Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban Pop % World Share
0 1 Russia 144,820,423 -0.43 % -620,077 9 16,376,870 -178,042 1.5 40 75 % 1.77 %
1 2 Germany 84,552,242 0.00 % 4,011 243 348,560 36,954 1.4 45 76 % 1.04 %
2 3 United Kingdom 69,138,192 0.66 % 455,230 286 241,930 417,114 1.6 40 84 % 0.85 %
3 4 France 66,548,530 0.17 % 109,708 122 547,557 90,527 1.6 42 82 % 0.82 %
4 5 Italy 59,342,867 -0.26 % -156,586 202 294,140 95,246 1.2 48 72 % 0.73 %
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   #                        47 non-null     object
 1   Country (or dependency)  47 non-null     object
 2   Population (2024)        47 non-null     object
 3   Yearly Change            47 non-null     object
 4   Net Change               47 non-null     object
 5   Density (P/Km²)          47 non-null     object
 6   Land Area (Km²)          47 non-null     object
 7   Migrants (net)           47 non-null     object
 8   Fert. Rate               47 non-null     object
 9   Med. Age                 47 non-null     object
 10  Urban Pop %              47 non-null     object
 11  World Share              47 non-null     object
dtypes: object(12)
memory usage: 4.5+ KB
In [5]:
# save dataframe to a csv file (the dataframe already carries the column names)
df.to_csv("population_in_Europe.csv", index=False)

2. Data Scraping in a Dynamic Web Page with Selenium

The library that we are going to use to scrape our page is Selenium. Selenium is a browser automation framework with Python bindings; it can load and render websites in a real browser such as Chrome or Firefox.

Tip: Before you use the code below, keep in mind that you should run it locally (I used Jupyter Notebook from Anaconda). That's why I am writing it here as comments rather than as runnable code.

In our first attempt we are going to get the entire website content.

In [6]:
# install required libraries
# %pip install selenium
# %pip install webdriver-manager

# from selenium import webdriver 
# from selenium.webdriver.chrome.service import Service as ChromeService 
# from webdriver_manager.chrome import ChromeDriverManager 
 
# load website
# url = 'https://angular.dev' 

# instantiate driver 
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install())) 
# get the entire website content
# driver.get(url) 
 
# print(driver.page_source)
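
Because a dynamic page renders its content with JavaScript, it is often safer to wait for a specific element to appear before reading the page. Here is a minimal sketch using Selenium's explicit waits, again written as comments because it must run locally; the h1 tag is just an assumed example, so adjust the locator to the element you need.

# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

# driver = webdriver.Chrome()
# driver.get('https://angular.dev')

# wait up to 10 seconds for the first <h1> to be rendered by JavaScript
# heading = WebDriverWait(driver, 10).until(
#     EC.presence_of_element_located((By.TAG_NAME, 'h1'))
# )
# print(heading.text)
# driver.quit()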

Now let’s attempt something different. We are going to save all the links on the webpage into a dictionary.

In [7]:
# from selenium import webdriver 
# from selenium.webdriver.common.by import By 
# from selenium.webdriver.chrome.service import Service as ChromeService 
# from webdriver_manager.chrome import ChromeDriverManager 
 
# instantiate options 
# options = webdriver.ChromeOptions() 
 
# run browser in headless mode (in Selenium 4, pass the flag as an argument)
# options.add_argument('--headless')
 
# instantiate driver 
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) 
 
# load website 
# url = 'https://angular.dev' 
 
# get the entire website content 
# driver.get(url) 
# create an empty dictionary to hold the links
# link = {}

# counter used as the dictionary key
# j = 0

# select all <a> elements by tag name
# a_elements = driver.find_elements(By.TAG_NAME, "a")
# for i in a_elements:
#     # extract the link (href attribute) from the element
#     link[j] = i.get_attribute("href")
#     # print the link
#     print(link[j])
#     j = j + 1
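
If you run this locally, it is worth filtering out empty hrefs and de-duplicating the links before saving them. A minimal sketch, again as comments; the output filename is just an assumption.

# keep only non-empty, unique links, sorted for readability
# links = sorted({href for href in link.values() if href})

# save them for later use
# import pandas as pd
# pd.DataFrame(links, columns=['url']).to_csv('angular_dev_links.csv', index=False)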

3. Data Scraping Using an API

In this example we will gather some data about different countries from Wikipedia, as it holds a lot of relevant information. We will use the MediaWiki API to retrieve that data.

In [8]:
# install required tools
%pip install wptools
%pip install wikipedia
%pip install wordcloud
%pip install chardet
Collecting wptools
  Obtaining dependency information for wptools from https://files.pythonhosted.org/packages/e2/5c/0d8af5532e44477edeb3dac81d3a611ea75827a18b6b4068c3cc2188bfe5/wptools-0.4.17-py2.py3-none-any.whl.metadata
  Downloading wptools-0.4.17-py2.py3-none-any.whl.metadata (14 kB)
Requirement already satisfied: certifi in /opt/conda/lib/python3.10/site-packages (from wptools) (2023.11.17)
Collecting html2text (from wptools)
  Downloading html2text-2024.2.26.tar.gz (56 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 kB 3.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... - done
Requirement already satisfied: lxml in /opt/conda/lib/python3.10/site-packages (from wptools) (5.1.0)
Collecting pycurl (from wptools)
  Obtaining dependency information for pycurl from https://files.pythonhosted.org/packages/64/d2/a4c45953aed86f5a0c9717421dd725ec61acecd63777dd71dfe3d50d3e16/pycurl-7.45.3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata
  Downloading pycurl-7.45.3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Downloading wptools-0.4.17-py2.py3-none-any.whl (38 kB)
Downloading pycurl-7.45.3-cp310-cp310-manylinux_2_28_x86_64.whl (4.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.6/4.6 MB 50.6 MB/s eta 0:00:00
Building wheels for collected packages: html2text
  Building wheel for html2text (setup.py) ... - \ done
  Created wheel for html2text: filename=html2text-2024.2.26-py3-none-any.whl size=33110 sha256=e3a1dab1c6814121534f5a3090897f19d214ce89228f0f4563826100bf9a8f64
  Stored in directory: /root/.cache/pip/wheels/f3/96/6d/a7eba8f80d31cbd188a2787b81514d82fc5ae6943c44777659
Successfully built html2text
Installing collected packages: pycurl, html2text, wptools
Successfully installed html2text-2024.2.26 pycurl-7.45.3 wptools-0.4.17
Note: you may need to restart the kernel to use updated packages.
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... - done
Requirement already satisfied: beautifulsoup4 in /opt/conda/lib/python3.10/site-packages (from wikipedia) (4.12.2)
Requirement already satisfied: requests<3.0.0,>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from wikipedia) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (2023.11.17)
Requirement already satisfied: soupsieve>1.2 in /opt/conda/lib/python3.10/site-packages (from beautifulsoup4->wikipedia) (2.3.2.post1)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... - \ done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=de134084d8fe39645f958e008b370f2bd34e86e245f3807d2e080813b3ca87d0
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: wordcloud in /opt/conda/lib/python3.10/site-packages (1.9.2)
Requirement already satisfied: numpy>=1.6.1 in /opt/conda/lib/python3.10/site-packages (from wordcloud) (1.24.3)
Requirement already satisfied: pillow in /opt/conda/lib/python3.10/site-packages (from wordcloud) (9.5.0)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from wordcloud) (3.7.4)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (4.42.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (21.3)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
Collecting chardet
  Obtaining dependency information for chardet from https://files.pythonhosted.org/packages/38/6f/f5fbc992a329ee4e0f288c1fe0e2ad9485ed064cac731ed2fe47dcc38cbf/chardet-5.2.0-py3-none-any.whl.metadata
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.4/199.4 kB 5.1 MB/s eta 0:00:00
Installing collected packages: chardet
Successfully installed chardet-5.2.0
Note: you may need to restart the kernel to use updated packages.
In [9]:
# import required libraries

import json
import wptools
import wikipedia
import pandas as pd
import chardet

# checking the installed version
print('wptools version : {}'.format(wptools.__version__))
wptools version : 0.4.17

I am going to use a list with all the country names, which I found online; you can easily find free datasets like this in many repositories. I uploaded it to my notebook, but I couldn't read it with pandas, most likely because of the file's encoding. So, I used the chardet library in order to read this csv file.

The chardet library reads the file in binary mode and tries to detect the encoding based on the byte sequences in the file. Once the encoding is detected, we pass it to the encoding parameter of the pd.read_csv() function.

In [10]:
# read csv file
with open('/kaggle/input/list-of-countries/list-of-countries.csv', 'rb') as f:
    result = chardet.detect(f.read())
    
df = pd.read_csv('/kaggle/input/list-of-countries/list-of-countries.csv', encoding=result['encoding'])            
df.head()
Out[10]:
Common Name Formal Name (if different)
0 Afghanistan Islamic Republic of Afghanistan
1 Albania Republic of Albania
2 Algeria People's Democratic Republic of Algeria
3 Andorra Principality of Andorra
4 Angola Republic of Angola

This list is used here only for learning purposes, so I am going to use only the first 10 countries from it. But go ahead and use the whole list if you want.

In [11]:
# number of countries we are interested in
No_countries = 10

# only selecting the first 10 countries of the list                        
df_10 = df.iloc[:No_countries, :].copy()
# converting the column to a list 
countries = df_10['Common Name'].tolist()
In [12]:
# looping through the list of 10 countries
for i, j in enumerate(countries):   
    print('{}. {}'.format(i+1, j))
1. Afghanistan
2. Albania
3. Algeria
4. Andorra
5. Angola
6. Antigua and Barbuda 
7. Argentina
8. Armenia
9. Australia
10. Austria

One issue with matching the 10 countries from our list to their Wikipedia article names is that the two may not be exactly the same, i.e. match character for character; there can be slight variations in the names.

To overcome this problem and make sure every country name is paired with its corresponding Wikipedia article, we will use the wikipedia package to get search suggestions for each country name and its Wikipedia equivalent.

In [13]:
# searching for suggestions in wikipedia
wiki_search = [{country : wikipedia.search(country)} for country in countries]

for idx, country in enumerate(wiki_search):
    for i, j in country.items():
        print('{}. {} :\n{}'.format(idx+1, i ,', '.join(j)))
        print('\n')
1. Afghanistan :
Afghanistan, War in Afghanistan (2001–2021), Soviet–Afghan War, Afghan, Afghans, Provinces of Afghanistan, President of Afghanistan, Economy of Afghanistan, Taliban, History of Afghanistan


2. Albania :
Albania, Albanians, History of Albania, Albanian language, Albania national football team, Albanian, People's Socialist Republic of Albania, Flag of Albania, Caucasian Albania, Italian protectorate of Albania (1939–1943)


3. Algeria :
Algeria, Algerian War, French Algeria, Constantine, Algeria, Algeria national football team, Algerians, Communes of Algeria, List of cities in Algeria, Algerian, Women in Algeria


4. Andorra :
Andorra, Andorran, Andorra la Vella, FC Andorra, Co-princes of Andorra, Andorra national football team, Andorra (disambiguation), History of Andorra, Andorra (album), Economy of Andorra


5. Angola :
Angola, Angolan, Louisiana State Penitentiary, Angolan Civil War, President of Angola, Provinces of Angola, MPLA, List of cities and towns in Angola, Angola national football team, TAAG Angola Airlines


6. Antigua and Barbuda  :
Antigua and Barbuda, Antigua, St. John's, Antigua and Barbuda, Flag of Antigua and Barbuda, Barbuda, Antigua and Barbuda national football team, History of Antigua and Barbuda, Monarchy of Antigua and Barbuda, Music of Antigua and Barbuda, Prime Minister of Antigua and Barbuda


7. Argentina :
Argentina, Argentines, Argentina national football team, Buenos Aires, Argentine Primera División, Greater argentine, Provinces of Argentina, Time in Argentina, Argentine Navy, Demographics of Argentina


8. Armenia :
Armenia, Armenians, Armenian Apostolic Church, Armenian, Nagorno-Karabakh conflict, History of Armenia, Armenian language, Kingdom of Armenia (antiquity), Armenian genocide, Foreign relations of Armenia


9. Australia :
Australia, Western Australia, Indigenous Australians, South Australia, Australia national cricket team, History of Australia, States and territories of Australia, Paramount Networks UK & Australia, Australians, List of the busiest airports in Australia


10. Austria :
Austria, Austria-Hungary, Austrian Empire, 2024 Austrian legislative election, Fugging, NEOS (Austria), Austrians, List of cities and towns in Austria, History of Austria, Habsburg monarchy


Now let's keep the most probable match (the first suggestion) for each of the 10 countries.

In [14]:
most_probable = [(country, wiki_search[i][country][0]) for i, country in enumerate(countries)]
countries = [x[1] for x in most_probable]

print('Most Probable: \n', most_probable,'\n')

# print final list
print('Final List: \n', countries)
Most Probable: 
 [('Afghanistan', 'Afghanistan'), ('Albania', 'Albania'), ('Algeria', 'Algeria'), ('Andorra', 'Andorra'), ('Angola', 'Angola'), ('Antigua and Barbuda ', 'Antigua and Barbuda'), ('Argentina', 'Argentina'), ('Armenia', 'Armenia'), ('Australia', 'Australia'), ('Austria', 'Austria')] 

Final List: 
 ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria']

Now that we have mapped the country names to their corresponding Wikipedia articles, let's retrieve the infobox data from those pages.

wptools provides easy-to-use methods that call the MediaWiki API on our behalf and fetch all the Wikipedia data for us. Let's try retrieving the data for Afghanistan, the first name on the list.

In [15]:
# parses the wikipedia article
page = wptools.page('Afghanistan')
page.get_parse()   
page.data.keys()
en.wikipedia.org (parse) Afghanistan
Afghanistan (en) data
{
  infobox: <dict(79)> conventional_long_name, common_name, native_...
  iwlinks: <list(11)> https://commons.wikimedia.org/wiki/%D8%A7%D9...
  pageid: 737
  parsetree: <str(403055)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Afghanistan
  wikibase: Q889
  wikidata_url: https://www.wikidata.org/wiki/Q889
  wikitext: <str(326086)> {{Short description|Country in Central A...
}
Out[15]:
dict_keys(['requests', 'iwlinks', 'pageid', 'wikitext', 'parsetree', 'infobox', 'title', 'wikibase', 'wikidata_url'])

As we can see from the output above, wptools successfully retrieved the Wikipedia and Wikidata content corresponding to the query Afghanistan. We only want the data from the infobox, so let's retrieve it and see what we want to keep.

In [16]:
page.data['infobox']
Out[16]:
{'conventional_long_name': 'Islamic Emirate of Afghanistan',
 'common_name': 'Afghanistan',
 'native_name': '{{unbulleted list|native name|ps|د افغانستان اسلامي امارت|italic|=|no|<br />|small|transliteration|ps|Də Afġānistān Islāmī Imārat|native name|prs|امارت اسلامی افغانستان|italic|=|no|<br />|small|transliteration|prs|Imārat-i Islāmī-yi Afğānistān}} {{native name|ps|د افغانستان اسلامي امارت|italic|=|no}} <br /> {{small|transliteration|ps|Də Afġānistān Islāmī Imārat}} {{transliteration|ps|Də Afġānistān Islāmī Imārat}} {{native name|prs|امارت اسلامی افغانستان|italic|=|no}} <br /> {{small|transliteration|prs|Imārat-i Islāmī-yi Afğānistān}} {{transliteration|prs|Imārat-i Islāmī-yi Afğānistān}}',
 'image_flag': 'Flag of Taliban.svg',
 'flag_caption': 'Flag',
 'image_coat': 'Arms of the Islamic Emirate of Afghanistan.svg',
 'alt_coat': 'Coat of Arms of the Islamic Emirate',
 'symbol_type': '[[Emblem of Afghanistan|Emblem]]',
 'national_motto': '{{lang|ar|لا إله إلا الله، محمد رسول الله}} <br/> {{transliteration|ar|Lā ʾilāha ʾillā llāh, Muhammadun rasūlu llāh}} <br/>\n"There is no god but [[God in Islam|Allah]]; [[Muhammad]] is the messenger of Allah." (\'\'[[Shahadah]]\'\')',
 'national_anthem': '{{lang|ps|دا د باتورانو کور}} <br />" {{transliteration|ps|Dā Də Bātorāno Kor}} "<br />"[[This Is the Home of the Brave]]" <br>',
 'image_map': "{{switcher|[[File:Afghanistan (orthographic projection).svg|upright=1.15|frameless]]|Afghanistan on the globe|[[File:Afghanistan - Location Map (2013) - AFG - UNOCHA.svg|upright=1.15|frameless]]|Afghanistan's neighbors and towns}}",
 'capital': '[[Kabul]]',
 'coordinates': '{{Coord|34|31|N|69|11|E|region:AF_source:geonames|display|=|inline,title}}',
 'largest_city': 'Kabul',
 'official_languages': '{{hlist|[[Pashto]]|[[Dari]]}}',
 'ethnic_groups': '{{unbulleted list\n | 42% [[Pashtun]]\n | 27% [[Tajiks|Tajik]]\n | |figure space|9% [[Hazaras|Hazara]]\n | |figure space|9% [[Uzbeks|Uzbek]]\n | |figure space|4% [[Aimaq people|Aimaq]]\n | |figure space|3% [[Turkmen people|Turkmen]]\n | |figure space|2% [[Baloch people|Baloch]]\n | |figure space|4% [[Ethnic groups in Afghanistan|other]]}} {{figure space}} 9% [[Hazaras|Hazara]] {{figure space}} 9% [[Uzbeks|Uzbek]] {{figure space}} 4% [[Aimaq people|Aimaq]] {{figure space}} 3% [[Turkmen people|Turkmen]] {{figure space}} 2% [[Baloch people|Baloch]] {{figure space}} 4% [[Ethnic groups in Afghanistan|other]]',
 'ethnic_groups_ref': '{{efn|The last census in Afghanistan was conducted in 1979, and was itself incomplete. Due to the [[Afghan conflict|ongoing conflict]] in the country, no official census has been conducted since.|ref| name="Population Matters"}}',
 'ethnic_groups_year': '2019 unofficial estimates',
 'religion': '{{unbulleted list\n | 99.7% [[Islam in Afghanistan|Islam]] ([[State religion|official]])\n | 0.3% [[Demographics of Afghanistan#Religion|other]]}}',
 'religion_year': '2015',
 'demonym': '[[Afghans|Afghan]] {{Efn|Other demonyms that have been used are Afghani,|ref|Dictionary.com. [[The American Heritage Dictionary of the English Language]], Fourth Edition. Houghton Mifflin Company, 2004. [http://dictionary.reference.com/browse/afghani Reference.com] {{Webarchive|url=https://web.archive.org/web/20160303185738/http://dictionary.reference.com/browse/afghani |date=3 March 2016 }} (Retrieved 13 November 2007).|</ref>| Afghanese and Afghanistani (see [[Afghans]] for further details)|ref|Dictionary.com. [[WordNet]] 3.0. [[Princeton University]]. [http://dictionary.reference.com/browse/afghanistani Reference.com] (Retrieved 13 November 2007). {{webarchive |url=https://web.archive.org/web/20140328102257/http://dictionary.reference.com/browse/afghanistani |date=28 March 2014}}|</ref>|name|=|"Demonym"|group|=|"Note"}} Afghanese and Afghanistani (see [[Afghans]] for further details)',
 'government_type': 'Unitary [[totalitarian]] provisional [[theocratic]] Islamic [[emirate]]',
 'leader_title1': '[[Supreme Leader of Afghanistan|Supreme Leader]]',
 'leader_name1': '{{nowrap|[[Hibatullah Akhundzada]]}}',
 'leader_title2': '[[Prime Minister of Afghanistan|Prime Minister]]',
 'leader_name2': '[[Hasan Akhund]] ([[Acting prime minister|acting]])',
 'leader_title3': '[[Chief Justice of Afghanistan|Chief Justice]]',
 'leader_name3': '[[Abdul Hakim Haqqani]]',
 'legislature': "None {{efn|Afghanistan is a pure [[autocracy]], with all law ultimately originating from the supreme leader. Consensus rule was initially used among the Taliban, but was phased out as the supreme leader monopolized control in the months following the 2021 return to power.|ref|{{cite web |author1=T. S. Tirumurti |title=Letter dated 25 May 2022 from the Chair of the Security Council Committee established pursuant to resolution 1988 (2011) addressed to the President of the Security Council |url=https://digitallibrary.un.org/record/3975071/files/S_2022_419-EN.pdf?ln=en |publisher=[[United Nations Security Council]] |access-date=2 May 2023 |date=26 May 2022}}|</ref>|ref|{{cite news |last1=Kraemer |first1=Thomas |title=Afghanistan dispatch: Taliban leaders issue new orders on law-making process, enforcement of court orders from previous government |url=https://www.jurist.org/news/2022/11/afghanistan-dispatch-taliban-leaders-issue-new-orders-on-law-making-process-enforcement-of-court-orders-from-previous-government/ |access-date=1 May 2023 |work=[[JURIST]] |date=27 November 2022 |archive-date=17 January 2024 |archive-url=https://web.archive.org/web/20240117233605/https://www.jurist.org/news/2022/11/afghanistan-dispatch-taliban-leaders-issue-new-orders-on-law-making-process-enforcement-of-court-orders-from-previous-government/ |url-status=live }}|</ref>|ref|{{cite news |last1=Dawi |first1=Akmal |title=Unseen Taliban Leader Wields Godlike Powers in Afghanistan |url=https://www.voanews.com/a/unseen-taliban-leader-wields-godlike-powers-in-afghanistan-/7026112.html |access-date=1 May 2023 |publisher=[[Voice of America]] |date=28 March 2023 |archive-date=13 April 2023 |archive-url=https://web.archive.org/web/20230413041049/https://www.voanews.com/a/unseen-taliban-leader-wields-godlike-powers-in-afghanistan-/7026112.html |url-status=live }}|</ref>| There is an advisory [[Leadership Council of Afghanistan|Leadership Council]], however its role is in question as the supreme leader has not convened it for many months (|as of|lc|=|y|2023|03|post|=|),| and increasingly rules by decree.|ref|{{cite journal |author1=Oxford Analytica |author1-link=Oxford Analytica |title=Senior Afghan Taliban figures move to curb leader |journal=Expert Briefings |series=Emerald Expert Briefings |date=10 March 2023 |volume=oxan-db |issue=oxan-db |doi=10.1108/OXAN-DB276639 |quote=[Akhundzada] has not convened the Taliban's Leadership Council (a 'politburo' of top leaders and commanders) for several months. Instead, he relies on the narrower Kandahar Council of Clerics for legal advice.}}|</ref>}} There is an advisory [[Leadership Council of Afghanistan|Leadership Council]], however its role is in question as the supreme leader has not convened it for many months ( {{as of|lc|=|y|2023|03|post|=|),}} and increasingly rules by decree.",
 'sovereignty_type': '[[History of Afghanistan|Formation]]',
 'established_event1': '[[Hotak dynasty]]',
 'established_date1': '[[Mirwais Hotak|1709]]',
 'established_event2': '{{nowrap|[[Durrani Empire]]}}',
 'established_date2': '1747',
 'established_event3': '[[Emirate of Afghanistan|Emirate]]',
 'established_date3': '1823',
 'established_event4': '[[Dost Mohammad Khan|Dost Mohammad unites Afghanistan]]',
 'established_date4': '[[Herat Campaign of 1862–63|27 May 1863]]',
 'established_event5': '[[Third Anglo-Afghan War|Independence]]',
 'established_date5': '[[Afghan Independence Day|19 August 1919]]',
 'established_event6': '[[Kingdom of Afghanistan|Kingdom]]',
 'established_date6': '9 June 1926',
 'established_event7': '[[Republic of Afghanistan (1973–1978)|Republic]]',
 'established_date7': "[[1973 Afghan coup d'état|17 July 1973]]",
 'established_event8': '[[Democratic Republic of Afghanistan|Democratic Republic]]',
 'established_date8': '[[Saur Revolution|27–28 April 1978]]',
 'established_event9': '[[Islamic State of Afghanistan|Islamic State]]',
 'established_date9': '28 April 1992',
 'established_event10': '[[Islamic Emirate of Afghanistan (1996–2001)|Islamic Emirate]]',
 'established_date10': '27 September 1996',
 'established_event11': '{{nowrap|[[Islamic Republic of Afghanistan|Islamic Republic]]}}',
 'established_date11': '26 January 2004',
 'established_event12': '[[Fall of Kabul (2021)|Restoration of Islamic Emirate]]',
 'established_date12': '15 August 2021',
 'area_km2': '652,867',
 'area_rank': '40th',
 'area_sq_mi': '252,072',
 'percent_water': 'negligible',
 'population_estimate': '{{IncreaseNeutral}} 41,128,771',
 'population_estimate_year': '2023',
 'population_estimate_rank': '37th',
 'population_density_km2': '48.08',
 'population_density_sq_mi': '119',
 'GDP_PPP': '$81.007&nbsp;billion',
 'GDP_PPP_year': '2020',
 'GDP_PPP_per_capita': '$2,459',
 'GDP_nominal': '$20.136&nbsp;billion',
 'GDP_nominal_year': '2020',
 'GDP_nominal_per_capita': '$611',
 'HDI': '0.462',
 'HDI_year': '2022',
 'HDI_change': 'decrease',
 'HDI_rank': '182nd',
 'currency': '[[Afghan afghani|Afghani]] ( {{lang|prs|افغانى}} )',
 'currency_code': 'AFN',
 'time_zone': '[[Afghanistan Time]]',
 'utc_offset': '+4:30<br />[[Lunar Hijri calendar|Lunar Calendar]]',
 'DST_note': "''[[Daylight saving time|DST]] is not observed''",
 'cctld': '[[.af]]'}

We will define a list of features that we want to extract from the infoboxes, as follows.

In [17]:
# create an empty list
wiki_data = []
# attributes of interest contained within the wiki infoboxes
features = ['population_estimate', 'population_estimate_year', 'GDP_PPP', 'GDP_PPP_year', 'GDP_PPP_rank', 'GDP_PPP_per_capita',
        'GDP_PPP_per_capita_rank']

# fetching the data for all 10 countries
for country in countries:    
    page = wptools.page(country) # create a page object
    try:
        page.get_parse() # call the API and parse the data
        if page.data['infobox'] is not None:
            # if infobox is present
            infobox = page.data['infobox']
            # get data for the interested features/attributes
            data = { feature : infobox[feature] if feature in infobox else '' 
                         for feature in features }
        else:
            data = { feature : '' for feature in features }

        data['country_name'] = country
        wiki_data.append(data)

    except KeyError:
        pass
    
# checking the first entry of our list
wiki_data[0]
en.wikipedia.org (parse) Afghanistan
Afghanistan (en) data
{
  infobox: <dict(79)> conventional_long_name, common_name, native_...
  iwlinks: <list(11)> https://commons.wikimedia.org/wiki/%D8%A7%D9...
  pageid: 737
  parsetree: <str(403055)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Afghanistan
  wikibase: Q889
  wikidata_url: https://www.wikidata.org/wiki/Q889
  wikitext: <str(326086)> {{Short description|Country in Central A...
}
en.wikipedia.org (parse) Albania
Albania (en) data
{
  infobox: <dict(91)> conventional_long_name, native_name, common_...
  iwlinks: <list(25)> https://commons.wikimedia.org/wiki/Atlas_of_...
  pageid: 738
  parsetree: <str(352873)> <root><template><title>short descriptio...
  requests: <list(1)> parse
  title: Albania
  wikibase: Q222
  wikidata_url: https://www.wikidata.org/wiki/Q222
  wikitext: <str(275754)> {{short description|Country in Southeast...
}
en.wikipedia.org (parse) Algeria
Algeria (en) data
{
  infobox: <dict(75)> conventional_long_name, native_name, common_...
  iwlinks: <list(11)> https://commons.wikimedia.org/wiki/%D8%A7%D9...
  pageid: 358
  parsetree: <str(261239)> <root><template><title>short descriptio...
  requests: <list(1)> parse
  title: Algeria
  wikibase: Q262
  wikidata_url: https://www.wikidata.org/wiki/Q262
  wikitext: <str(209792)> {{short description|Country in North Afr...
}
en.wikipedia.org (parse) Andorra
Andorra (en) data
{
  infobox: <dict(74)> conventional_long_name, common_name, native_...
  iwlinks: <list(16)> https://ca.wikipedia.org/wiki/Castellers_d%2...
  pageid: 600
  parsetree: <str(196369)> <root><template><title>short descriptio...
  requests: <list(1)> parse
  title: Andorra
  wikibase: Q228
  wikidata_url: https://www.wikidata.org/wiki/Q228
  wikitext: <str(137746)> {{short description|Country in Europe}}{...
}
en.wikipedia.org (parse) Angola
Angola (en) data
{
  infobox: <dict(64)> conventional_long_name, common_name, native_...
  iwlinks: <list(14)> https://commons.wikimedia.org/wiki/Angola, h...
  pageid: 701
  parsetree: <str(214016)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Angola
  wikibase: Q916
  wikidata_url: https://www.wikidata.org/wiki/Q916
  wikitext: <str(171019)> {{Short description|Country on the west ...
}
en.wikipedia.org (parse) Antigua and Barbuda
Antigua and Barbuda (en) data
{
  infobox: <dict(77)> conventional_long_name, languages_type, lang...
  iwlinks: <list(11)> https://commons.wikimedia.org/wiki/Antigua_a...
  pageid: 951
  parsetree: <str(109361)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Antigua and Barbuda
  wikibase: Q781
  wikidata_url: https://www.wikidata.org/wiki/Q781
  wikitext: <str(81941)> {{Short description|Country in the Lesser...
}
en.wikipedia.org (parse) Argentina
Argentina (en) data
{
  infobox: <dict(82)> conventional_long_name, native_name, common_...
  iwlinks: <list(19)> https://commons.wikimedia.org/wiki/Argentina...
  pageid: 18951905
  parsetree: <str(392630)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Argentina
  wikibase: Q414
  wikidata_url: https://www.wikidata.org/wiki/Q414
  wikitext: <str(255647)> {{Short description|Country in South Ame...
}
en.wikipedia.org (parse) Armenia
Armenia (en) data
{
  infobox: <dict(107)> conventional_long_name, common_name, native...
  iwlinks: <list(14)> https://commons.wikimedia.org/wiki/%D5%80%D5...
  pageid: 10918072
  parsetree: <str(289933)> <root><template><title>short descriptio...
  requests: <list(1)> parse
  title: Armenia
  wikibase: Q399
  wikidata_url: https://www.wikidata.org/wiki/Q399
  wikitext: <str(232051)> {{short description|Country in West Asia...
}
en.wikipedia.org (parse) Australia
Australia (en) data
{
  infobox: <dict(75)> conventional_long_name, common_name, image_f...
  iwlinks: <list(11)> https://commons.wikimedia.org/wiki/Atlas_of_...
  pageid: 4689264
  parsetree: <str(335972)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Australia
  wikibase: Q408
  wikidata_url: https://www.wikidata.org/wiki/Q408
  wikitext: <str(273829)> {{Short description|Country in Oceania}}...
}
en.wikipedia.org (parse) Austria
Austria (en) data
{
  infobox: <dict(82)> conventional_long_name, common_name, native_...
  iwlinks: <list(10)> https://commons.wikimedia.org/wiki/%C3%96ste...
  pageid: 26964606
  parsetree: <str(229693)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: Austria
  wikibase: Q40
  wikidata_url: https://www.wikidata.org/wiki/Q40
  wikitext: <str(180772)> {{Short description|Country in Central E...
}
Out[17]:
{'population_estimate': '{{IncreaseNeutral}} 41,128,771',
 'population_estimate_year': '2023',
 'GDP_PPP': '$81.007&nbsp;billion',
 'GDP_PPP_year': '2020',
 'GDP_PPP_rank': '',
 'GDP_PPP_per_capita': '$2,459',
 'GDP_PPP_per_capita_rank': '',
 'country_name': 'Afghanistan'}

And for the last part, let's save the data into a json file for later use. The next step, of course, is wrangling and cleaning our dataset. Enjoy!

In [18]:
with open('infoboxes.json', 'w') as file:
    json.dump(wiki_data, file)
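
To give a taste of that wrangling step, here is a minimal sketch that reloads the json file and strips the wiki markup from the population figures; the regex is an assumption based on the infobox values we saw above (e.g. '{{IncreaseNeutral}} 41,128,771').

import json
import re
import pandas as pd

with open('infoboxes.json') as file:
    wiki_data = json.load(file)

df_wiki = pd.DataFrame(wiki_data)

# '{{IncreaseNeutral}} 41,128,771' -> 41128771
def parse_population(value):
    match = re.search(r'\d[\d,]*', value)
    return int(match.group().replace(',', '')) if match else None

df_wiki['population_estimate'] = df_wiki['population_estimate'].apply(parse_population)
df_wiki[['country_name', 'population_estimate']].head()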

Legal use of web scraping

There is a lot of misinformation about web scraping. Many people believe it is illegal, but that is generally only the case when you scrape personal information, such as someone's private messages. There are some rules to follow when you are scraping data.

Scraping Rules

  1. You should check a website’s Terms and Conditions before you scrape it, and read the statements about the legal use of data carefully. Usually, the data you scrape should not be used for commercial purposes. Start by reading the terms of service, terms of use, or the website’s robots.txt file (you can usually find it by appending /robots.txt to the site’s home page URL); many websites explicitly state whether web scraping is allowed or prohibited, and following these terms is essential for legal scraping. A small example of this check appears after this list.
  2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request per webpage per second is good practice.
  3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
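
As a quick illustration of rules 1 and 2, here is a minimal sketch that checks robots.txt before fetching and pauses between requests; the Worldometers page from earlier is used as an assumed example.

import time
import requests
from urllib import robotparser

# parse the site's robots.txt
rp = robotparser.RobotFileParser('https://www.worldometers.info/robots.txt')
rp.read()

url = 'https://www.worldometers.info/population/countries-in-europe-by-population/'
if rp.can_fetch('*', url):
    response = requests.get(url)
    time.sleep(1)  # roughly one request per second, as suggested above
else:
    print('robots.txt disallows fetching this page')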

Consult Legal Advice: When in doubt, consult legal counsel who specializes in technology and internet law. They can provide advice specific to your situation and jurisdiction.