
Dataset creation and cleaning: Web Scraping using Python

In my last article, I discussed generating a dataset using an Application Programming Interface (API) and Python libraries. APIs let us draw very useful information from a website in an easy manner. However, not all websites have APIs, which makes it difficult to gather relevant data. In such a case, we can use web scraping to access a website’s content and create our dataset.

Web Scraping is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer or to a database in table (spreadsheet) format. — WebHarvy

Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can also limit ourselves to collecting a large amount of information from a single source and using it as a dataset. In this particular example, we’ll explore Wikipedia. I’ll also explain the HTML basics we need. The complete project is available as a Notebook in the GitHub repository Web Scraping using Python.

This example is for demonstration purposes only. We must always check and follow a website’s guidelines before scraping it or using its data for any commercial purpose.

This is a two-part article. In this first part, we’ll explore how to get the data from the website using BeautifulSoup, and in the second part, we’ll clean the collected dataset.

Determine the content

“man drawing on dry-erase board” by Kaleidico on Unsplash

We’ll access the List of countries and dependencies by population Wikipedia webpage. The webpage includes a table with the names of countries, their population, the date of data collection, their percentage of world population and the source. If we go to any country’s page, it describes the country in detail and has a standard information box on the right. This box includes a lot of information such as total area, water percentage, GDP etc.

Here, we will combine the data from these two webpages into one dataset.

  1. List of Countries: On accessing the first page, we’ll extract the list of countries, their population and percentage of world population.
  2. Country: We’ll then access each country’s page, and get information including total area, percentage water, and GDP (nominal).

Thus, our final dataset will include information about each country.

HTML Basics

Each webpage that you view in your browser is structured in HyperText Markup Language (HTML). It has two parts: the head, which includes the title and any imports for styling and JavaScript, and the body, which includes the content that gets displayed as a webpage. We’re interested in the body of the webpage.

HTML is composed of tags. A start tag is written as the tag name between an opening < and a closing > angle bracket, while an end tag has a forward slash / right after the opening angle bracket. For example, <div></div>, <p>Some text</p> etc.

Homepage.html as an example

There are two direct ways to access any element (tag) present on the webpage. We can use its id, which is unique, or we can use a class, which can be associated with multiple elements. Here, the <div> has the attribute id set to base, which acts as a reference to this element, while all table cells marked by td share the same class called data.
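The Homepage.html file itself is not reproduced here, so the snippet below is only a minimal stand-in, assuming a page laid out as described: a <div> with the id base and table cells sharing the class data. The markup and cell values are illustrative and are not the original file. It also shows how the two access routes look in BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for Homepage.html; the markup is an assumption
# based on the description in the text, not the original file.
homepage_html = """
<html>
  <head><title>Homepage</title></head>
  <body>
    <div id="base">
      <p>Some text</p>
      <table>
        <tr><td class="data">Cell 1</td><td class="data">Cell 2</td></tr>
      </table>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(homepage_html, "html.parser")

# An id is unique, so find() returns the single matching element.
base_div = soup.find("div", id="base")

# A class can be shared, so find_all() returns every matching cell.
data_cells = soup.find_all("td", class_="data")
cell_texts = [cell.get_text() for cell in data_cells]
print(cell_texts)
```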

Generally useful tags include:

  1. <div>: A container that groups related content into a single block. It can act as the parent of many different elements, so style changes applied to it are also reflected in its child elements.
  2. <a>: Describes a link; the address of the webpage that loads when the link is clicked is given in its href attribute.
  3. <p>: Used whenever information is to be displayed on the webpage as a block of text. Each such tag appears as its own paragraph.
  4. <span>: Used when information is to be displayed inline. When two such tags are placed side by side, they appear on the same line, unlike paragraph tags.
  5. <table>: Displays tabular data, where the data sits in cells formed by the intersection of rows and columns.

Import Libraries

We begin by importing the necessary libraries, namely numpy, pandas, urllib and BeautifulSoup.

  1. numpy: A very popular library that makes array operations very simple and fast.
  2. pandas: It helps us convert the data into a tabular structure, so we can manipulate it with the many efficient functions the library provides.
  3. urllib: We use this library to open the URL from which we would like to extract the data.
  4. BeautifulSoup: This library helps us get the HTML structure of the page that we want to work with. We can then use its functions to access specific elements and extract relevant information.
Import all libraries
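
The import cell is not shown above. A typical version might look like the following, assuming Python 3 and that BeautifulSoup is installed as the bs4 package; the aliases np and pd are common conventions rather than something the article specifies.

```python
# Array operations and tabular data handling.
import numpy as np
import pandas as pd

# urlopen fetches the page; BeautifulSoup parses the returned HTML.
from urllib.request import urlopen
from bs4 import BeautifulSoup
```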

Understand the data

Initially, we define just the basic function that reads the URL and extracts the HTML from it. We’ll introduce new functions as and when they are needed.

Function to get HTML of a webpage

In the getHTMLContent() function, we pass in the URL. We first open the URL using the urlopen method and then hand the response to BeautifulSoup, which builds the HTML structure using a parser. While there are many parsers available, in this example we use html.parser, the parser that ships with Python’s standard library. Then, we simply return the output, which we can use to extract our data.
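
A minimal sketch of such a function is shown below. The names getHTMLContent, urlopen and html.parser come from the description above, while the parameter name link and the use of a with block are assumptions of this sketch.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def getHTMLContent(link):
    """Open the URL and return its HTML parsed as a BeautifulSoup object."""
    # urlopen returns a file-like response; the with block closes it for us.
    with urlopen(link) as response:
        # BeautifulSoup reads the response and parses it with html.parser.
        soup = BeautifulSoup(response, "html.parser")
    return soup
```

Calling getHTMLContent() with the address of a page returns an object on which methods such as find() and find_all() can be used.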

We use this function to get the HTML content for the Wikipedia page of List of countries. We see that the countries are present in a table. So, we use the find_all() method to find all tables on the page. The parameter that we supply inside this function determines the element that it returns. As we require tables, we pass the argument as table and then iterate over all tables to identify the one we need.

We print each table with the prettify() function. This function makes the output more readable. Now, we need to analyse the output and see which table has the data we are searching for. After much inspection, we can see that the table with the class, wikitable sortable, has the data we need. Thus, our next step is to access this table and its data. For this, we will use the function find() which allows us to not only specify the element we are looking for but also specify its properties such as the class name.
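
The two lookups can be sketched as follows. To keep the example runnable offline, it parses a small stand-in page instead of the live Wikipedia page; the markup is an assumption made for illustration, and in the article the content object would come from getHTMLContent().

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the downloaded page (not real Wikipedia markup).
sample_page = """
<table class="navbox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td><a href="/wiki/China">China</a></td></tr>
</table>
"""
content = BeautifulSoup(sample_page, "html.parser")

# find_all("table") returns every <table> element on the page.
tables = content.find_all("table")
for table in tables:
    # prettify() re-indents the markup so each table is easier to inspect.
    print(table.prettify())

# find() returns only the first match and can filter on attributes
# such as the class name.
table = content.find("table", {"class": "wikitable sortable"})
```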

Print all country links

A table in HTML is composed of rows denoted by the tags <tr></tr>. Each row has cells, which can either be headings defined using <th></th> or data defined using <td></td>. Thus, to access each country’s webpage, we can get its link from the cells in the country column of the table (the second column). So, we iterate over all the rows in the table and read the second column’s data into the variable country_link. For each row, we extract the cells and get the a element in the second column (numbering in Python starts with 0, so the second column means cell[1]). Finally, we print all the links.
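
The loop described above might look like the sketch below. It runs on a small illustrative table rather than the live page; the variable country_link comes from the text, while cells, country_links and the sample markup are assumptions of this sketch.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the countries table (not real Wikipedia markup).
sample_table = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td><a href="/wiki/China">China</a></td></tr>
  <tr><td>2</td><td><a href="/wiki/India">India</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_table, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

country_links = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The header row holds only <th> cells, so it yields no <td> cells.
    if len(cells) > 1:
        # Index 1 is the second column, which holds the country link.
        country_link = cells[1].find("a")
        if country_link is not None:
            country_links.append(country_link.get("href"))

for link in country_links:
    print(link)
```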

The links do not include the base address, so whenever we access any of these links, we’ll add the site’s base address as a prefix.
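
One way to build the full address is urljoin from the standard library, as sketched below. The base address is not given in the text above, so the BASE_URL value here is an assumption based on the links being Wikipedia paths.

```python
from urllib.parse import urljoin

# Assumption: the base address is not stated in the text; this value is
# inferred from the links being relative Wikipedia paths.
BASE_URL = "https://en.wikipedia.org"

relative_link = "/wiki/India"

# urljoin combines the base address with the relative path.
full_link = urljoin(BASE_URL, relative_link)
print(full_link)
```

Plain string concatenation would also work here; urljoin simply avoids doubled or missing slashes.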

While the function I developed to extract the data from each country’s webpage might appear small, there have been many iterations for it before I finalised the function. Let’s explore it step by step.

Each country’s page includes an information box on the right which includes the Motto, Name, GDP, Area and other important features. So, first we identified this box by the same steps as before: it is a table with the class infobox geography vcard. Next, we define the variable additional_details to collect all the information we get from this page in an array, which we can then append to the list of countries dataset.
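
As a rough sketch of this first step, the snippet below locates the information box and initialises additional_details. It works on an illustrative stand-in page with placeholder values, so the markup and the headings list are assumptions, not the structure of any real country page.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a country page; values are placeholders.
sample_country_page = """
<table class="infobox geography vcard">
  <tr><th>Motto</th><td>(motto)</td></tr>
  <tr><th>Area</th><td>(area)</td></tr>
</table>
"""
content = BeautifulSoup(sample_country_page, "html.parser")

# Locate the information box by its class, as with the countries table.
table = content.find("table", {"class": "infobox geography vcard"})

# Values extracted from this page will be collected here.
additional_details = []

# Listing the row headings helps decide which rows to read later.
headings = [th.get_text(strip=True) for th in table.find_all("th")]
print(headings)
```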

When we enter the inspect mode of the Chrome browser (right click anywhere and select the Inspect option) on the country page, we can look at the classes for each heading in the table. We are interested in four fields: Total area and Water (%) under Area, and Total and Per capita under GDP (nominal).