
1.0 An Introduction to Web Scraping using Python


Web scraping, sometimes called data mining, is the automated gathering of data from the internet. It is most commonly accomplished by writing a program that queries a web server, requests data (usually HTML and the other files that compose web pages), and then parses that data to extract the needed information.

When one machine (A) wants to communicate with another (B), the following happens:

  1. A sends out a stream of 1s and 0s, indicated by high and low voltages on a wire. These bits form a packet of information containing a header and a body. The header lists an immediate destination, the local router's MAC address, and a final destination, B's IP address. The body contains A's request for B's server application.
  2. A's local router receives all these 1s and 0s and interprets them as a packet from A's own MAC address, destined for B's IP address. The router stamps its own IP address on the packet as the "from" address and sends it off across the internet.
  3. A's packet traverses several intermediary servers, which direct it along the correct physical/wired path toward B's server.
  4. B's server receives the packet at its IP address.
  5. B's server reads the packet's port destination in the header and passes it off to the appropriate application, in this case the web server application. (The port destination is almost always port 80 for web applications; it can be thought of as an apartment number for packet data, whereas the IP address is like the street address.)
  6. The web server application receives a stream of data from the server's processor. This data says something like: "This is a GET request. The following file is requested: index.html."
  7. The web server locates the correct HTML file, bundles it up into a new packet addressed to A, and sends it through its local router for transport back to A by the same process.
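Steps 5 through 7 above can be sketched at the socket level. The sketch below builds the kind of GET request described in step 6 and sends it to port 80; the host name `example.com` and the `fetch` helper are illustrative stand-ins, not part of the original article.

```python
import socket

# The request described in step 6: a GET for index.html, sent as plain text.
# HTTP headers and the blank line that ends them are separated by CRLF.
request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: example.com\r\n"       # example.com is a placeholder host
    "Connection: close\r\n"
    "\r\n"
)

def fetch(host: str, port: int = 80) -> bytes:
    """Open a TCP connection to the given port, send the request,
    and read back the server's reply (steps 5 and 7)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)
```

In practice you would rarely work at this level; libraries such as urllib build the request, manage the connection, and parse the response for you.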

The browser is a convenient tool that does all of this at once. At ground level, though, the browser simply tells the processor to send a request for data and then passes the response to an application for processing. In Python, the same can be done in three lines of code:
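The original code listing appears to have been lost in extraction; a minimal sketch of such a three-line request with `urllib`, using the URL named below, would look like this:

```python
from urllib.request import urlopen

# Request the page over HTTP and print its raw HTML
# (note: this performs a live network request)
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
```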

This outputs the complete HTML file for page1, located at the URL "http://pythonscraping.com/pages/page1.html".

urllib is a standard Python library (meaning you don’t have to install anything extra to run this example) and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent.
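As one example of changing that metadata, a custom User-Agent header can be attached with `urllib.request.Request`. The User-Agent string below is an arbitrary illustration, not a value from the original article.

```python
from urllib.request import Request, urlopen

# Build a request with a custom User-Agent header (an arbitrary example value)
req = Request(
    "http://pythonscraping.com/pages/page1.html",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/0.1)"},
)

# Passing the Request object to urlopen sends it with the custom header:
# html = urlopen(req).read()
```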

We will be extensively using urllib throughout this course…