Python Data Collection 1: A First Look at Web Scrapers

阿新 • Published: 2018-12-09

A First Look at Web Scrapers

Network Connections

Note

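When we visit Baidu (http://www.baidu.com/), typing the address and pressing Enter sets off the following sequence:

First, look up the IP address for the URL's host name in the local hosts file; if it is not there, ask a DNS server

Under the DNS protocol, your PC asks its local DNS server (usually the router) for Baidu's IP address. If the local server has the answer, that is the end of it; if not, the query is passed on to higher-level DNS servers until Baidu's IP address is found.

Find the server by its IP address and establish a TCP connection

Under the TCP protocol, establishing a connection takes a three-step handshake with the Baidu server: you tell the server you want to send it data (SYN), the server acknowledges and says it also wants to send you data (SYN, ACK), and you acknowledge the server (ACK). Three messages are exchanged in total, which is why it is called the three-way handshake.

Send the request that follows the host part of the URL to the server
The server sends the requested resource back to the client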
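
As a rough illustration of the steps above, the sketch below performs them by hand with Python's standard socket module. The helper names build_request and fetch_raw are made up for this example, and the request is a bare-bones HTTP/1.1 GET over plain HTTP. The urlopen call used in the rest of this article does all of this for you.

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close the socket when done
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, port=80, path="/", timeout=10):
    """Resolve the host, open a TCP connection, send a GET, return the raw reply."""
    # Step 1: name resolution. The OS resolver consults the hosts file, then DNS.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        # Step 2: connect() triggers the TCP three-way handshake.
        sock.connect(sockaddr)
        # Step 3: send the request to the server.
        sock.sendall(build_request(host, path))
        # Step 4: read the response until the server closes the connection.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_raw("www.baidu.com") would return the status line, headers and body as one bytes object.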

Let's see how Python does this

# -*- coding: utf-8 -*-
"""
Created on Sun Jan 21 18:47:08 2018

@author: szm
"""

from urllib.request import urlopen
html = urlopen("http://www.baidu.com")
print(html.read())

The result looks like this

b'<!DOCTYPE html>\n<!--STATUS OK-->\n 
\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r 
\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t    \r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t    \r\n\r\n\r\n\r\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n\n<html>\n<head>\n    \n    <meta http-equiv="content-type" content="text/html;charset=utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta content="always" name="referrer">\n    <meta name="theme-color" content="#2932e1">\n    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />\n    <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" />\n    <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">\n\t\n\t\n\t<link rel="dns-prefetch" href="//s1.bdstatic.com"/>\n\t<link rel="dns-prefetch" href="//t1.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t2.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t3.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t10.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t11.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t12.baidu.com"/>\n\t<link rel="dns-prefetch" href="//b1.bdstatic.com"/>\n    \n    <title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\n    \r\n\r\n<style id="css_index" index="index" type="text/css">html,body{height:100%}\nhtml{overflow-y:auto}\nbody{font:12px 
arial;text-align:;background:#fff}\nbody,p,form,ul,li{margin:0;padding:0;list-style:none}\nbody,form,#fm{position:relative}\ntd{text-align:left}\nimg{border:0}\na{color:#00c}\na:active{color:#f60}\ninput{border:0;padding:0}\n#wrapper{position:relative;_position:;min-height:100%}\n#head{padding-bottom:100px;text-align:center;*z-index:1}\n#ftCon{height:50px;position:absolute;bottom:47px;text-align:left;width:100%;margin:0 auto;z-index:0;overflow:hidden}\n.ftCon-Wrapper{overflow:hidden;margin:0 auto;text-align:center;*width:640px}\n.qrcodeCon{text-align:center;position:absolute;bottom:140px;height:60px;width:100%}\n#qrcode{display:inline-block;*float:left;*margin-top:4px}\n#qrcode .qrcode-item{float:left}\n#qrcode .qrcode-item-2{margin-left:33px}\n#qrcode .qrcode-img{width:60px;height:60px}\n#qrcode .qrcode-item-1 .qrcode-img{background:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/zbios_efde696.png) 0 0 no-repeat}\n#qrcode .qrcode-item-2 .qrcode-img{background:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/nuomi_365eabd.png) 0 0 no-repeat}\n@media only screen and (-webkit-min-device-pixel-ratio:2){#qrcode .qrcode-item-1 .qrcode-img{background-image:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/zbios_x2_9d645d9.png);background-size:60px 60px}\n#qrcode .qrcode-item-2 .qrcode-img{background-image:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/nuomi_x2_55dc5b7.png);background-size:60px 60px}}\n#qrcode .qrcode-text{color:#999;line-height:23px;margin:3px 0 0 5px}\n#qrcode .qrcode-text a{color:#999;text-decoration:none}\n#qrcode .qrcode-text p{text-align:left}\n#qrcode .qrcode-text b{color:#666;font-weight:700}\n#qrcode .qrcode-text span{letter-spacing:1px}\n#ftConw{display:inline-block;text-align:left;margin-left:33px;line-height:22px;position:relative;top:-2px;*float:right;*margin-left:0;*position:static}\n#ftConw,#ftConw 
a{color:#999}\n#ftConw{text-align:center;margin-left:0}\n.bg{background-image:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_5859e57.png);background-repeat:no-repeat;_background-image:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_d5b04cc.gif)}\n.c-icon{display:inline-block;width:14px;height

The response is very long, so only part of it is shown.

from urllib.request import urlopen

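This line looks up Python's request module (inside the urllib library) and imports just one function from it, urlopen.

urlopen opens and reads a remote object fetched over the network. Because it is so general (it can read HTML files, image files, or any other file stream with ease), we will use it frequently.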
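
Note that read() returns bytes, which is why the output above starts with b' and shows Chinese text as \xe7\x99... escapes. A minimal sketch of turning the body into text follows. The helper names decode_body and fetch_text are made up for this example, and falling back to UTF-8 when the server declares no charset is an assumption, not a rule.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Decode response bytes, assuming UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body as a str."""
    with urlopen(url, timeout=timeout) as response:
        # Charset from the Content-Type header, or None if the server sent none
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

For instance, decode_body(b"\xe7\x99\xbe\xe5\xba\xa6") gives the two characters of Baidu's Chinese name, 百度.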

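Introduction to BeautifulSoup

The BeautifulSoup library takes its name from the poem of the same name in Lewis Carroll's Alice's Adventures in Wonderland.

BeautifulSoup tries to make sense of the nonsensical. It formats and organizes messy web content by locating HTML tags, and presents the XML structure to us as easy-to-use Python objects.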

Installation

Linux

$ sudo apt-get install python3-bs4

Mac

$sudo easy_install pip

$pip install beautifulsoup4

Running it


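from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.baidu.com")
# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
bsObj = BeautifulSoup(html.read(), "html.parser")
# Print the first <img> tag found in the document
print(bsObj.img)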

The result looks like this

<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" usemap="#mp" width="270"/>

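As you can see, the <img> tag we extracted from the page is nested two levels deep in the structure of the BeautifulSoup object bsObj (html → body → img). When we pull the img tag out of the object, however, we can call it directly: bsObj.img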

bsObj.html.body.img
bsObj.body.img
bsObj.html.img

These calls also produce the same result.
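
To try this without depending on a live site, here is a small sketch that parses an inline HTML string. The sample markup and the variable names SAMPLE_HTML, soup and first_img are invented for the illustration; html.parser is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Invented sample page: one <img> nested inside html -> body -> div
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="logo"><img src="/img/logo.png" width="270" height="129"/></div>
    <p>Some text</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute-style access returns the first matching tag anywhere below
first_img = soup.img
print(first_img["src"])

# Longer paths reach the same tag
print(soup.html.body.img == first_img)
print(soup.body.img == first_img)
print(soup.html.img == first_img)
```

Attribute access always returns the first match, so the shorter forms are convenient but less specific when a page has several tags of the same name.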

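Exception Handling

The web is messy. Badly formatted data, servers that go down, and target tags that are missing all cause trouble. One of the most frustrating experiences in web scraping is to start a scraper, go to bed dreaming of a database full of data the next morning, and wake up to find that the scraper hit an unexpected data format and stopped shortly after you looked away from the screen. At that moment you may want to curse whoever invented the internet (and its odd data formats), but the person you should really blame is yourself, for not anticipating the exceptions in the first place!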

html = urlopen("http://www.baidu.com")

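Two main things can go wrong in this line:
- The page does not exist on the server (or an error occurs while retrieving it)
- The server does not exist

In the first case, an HTTP error is returned, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases, urlopen raises an HTTPError exception, which can be handled like this: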

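from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.baidu.com")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan
else:
    # The program continues. Note: if the except block above already
    # returns or breaks, this else clause is unnecessary
    pass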

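If the server returns an HTTP error code, the program prints the error and does not run the code under the else clause.

The second kind of failure, a server that cannot be reached at all, is checked here by testing whether html is None. Be aware that in Python 3 urlopen raises URLError in that situation instead of returning None, so this test only helps when your own error handling sets html to None:

if html is None:
    print("URL is not found")
else:
    # The program continues
    pass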
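
The two failure modes listed above can be folded into one helper. This is only a sketch: the name open_page and the choice to return None on failure are assumptions made for the example. It relies on the fact that HTTPError is a subclass of URLError, so the more specific exception is caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def open_page(url, timeout=10):
    """Return the open response, or None if the page could not be fetched."""
    try:
        return urlopen(url, timeout=timeout)
    except HTTPError as e:
        # The server was reached but answered with an error status (404, 500, ...)
        print("HTTP error:", e.code, e.reason)
    except URLError as e:
        # The server could not be reached (bad host name, refused connection, ...)
        print("Server not reachable:", e.reason)
    return None
```

With this helper, a check such as `if html is None` after `html = open_page(url)` covers both cases.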

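If the tag you ask for does not exist, BeautifulSoup returns a None object. If you then try to access a child tag on that None object, an AttributeError is raised.

The following line (nonExistentTag is a made-up tag that does not exist in the BeautifulSoup object)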

print(bsObj.nonExistentTag)

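returns a None object. It is essential to check for and handle this object. If you skip the check and access a child tag of the None object directly, you are in trouble, as shown below.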

print(bsObj.nonExistentTag.someTag)

This raises an exception:


AttributeError: 'NoneType' object has no attribute 'someTag'

Both situations can be guarded against explicitly:

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
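
When tag chains get longer, the same check can be moved into a small helper. The sketch below is one possible approach, not part of BeautifulSoup: the name dig is invented, and the SimpleNamespace objects are stand-ins that only mimic the shape of a parsed page so the example runs without a network connection. It assumes, as described above, that a missing tag shows up as None.

```python
from types import SimpleNamespace


def dig(node, *tag_names):
    """Follow a chain of child tags, returning None as soon as one is missing."""
    for name in tag_names:
        if node is None:
            return None
        # A default of None avoids AttributeError on objects without that attribute
        node = getattr(node, name, None)
    return node


# Stand-in for a parsed page: body contains an img but no h1
page = SimpleNamespace(
    body=SimpleNamespace(img="<img src='logo.png'/>", h1=None)
)

print(dig(page, "body", "img"))         # the img stand-in
print(dig(page, "body", "h1", "span"))  # None, no exception raised
```

With a real BeautifulSoup object the call would look like dig(bsObj, "body", "img").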

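At first glance all this checking and error handling looks cumbersome, but with a little reorganization the code becomes easier to write and, more importantly, easier to read. For example, the code below is another way of writing the scraper above: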


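from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTPError: the server answered with an error status (404, 500, ...)
        # URLError: the server could not be reached at all
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # The first <img> under <body> stands in for the page title here
        title = bsObj.body.img
    except AttributeError as e:
        # Raised when bsObj.body is None, i.e. the page has no <body>
        return None
    return title


title = getTitle("http://www.baidu.com")
if title is None:
    print("Title could not be found")
else:
    print(title)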

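In this example we define a getTitle function that returns the page's title (in this code, the first img tag under body), or None if something goes wrong while fetching the page. Inside getTitle we check for HTTPError as before, then wrap the two BeautifulSoup lines in a single try statement. Either of those lines can raise AttributeError, for example when bsObj.body is None. (Note that a nonexistent server does not make html a None object in Python 3; urlopen raises URLError instead, so that exception has to be caught as well.) A try block can hold as many lines as you like, or a call to a function that may raise AttributeError at any point.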

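# Optional: silence all warnings, including the one BeautifulSoup emits when no parser is specified
import warnings
warnings.filterwarnings("ignore")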
