Python Web Scraping: Scraping Dynamic Content

阿新 • Published: 2019-01-22

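The previous posts all dealt with scraping static web pages. Most mainstream websites today, however, rely on JavaScript for important functionality. With JavaScript, a page no longer delivers all of its content when it first loads, so much of what the browser displays never appears in the HTML source. Scraping such a page with the methods from the previous posts returns nothing. This post explains how to scrape dynamic pages like these.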

We will cover two ways to scrape data from dynamic pages:
① Reverse engineering the JavaScript
② Rendering the JavaScript

The outline of this post is shown in the figure below:

[Figure: outline of the post]

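As the right-hand side of the screenshot shows, Chrome's developer console displays the generated results. Let's try to scrape the countries data (the country names) with the method from the previous posts.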

import lxml.html
from new_chapter3.downloader import Downloader  # downloader class from the earlier posts

D = Downloader()
html = D('http://example.webscraping.com/places/default/search')
tree = lxml.html.fromstring(html)
# Select the links inside the results div
print(tree.cssselect('div#results a'))

Output:
[Figure: screenshot of the printed result]

So why did the scrape fail? Looking at the page source helps explain it.

[Figure: HTML source of the search page]

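Clearly, the element we are trying to scrape is empty in the page source. Chrome's developer console shows the current state of the page, that is, the page after JavaScript has dynamically loaded the search results, and that state never appears in the HTML source.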
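The difference between the two states can be reproduced offline. The following is a minimal sketch using only the standard library, not the lxml code from this post. Both HTML strings are made-up stand-ins (assumptions, not the site's real markup), and `ResultLinkCollector` and `result_links` are hypothetical helpers that collect the text of links inside `<div id="results">`.

```python
from html.parser import HTMLParser

# Hypothetical markup: what the server sends (results div still empty) ...
STATIC_HTML = ('<html><body><div id="results"></div>'
               '<script>/* fills #results later */</script></body></html>')
# ... and what the DOM might look like after JavaScript has run.
RENDERED_HTML = ('<html><body><div id="results">'
                 '<a href="/view/1">Afghanistan</a>'
                 '<a href="/view/2">Albania</a></div></body></html>')


class ResultLinkCollector(HTMLParser):
    """Collect the text of <a> tags nested inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # greater than 0 while inside the results div
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.div_depth or dict(attrs).get('id') == 'results':
                self.div_depth += 1
        elif tag == 'a' and self.div_depth:
            self.in_link = True
            self.links.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1] += data


def result_links(html):
    """Return the link texts found inside the results div."""
    parser = ResultLinkCollector()
    parser.feed(html)
    return parser.links


print(result_links(STATIC_HTML))    # the downloaded source: no links yet
print(result_links(RENDERED_HTML))  # the state the browser shows
```

A plain download only ever sees the first string, which is why the selector finds nothing.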

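A quick introduction to AJAX:
AJAX stands for "Asynchronous JavaScript and XML" (XML being a subset of the Standard Generalized Markup Language). It is a web development technique for building fast, interactive, dynamic web pages.
By exchanging small amounts of data with the server in the background, AJAX lets a page update asynchronously: part of the page can be updated without reloading the whole page.
A traditional page (one that does not use AJAX) has to reload the entire page to update its content.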

Reverse Engineering a Dynamic Web Page

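Some page data is loaded dynamically by JavaScript. To scrape it, we need to understand how the page loads it, a process called reverse engineering. In Chrome, open http://example.webscraping.com/places/default/search again, press F12 to open the developer tools, type the letter A in the name box and click Search. Then open the Network tab of the developer tools and look at the requests listed under XHR.

[Figure: XHR requests in Chrome's Network panel]

The search results come from an AJAX URL that returns JSON, so we can download that URL directly: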

import json
from new_chapter3.downloader import Downloader

D = Downloader()
# Each result page holds 5 countries here; page=0 is the first page, i.e. the first 5 countries.
html = D('http://example.webscraping.com/places/ajax/search.json?&search_term=a&page_size=5&page=0')
print(json.loads(html))

Printed result:
[Figure: screenshot of the printed JSON]

The records field of the result holds up to page_size country entries.
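As an illustration of that structure, here is a small sketch that parses a response body offline. The sample body is made up: only the key names `records`, `country` and `num_pages` are the ones the scripts in this post read, while the values and the helper `country_names` are assumptions for the example.

```python
import json

# Hypothetical response body; the real one contains more fields and more records.
SAMPLE_RESPONSE = '''
{"records": [{"country": "Afghanistan"}, {"country": "Albania"}],
 "num_pages": 3}
'''


def country_names(body):
    """Return (country names on this page, total number of pages)."""
    data = json.loads(body)
    names = [record['country'] for record in data['records']]
    return names, data['num_pages']


names, num_pages = country_names(SAMPLE_RESPONSE)
print(names, num_pages)
```

The page count is what lets a scraper know when to stop requesting further pages.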

By searching for each letter of the alphabet in turn, we can scrape all of the countries and then save the result to a file.

# -*- coding: utf-8 -*-

import json
import string
from new_chapter3.downloader import Downloader
from new_chapter3.mongo_cache import MongoCache


def main():
    template_url = 'http://example.webscraping.com/places/ajax/search.json?&search_term={}&page_size=10&page={}'
    countries = set()
    cache = MongoCache()
    cache.clear()  # start from an empty cache
    download = Downloader(cache=cache)

    # string.ascii_lowercase exists in both Python 2 and Python 3
    for letter in string.ascii_lowercase:
        page = 0
        while True:
            html = download(template_url.format(letter, page))
            try:
                ajax = json.loads(html)
            except ValueError as e:
                # the response was not valid JSON
                print(e)
                ajax = None
            else:
                for record in ajax['records']:
                    countries.add(record['country'])
            page += 1
            # stop on a bad response or after the last page
            if ajax is None or page >= ajax['num_pages']:
                break

    with open('countries.txt', 'w') as f:
        f.write('\n'.join(sorted(countries)))


if __name__ == '__main__':
    main()
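The pagination logic of this script can be tried without network access or MongoDB. The sketch below keeps the same loop but takes the download step as a parameter; `collect_countries`, `fake_fetch` and the canned pages are hypothetical names and data introduced only for this example.

```python
import json


def collect_countries(fetch, terms):
    """Gather country names for every search term, following pagination."""
    countries = set()
    for term in terms:
        page = 0
        while True:
            try:
                data = json.loads(fetch(term, page))
            except ValueError:
                break  # not valid JSON: give up on this term
            countries.update(record['country'] for record in data['records'])
            page += 1
            if page >= data['num_pages']:
                break  # last page reached
    return sorted(countries)


# Stand-in for the real download: two pages for 'a', one page for 'b'.
FAKE_PAGES = {
    ('a', 0): '{"records": [{"country": "Albania"}], "num_pages": 2}',
    ('a', 1): '{"records": [{"country": "Angola"}], "num_pages": 2}',
    ('b', 0): '{"records": [{"country": "Albania"}, {"country": "Belgium"}], "num_pages": 1}',
}


def fake_fetch(term, page):
    """Return a canned JSON body, or garbage for unknown requests."""
    return FAKE_PAGES.get((term, page), 'not json')


print(collect_countries(fake_fetch, 'abc'))
```

Using a set removes duplicates, since the same country can match several letters.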


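Note that when downloading this many pages you need to pause between requests; otherwise the server responds with a "too many requests" error. The Downloader used here already delays successive downloads from the same domain.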
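For readers without the earlier posts at hand, per-domain delay can be sketched roughly as follows. This is not the Downloader's actual code: the class name `Throttle`, its parameters and the injectable `clock` and `sleep` arguments are assumptions made so that the idea can be shown and tested in isolation.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.time, sleep=time.sleep):
        self.delay = delay      # minimum seconds between hits on one domain
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.last_seen = {}     # domain -> time of the last request

    def wait(self, url):
        """Sleep if this domain was hit too recently; return seconds slept."""
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        slept = 0.0
        if self.delay > 0 and last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                slept = remaining
        self.last_seen[domain] = self.clock()
        return slept
```

A scraper would call `throttle.wait(url)` immediately before each download, so that requests to different domains are not slowed down by each other.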

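Iterating over all 26 letters is a clumsy way to get the data. We can replace the letter with ., which matches every country (this suggests the server treats the search term as a pattern in which . matches any character). Combined with a page_size large enough to hold every result, this fetches all of the data in a single request.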

import csv
import json
from new_chapter3.downloader import Downloader


def main():
    D = Downloader()
    # '.' matches every country; page_size=1000 returns them all on one page
    html = D('http://example.webscraping.com/places/ajax/search.json?&search_term=.&page_size=1000&page=0')
    ajax = json.loads(html)
    with open('countries2.csv', 'w') as f:
        writer = csv.writer(f)
        for record in ajax['records']:
            writer.writerow([record['country']])


if __name__ == '__main__':
    main()


Compared with going through the 26 letters one by one, this approach needs less code and is much faster.
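The query strings in these scripts are written by hand. As a hedged alternative, they could be assembled with the standard library's `urlencode`, which also escapes characters that are not safe in a URL. The endpoint and parameter names come from the scripts above; the helper `build_search_url` and its defaults are assumptions for this sketch.

```python
from urllib.parse import urlencode

# Endpoint taken from the scripts in this post.
SEARCH_ENDPOINT = 'http://example.webscraping.com/places/ajax/search.json'


def build_search_url(term, page_size=10, page=0):
    """Build the AJAX search URL from its three query parameters."""
    query = urlencode({'search_term': term, 'page_size': page_size, 'page': page})
    return SEARCH_ENDPOINT + '?' + query


print(build_search_url('.', page_size=1000))
```

Building the URL this way keeps the parameters in one place when experimenting with different search terms and page sizes.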

Rendering a Dynamic Web Page

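Some websites are so complex that reverse engineering them is not practical. In that case we can use a browser rendering engine to render the page. The rendering engine is the part of a browser that parses the HTML, applies the CSS styles and executes the JavaScript when displaying a page. Here we use the WebKit rendering engine, which has a Python interface through the Qt framework.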


First we try the traditional approach, downloading the raw HTML and extracting the data from it:

import lxml.html
from new_chapter3.downloader import Downloader

D = Downloader()
html = D('http://example.webscraping.com/places/default/dynamic')
tree = lxml.html.fromstring(html)
# The #result element exists in the source, but its text is filled in by JavaScript
print('Result: ' + tree.cssselect('#result')[0].text_content())

Printed result:
[Figure: screenshot of the printed result]

The result is empty: parsing the raw HTML cannot reach the data produced by JavaScript. Now we try again with WebKit:

# -*- coding: utf-8 -*-

try:
    from PySide.QtGui import QApplication
    from PySide.QtCore import QUrl
    from PySide.QtWebKit import QWebView
except ImportError:
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl
    from PyQt4.QtWebKit import QWebView
import lxml.html
from new_chapter3.downloader import Downloader


def direct_download(url):
    """Download the raw HTML without executing JavaScript."""
    download = Downloader()
    return download(url)


def webkit_download(url):
    """Load the page in WebKit and return the HTML after JavaScript has run."""
    # QApplication must be created before any other Qt object
    app = QApplication([])
    # QWebView is the container for the web document
    webview = QWebView()
    # leave the event loop once the page has finished loading
    webview.loadFinished.connect(app.quit)
    webview.load(QUrl(url))
    app.exec_()  # blocks here until loading has finished
    return webview.page().mainFrame().toHtml()


def parse(html):
    tree = lxml.html.fromstring(html)
    print(tree.cssselect('#result')[0].text_content())


def main():
    url = 'http://example.webscraping.com/places/default/dynamic'
    # parse(direct_download(url))  # raw HTML: prints an empty result
    parse(webkit_download(url))


if __name__ == '__main__':
    main()

Printed result:
[Figure: screenshot of the printed result]

This time the data generated by JavaScript is retrieved. The approach is to let WebKit execute the JavaScript, then read the resulting HTML and parse it for the data.

The rendering code above still cannot scrape the search page. For that we need to extend it so that it supports interaction with the website.

Below is another version of the AJAX search example from earlier, this time with support for interaction.

# -*- coding: utf-8 -*-

try:
    from PySide.QtGui import QApplication
    from PySide.QtCore import QUrl, QEventLoop
    from PySide.QtWebKit import QWebView
except ImportError:
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl, QEventLoop
    from PyQt4.QtWebKit import QWebView


def main():
    """Set the search parameters and simulate the click, then read the data
    from the page that the search produces."""
    app = QApplication([])
    webview = QWebView()
    loop = QEventLoop()
    webview.loadFinished.connect(loop.quit)
    webview.load(QUrl('http://example.webscraping.com/places/default/search'))
    loop.exec_()

    # Show the rendering window; you can type parameters and trigger actions
    # in it directly, which helps with debugging
    webview.show()
    frame = webview.page().mainFrame()

    # Set the search parameters. findFirstElement() returns the first match;
    # findAllElements() would return a collection of all matches.
    frame.findFirstElement('#search_term').setAttribute('value', '.')
    frame.findFirstElement('#page_size option:checked').setPlainText('1000')
    # Submit the form by simulating a click through evaluateJavaScript()
    frame.findFirstElement('#search').evaluateJavaScript('this.click()')

    # Poll the page until the country links appear in the results div.
    # Each iteration calls app.processEvents() to give Qt time to run its
    # tasks, such as responding to events and updating the GUI.
    elements = None
    while not elements:
        app.processEvents()
        elements = frame.findAllElements('#results a')  # all links in the results div
    countries = [e.toPlainText().strip() for e in elements]  # text of each link
    print(countries)


if __name__ == '__main__':
    main()

To make the code easier to reuse, we wrap these methods in a class:

# -*- coding: utf-8 -*-

import csv
import time

try:
    from PySide.QtGui import QApplication
    from PySide.QtCore import QUrl, QEventLoop, QTimer
    from PySide.QtWebKit import QWebView
except ImportError:
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl, QEventLoop, QTimer
    from PyQt4.QtWebKit import QWebView


class BrowserRender(QWebView):
    def __init__(self, display=True):
        # QApplication must exist before any other Qt object
        self.app = QApplication([])
        QWebView.__init__(self)
        if display:
            # Show the rendering window; you can type parameters and trigger
            # actions in it directly, which helps with debugging
            self.show()

    def open(self, url, timeout=60):
        """Wait for the download to complete and return the HTML (None on timeout)"""
        loop = QEventLoop()
        timer = QTimer()  # guards against a load that never finishes
        timer.setSingleShot(True)
        timer.timeout.connect(loop.quit)
        self.loadFinished.connect(loop.quit)
        self.load(QUrl(url))
        timer.start(timeout * 1000)
        loop.exec_()  # blocks here until the page has loaded or the timer fires
        if timer.isActive():
            # the timer is still running, so the download did not time out
            timer.stop()
            return self.html()  # HTML of the loaded page, ready for data extraction
        else:
            print('Request timed out: ' + url)

    def html(self):
        """Shortcut to return the current HTML"""
        return self.page().mainFrame().toHtml()

    def find(self, pattern):
        """Return the first element that matches the pattern"""
        # Every form control on this page has a different name, so the first
        # match is enough. findFirstElement() returns a single element, while
        # findAllElements() returns a collection that has to be iterated.
        return self.page().mainFrame().findFirstElement(pattern)

    def attr(self, pattern, name, value):
        """Set an attribute on the matching element"""
        self.find(pattern).setAttribute(name, value)

    def text(self, pattern, value):
        """Set the text of the matching element"""
        self.find(pattern).setPlainText(value)

    def click(self, pattern):
        """Click the matching element"""
        self.find(pattern).evaluateJavaScript("this.click()")

    def wait_load(self, pattern, timeout=60):
        """Wait for this pattern to be found in the page and return the matches"""
        # Track the waiting time and stop polling at the deadline; otherwise
        # a network problem would keep the loop running forever
        deadline = time.time() + timeout
        while time.time() < deadline:
            self.app.processEvents()
            matches = self.page().mainFrame().findAllElements(pattern)
            if matches:
                return matches
        print('Wait load timed out')


def main():
    """Set the search parameters and simulate the click, then look up the
    results in the page that the search produces."""
    br = BrowserRender()
    br.open('http://example.webscraping.com/places/default/search')
    br.attr('#search_term', 'value', '.')
    br.text('#page_size option:checked', '1000')
    br.click('#search')

    # wait_load() returns None when it times out, so fall back to an empty list
    elements = br.wait_load('#results a') or []
    with open('countries1.csv', 'w') as f:
        writer = csv.writer(f)
        for country in [e.toPlainText().strip() for e in elements]:
            print(country)
            writer.writerow([country])


if __name__ == '__main__':
    main()
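The polling pattern inside wait_load does not depend on Qt and can be sketched on its own. In the version below, `wait_for`, `check` and the injectable `clock` and `sleep` parameters are hypothetical names for this example; in the Qt code the pause between checks is the call to processEvents() rather than a sleep.

```python
import time


def wait_for(check, timeout=60, poll_interval=0.1, clock=time.time, sleep=time.sleep):
    """Call check() until it returns something truthy or the deadline passes."""
    deadline = clock() + timeout
    while clock() < deadline:
        result = check()
        if result:
            return result      # the awaited content has appeared
        sleep(poll_interval)   # pause before checking again
    return None                # timed out


# Usage: a check that only succeeds on its third call.
attempts = []


def links_ready():
    attempts.append(1)
    return ['Afghanistan'] if len(attempts) >= 3 else []


print(wait_for(links_ready, timeout=5, poll_interval=0))
```

Returning None on timeout leaves it to the caller to decide whether a missing result is an error.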
