Scraping JavaScript-rendered pages in Python with PhantomJS
Some pages are not loaded statically; instead, their content is filled in dynamically by JavaScript functions. On the page below, for example, the data in both the call-contract and put-contract tables is fetched from the backend by JavaScript, so BeautifulSoup alone cannot capture the data in those tables.
Some searching showed that PhantomJS can scrape this kind of page. PhantomJS itself is a headless WebKit browser that is normally scripted in JavaScript; to use it from Python, you drive it through Selenium. The code below mainly follows this page: Is there a way to use PhantomJS in Python?
Selenium is a browser automation framework that can simulate all kinds of user behavior in various browsers. To scrape dynamic pages from Python with PhantomJS via Selenium, the following need to be installed:
1. BeautifulSoup, for parsing the page content
2. Node.js
3. PhantomJS, installed through Node.js once it is in place: in a Mac terminal, run npm -g install phantomjs
4. Selenium
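On macOS, the steps above roughly correspond to the following commands (a sketch; the package names and the global npm install reflect PhantomJS-era tooling and may differ on your setup):

```shell
# BeautifulSoup and the Selenium bindings for Python
pip install beautifulsoup4 selenium

# PhantomJS, installed globally through Node.js / npm
npm -g install phantomjs
```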
After completing these four steps, PhantomJS can be used from Python. The code is as follows:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import urllib2
import time

baseUrl = "http://stock.finance.sina.com.cn/option/quotes.html"
csvPath = "FinanceData.csv"
csvFile = open(csvPath, 'w')

def is_chinese(uchar):
    # True if the unicode character is a Chinese character
    # (CJK Unified Ideographs, U+4E00 to U+9FA5)
    return u'\u4e00' <= uchar <= u'\u9fa5'

def readPage(url):
    # static fetch, kept for comparison -- it cannot see the JS-rendered table
    webURL = urllib2.urlopen(url)
    content = webURL.read()
    soup = BeautifulSoup(content)
    return soup

def getFinance(soup, tableName):
    divs = soup.findAll('div', attrs={'class': tableName})  # the contract table lives in this div
    if len(divs) == 0:
        print "No div class named " + str(tableName)
        return
    tbs = divs[0].findChildren('tbody')  # there is only one tbody under this tag
    trs = tbs[0].findChildren('tr')  # each tr is one row of the table
    for tr in trs:
        tds = tr.findChildren('td')  # the td cells hold the actual values
        string = ""
        index = 0  # column position, used to know where Chinese characters appear
        for td in tds:
            temp = td.text
            if index == 7 or index == 0:
                # strip the Chinese characters out of these two columns
                temp = "".join(d for d in temp if not is_chinese(d))
            string = string + temp + ","
            index += 1
        print string
        csvFile.write(string)
        csvFile.write('\n')

tableName = "table_down fr"  # class of the div that wraps the table
driver = webdriver.PhantomJS(executable_path='/Users/Pan/node_modules/phantomjs/lib/phantom/bin/phantomjs')
driver.get(baseUrl)
data = driver.page_source  # the page content after the JS has run
driver.quit()  # note the parentheses -- driver.quit alone does nothing
#soup = readPage(loadUrl)
soup = BeautifulSoup(data)
getFinance(soup, tableName)
print "Finished!"
csvFile.close()
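The is_chinese helper in the listing above is just a range check against the CJK Unified Ideographs block (U+4E00 to U+9FA5). A standalone sketch of how it filters the mixed cells (written here in Python 3, while the listing itself is Python 2; the sample string is made up for illustration):

```python
def is_chinese(uchar):
    # True if the character falls in the CJK Unified Ideographs block
    return u'\u4e00' <= uchar <= u'\u9fa5'

def strip_chinese(text):
    # keep only the non-Chinese characters, as the column-0/column-7 cleanup does
    return "".join(ch for ch in text if not is_chinese(ch))

print(strip_chinese(u"50ETF購3月2350"))  # -> 50ETF32350
```

Note that this range covers only the basic ideograph block; full-width punctuation and CJK extension blocks fall outside it, which is fine for stripping contract-name characters like these.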
But the code above has a problem! It sometimes captures the data and sometimes does not. The reason is that when the script runs, some of the data may not have been loaded yet, so we need a way to tell when the page has finished loading the data we want. To solve this, Selenium provides a Waits mechanism, which lets the script wait for a while before reading the page. Waits come in two flavors: Explicit Waits and Implicit Waits. A wait can be combined with an ExpectedCondition: the script waits up to a given time, and if the condition specified by the ExpectedCondition is met within that window, the code after it runs; if the time runs out without the condition being met, an exception is thrown. Below is a sample snippet using an Explicit Wait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

ff = webdriver.Firefox()
ff.get("http://somedomain/url_that_delays_loading")
try:
    # Wait up to 10 s: if an element with ID "myDynamicElement" appears within
    # that window, execution continues; otherwise an exception is thrown
    element = WebDriverWait(ff, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement")))
finally:
    ff.quit()
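Under the hood, WebDriverWait is a poll-until-predicate-or-timeout loop. A stdlib-only sketch of the same pattern, with no browser involved (all names here are hypothetical, not Selenium APIs):

```python
import time

def wait_until(predicate, timeout=10.0, poll=0.5):
    # Poll predicate() until it returns a truthy value or the timeout expires.
    # Returns the truthy value, or raises RuntimeError on timeout -- the same
    # shape as WebDriverWait.until, which raises TimeoutException instead.
    deadline = time.time() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Simulate a "page" whose element only appears on the third poll
state = {"calls": 0}
def element_loaded():
    state["calls"] += 1
    return "myDynamicElement" if state["calls"] >= 3 else None

print(wait_until(element_loaded, timeout=5.0, poll=0.01))  # -> myDynamicElement
```

This is why an explicit wait is more robust than a fixed time.sleep(): it returns as soon as the condition holds instead of always paying the full delay.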
In this example the call contracts were captured every time, while the put contracts were unreliable. Presumably the page runs the function that loads the call contracts first and the one for the put contracts after (a guess -- I did not read the page's code), so the put-contract data arrives a little later.
The modified code is as follows:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import urllib2
import time

baseUrl = "http://stock.finance.sina.com.cn/option/quotes.html"
csvPath = "FinanceData.csv"
csvFile = open(csvPath, 'w')

def is_chinese(uchar):
    # True if the unicode character is a Chinese character
    # (CJK Unified Ideographs, U+4E00 to U+9FA5)
    return u'\u4e00' <= uchar <= u'\u9fa5'

def readPage(url):
    # static fetch, kept for comparison -- it cannot see the JS-rendered table
    webURL = urllib2.urlopen(url)
    content = webURL.read()
    soup = BeautifulSoup(content)
    return soup

def getFinance(soup, tableName):
    divs = soup.findAll('div', attrs={'class': tableName})  # the contract table lives in this div
    if len(divs) == 0:
        print "No div class named " + str(tableName)
        return
    tbs = divs[0].findChildren('tbody')  # there is only one tbody under this tag
    trs = tbs[0].findChildren('tr')  # each tr is one row of the table
    for tr in trs:
        tds = tr.findChildren('td')  # the td cells hold the actual values
        string = ""
        index = 0  # column position, used to know where Chinese characters appear
        for td in tds:
            temp = td.text
            if index == 7 or index == 0:
                # strip the Chinese characters out of these two columns
                temp = "".join(d for d in temp if not is_chinese(d))
            string = string + temp + ","
            index += 1
        print string
        csvFile.write(string)
        csvFile.write('\n')

tableName = "table_down fr"  # class of the div that wraps the table
driver = webdriver.PhantomJS(executable_path='/Users/Pan/node_modules/phantomjs/lib/phantom/bin/phantomjs')
driver.get(baseUrl)
####################################################
# wait up to 10 s for the table with the given class to appear
try:
    # the call and put tables share this class; note that "table_down fr" is a
    # compound class name, which By.CLASS_NAME cannot match, so build a CSS
    # selector from it instead
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "div." + tableName.replace(" ", "."))))
except Exception as e:
    print e
finally:
    data = driver.page_source  # the page content after the JS has run
    driver.quit()  # note the parentheses -- driver.quit alone does nothing
####################################################
#soup = readPage(loadUrl)
soup = BeautifulSoup(data)
getFinance(soup, tableName)
print "Finished!"
csvFile.close()
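As a side note, the manual comma-joining and csvFile.write() calls in getFinance could also use Python's csv module, which handles quoting of fields that themselves contain commas. A small sketch (shown in Python 3 with an in-memory buffer; the row values are made up for illustration):

```python
import csv
import io

rows = [
    ["10002694", "1.1388", "2350"],  # hypothetical sample rows
    ["10002695", "0.9523", "2400"],
]

buf = io.StringIO()       # stand-in for open("FinanceData.csv", "w", newline="")
writer = csv.writer(buf)
for row in rows:
    writer.writerow(row)  # quoting and the line terminator are handled for us

print(buf.getvalue())
```

This also avoids the trailing comma the manual version leaves at the end of each line.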