
Python Web Scraping: Using js2xml to Extract Data from script Tags

For the latest method of processing data inside script tags, see this.

This post shows how to use js2xml to extract data embedded in a <script> tag.

1. Target page URL: https://s.taobao.com/search?q=%E6%89%8B%E6%9C%BA&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

2. Data to extract

<script>
        
    g_page_config = {"pageName":"mainsrp","mods":{"shopcombotip":{"status":"hide"},"phonenav":{"status":"hide"},"debugbar":{"status":"hide"},"shopcombo":{"status":"hide"},"itemlist":{"status":"show","data":{"postFeeText":"運費","trace":"msrp_auction","auctions":[{"i2iTags":{"samestyle":{"url":""},"similar":{"url":"/search?type\u003dsimilar\u0026app\u003di2i\u0026rec_type\u003d1\u0026uniqpid\u003d\u0026nid\u003d558550356564"}},"p4pTags":[],"nid":"558550356564","category":"1512","pid":"","title":"【低至4298元起】Apple/蘋果 iPhone 8 64G 全網通4G\u003cspan class\u003dH\u003e手機\u003c/span\u003e 蘋果8","raw_title":"【低至4298元起】Apple/蘋果 iPhone 8 64G 全網通4G手機 蘋果8","pic_url":"//g-search3.alicdn.com/img/bao/uploaded/i4/i1/2616970884/TB1828xx7SWBuNjSszdXXbeSpXa_!!0-item_pic.jpg","detail_url":"//detail.tmall.com/item.htm?id\u003d558550356564\u0026ad_id\u003d\u0026am_id\u003d\u0026cm_id\u003d140105335569ed55e27b\u0026pm_id\u003d\u0026abbucket\u003d11","view_price":"4443.00","view_fee":"0.00","item_loc":"江蘇 南京","view_sales":"58626人付款","comment_count":"88600","user_id":"2616970884","nick":"蘇寧易購官方旗艦店","shopcard":{"levelClasses":[{"levelClass":"icon-supple-level-jinguan"},{"levelClass":"icon-supple-level-jinguan"},{"levelClass":"icon-supple-level-jinguan"},{"levelClass":"icon-supple-level-jinguan"},{"levelClass":"icon-supple-level-jinguan"}],"isTmall":true,"delivery":[487,1,1605],"description":[488,1,919],"service":[484,1,769],"encryptedUserId":"UvCxYMCkuvmg4MNTT","sellerCredit":20,"totalRate":10000},"icon":[{"title":"618大促活動1","dom_class":"icon-fest-618fenwei2018","position":"0","show_type":"0","icon_category":"baobei","outer_text":"0","html":"","icon_key":"icon-fest-618fenwei2018","trace":"srpservice","traceIdx":0,"innerText":"618大促活動1"},{"title":"尚天貓,就購了","dom_class":"icon-service-tianmao","position":"1","show_type":"0","icon_category":"baobei","outer_text":"0","html":"","icon_key":"icon-service-tianmao","trace":"srpservice","traceIdx":1,"innerText":"天貓寶貝","url":"//www.tmall.com/"}],"comment_url":
"//detail.tmall.com/item.htm?
"""</script>

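The information we want sits inside a <script> tag as JavaScript source, not in an XML-like structure that can be queried directly. Besides pulling it out with regular expressions, we can also use js2xml, which converts the JavaScript into an XML tree.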
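For comparison, the regular-expression route mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the sample HTML is a hand-trimmed stand-in for the real page, and it assumes the `g_page_config` object is valid JSON, sits on a single line, and ends with `};`. The real page may differ, in which case the pattern needs adjusting.

```python
import json
import re

# Hand-trimmed stand-in for the page source (assumption: the real object
# is valid JSON on one line and is terminated by "};").
html = """<script>
    g_page_config = {"pageName":"mainsrp","mods":{"itemlist":{"data":{"auctions":[{"nid":"558550356564","view_price":"4443.00"}]}}}};
    g_srp_loadCss();
</script>"""

# Capture everything from the first "{" up to the "}" that precedes ";" at end of line.
match = re.search(r"g_page_config\s*=\s*(\{.*?\});\s*$", html, re.MULTILINE)
if match is None:
    raise ValueError("g_page_config assignment not found")

# The captured text is JSON, so the standard json module can decode it.
config = json.loads(match.group(1))
auctions = config["mods"]["itemlist"]["data"]["auctions"]

for item in auctions:
    print(item["nid"], item["view_price"])
```

This works only while the embedded object happens to be valid JSON. js2xml parses the JavaScript itself, so it does not depend on that.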

3. Introduction to js2xml

First, call parse(text, encoding='utf8', debug=False) to convert the JavaScript source into an object of type <class 'lxml.etree._Element'>. Then call pretty_print(tree) to render that element as an indented XML tag tree.

4. Practical application

import requests
from bs4 import BeautifulSoup
import js2xml

headers = {
    'cookie' : 't=d970fee880a8a21b3bc6c4cfc5214f06; cna=z9qXE/dcfB4CAXcx2lO5D6hy; miid=830238069184972498; UM_distinctid=163d4e10df42c9-008bee4f8dbd2e-6b1b1279-100200-163d4e10df531e; __guid=154677242.2026198364084961000.1528950691863.182; hng=CN%7Czh-CN%7CCNY%7C156; thw=cn; cookie2=1a9dd588fb9863dbcef85080ba49048d; v=0; _tb_token_=fb473336e3bee; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; JSESSIONID=1B0C308CF96CCA8721C294659D0603C2; monitor_count=8; CNZZDATA1272960300=1938859093-1528945321-https%253A%252F%252Fwww.taobao.com%252F%7C1528977724; isg=BGtrPwowtJC0BOh06VZZtSmH-o-VKH8KIAakd93oRqoBfIveZVAPUgne0rwS3Nf6',
    'referer' : 'https://www.taobao.com/',
    'user-agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }

url = "https://s.taobao.com/search?q=%E6%89%8B%E6%9C%BA&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
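r = requests.get(url, headers=headers)
demo = r.content
soup = BeautifulSoup(demo, 'lxml')
# The 8th <script> inside <head> held g_page_config when this was written;
# the index depends on the page layout and may need adjusting.
src = soup.select('head script')[7].string
# Parse the JavaScript source into an lxml element tree
src_text = js2xml.parse(src, encoding='utf-8', debug=False)
print(type(src_text))
# Render the tree as an indented XML string
src_tree = js2xml.pretty_print(src_text)
print(src_tree)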