
Scraping the Maoyan Top 100 Movies with Python 3 (Multiprocessing)

Analysis:

Key part of the page source (each <dd></dd> pair holds all the main information for one movie):

<div class="content">
    <div class="wrapper">
        <div class="main">
            <p class="update-time">2018-08-05<span class="has-fresh-text">已更新</span></p>
            <p class="board-content">榜單規則:將貓眼電影庫中的經典影片,按照評分和評分人數從高到低綜合排序取前100名,每天上午10點更新。相關資料來源於“貓眼電影庫”。</p>
            <dl class="board-wrapper">
                <dd>
                    <i class="board-index board-index-1">1</i>
                    <a href="/films/1203" title="霸王別姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
                        <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
                        <img data-src="http://p1.meituan.net/movie/[email protected]_220h_1e_1c" alt="霸王別姬" class="board-img" />
                    </a>
                    <div class="board-item-main">
                        <div class="board-item-content">
                            <div class="movie-item-info">
                                <p class="name"><a href="/films/1203" title="霸王別姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王別姬</a></p>
                                <p class="star">
                                    主演:張國榮,張豐毅,鞏俐
                                </p>
                                <p class="releasetime">上映時間:1993-01-01(中國香港)</p>
                            </div>
                            <div class="movie-item-number score-num">
                                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
                            </div>
                        </div>
                    </div>
                </dd>
                <dd>
                    <i class="board-index board-index-2">2</i>
                    <a href="/films/1297" title="肖申克的救贖" class="image-link" data-act="boarditem-click" data-val="{movieId:1297}">
                        <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
                        <img data-src="http://p0.meituan.net/movie/[email protected]_220h_1e_1c" alt="肖申克的救贖" class="board-img" />
                    </a>
                    <div class="board-item-main">
                        <div class="board-item-content">
                            <div class="movie-item-info">
                                <p class="name"><a href="/films/1297" title="肖申克的救贖" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救贖</a></p>
                                <p class="star">
                                    主演:蒂姆·羅賓斯,摩根·弗里曼,鮑勃·岡頓
                                </p>
                                <p class="releasetime">上映時間:1994-10-14(美國)</p>
                            </div>
                            <div class="movie-item-number score-num">
                                <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
                            </div>
                        </div>
                    </div>
                </dd>
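To see how the <dd> structure maps onto extracted fields, here is a minimal sketch that runs the regular expression from the final code against a trimmed copy of the first entry. The sample markup is shortened for illustration, and the poster URL (`http://example.invalid/poster.jpg`) is a placeholder, since the real URL is truncated in the page source above.

```python
import re

# Trimmed copy of the first <dd> entry. The data-src URL is a placeholder,
# not the real poster address.
SAMPLE = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1203" title="霸王別姬" class="image-link">
    <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
    <img data-src="http://example.invalid/poster.jpg" alt="霸王別姬" class="board-img" />
  </a>
  <p class="star">
    主演:張國榮,張豐毅,鞏俐
  </p>
  <p class="releasetime">上映時間:1993-01-01(中國香港)</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

# Same pattern as in the final code; re.S lets "." match newlines too.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?<img data-src="(.*?)"'
    r'.*?alt="(.*?)".*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>',
    re.S,
)

rank, image, title, star, release, integer, fraction = PATTERN.search(SAMPLE).groups()

fields = {
    'index': rank,
    'image': image,
    'title': title,
    'actor': star.strip()[3:],    # drop the 3-character "主演:" prefix
    'time': release.strip()[5:],  # drop the 5-character "上映時間:" prefix
    'score': integer + fraction,  # "9." + "6"
}
print(fields)
```

The lazy `.*?` pieces skip over whatever markup lies between the captured fields, which is why the pattern still matches when the intermediate <div> wrappers are removed.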

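Next, look at the pattern in the page URLs. Each page lists 10 movies, and moving to the next page increases the offset value in the URL by 10 (0, 10, ..., 90).

So, following this pattern, all 100 entries can be collected by requesting the 10 pages with offsets 0 through 90.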
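The URL pattern can be sketched as follows. This assumes the page is selected with an `offset` query parameter on the board URL used in the final code; the helper name `page_url` is made up for this example.

```python
BASE_URL = 'http://maoyan.com/board/4'  # board URL taken from the final code


def page_url(offset):
    """Build the URL of one page; assumes an `offset` query parameter."""
    return f'{BASE_URL}?offset={offset}'


# Ten pages of ten movies each: offsets 0, 10, ..., 90.
urls = [page_url(i * 10) for i in range(10)]
print(urls[0])
print(urls[-1])
```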

Final code:

import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException

def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    # Raw strings keep \d from being treated as a string escape sequence.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?<img data-src="(.*?)"'
                         r'.*?alt="(.*?)".*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>', re.S)

    items = re.findall(pattern, html)
    for item in items:  # turn each match into a dict of key-value pairs
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
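            'actor': item[3].strip()[3:],  # drop the 3-character "主演:" prefix
            'time': item[4].strip()[5:],   # drop the 5-character "上映時間:" prefix
            'score': item[5] + item[6],    # integer part plus fraction, e.g. "9." + "6"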
        }

def write_to_file(content):
    # Append one JSON object per line; the with block closes the file automatically.
    with open('maoyan.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

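def main(offset):
    # Each page is selected with the offset query parameter (0, 10, ..., 90).
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed or returned a non-200 status
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)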

if __name__ == '__main__':
    # Sequential alternative: for i in range(10): main(i * 10)
    # Fetch the 10 pages in parallel worker processes.
    with Pool() as pool:
        pool.map(main, [i * 10 for i in range(10)])
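Because the pages are fetched by separate worker processes, the lines in maoyan.txt are not guaranteed to appear in rank order. A minimal sketch for restoring the order afterwards, assuming one JSON object per line as written by write_to_file (the helper name `sort_records` and the sample lines are made up for this example):

```python
import json


def sort_records(lines):
    """Parse JSON lines and return the records ordered by rank."""
    records = [json.loads(line) for line in lines if line.strip()]
    # 'index' is stored as a string, so convert it before comparing.
    return sorted(records, key=lambda record: int(record['index']))


# Lines as they might appear when the pages finish out of order.
sample_lines = [
    '{"index": "11", "title": "B"}\n',
    '{"index": "1", "title": "A"}\n',
    '{"index": "2", "title": "C"}\n',
]
ordered = sort_records(sample_lines)
print([record['index'] for record in ordered])
```

With the real output, the same function can be applied to the open file object, for example `sort_records(open('maoyan.txt', encoding='utf-8'))`. Converting to int matters: sorting the string values directly would place "11" before "2".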
