Scraping the Maoyan Movies Top 100 with Python 3 (multiprocessing)
阿新 • Published: 2019-01-29
Analysis:
Key part of the page source (one pair of <dd></dd> tags holds all the main information for a single movie):
<div class="content">
  <div class="wrapper">
    <div class="main">
      <p class="update-time">2018-08-05<span class="has-fresh-text">已更新</span></p>
      <p class="board-content">榜單規則:將貓眼電影庫中的經典影片,按照評分和評分人數從高到低綜合排序取前100名,每天上午10點更新。相關資料來源於“貓眼電影庫”。</p>
      <dl class="board-wrapper">
        <dd>
          <i class="board-index board-index-1">1</i>
          <a href="/films/1203" title="霸王別姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
            <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
            <img data-src="http://p1.meituan.net/movie/[email protected]_220h_1e_1c" alt="霸王別姬" class="board-img" />
          </a>
          <div class="board-item-main">
            <div class="board-item-content">
              <div class="movie-item-info">
                <p class="name"><a href="/films/1203" title="霸王別姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王別姬</a></p>
                <p class="star"> 主演:張國榮,張豐毅,鞏俐 </p>
                <p class="releasetime">上映時間:1993-01-01(中國香港)</p>
              </div>
              <div class="movie-item-number score-num">
                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
              </div>
            </div>
          </div>
        </dd>
        <dd>
          <i class="board-index board-index-2">2</i>
          <a href="/films/1297" title="肖申克的救贖" class="image-link" data-act="boarditem-click" data-val="{movieId:1297}">
            <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
            <img data-src="http://p0.meituan.net/movie/[email protected]_220h_1e_1c" alt="肖申克的救贖" class="board-img" />
          </a>
          <div class="board-item-main">
            <div class="board-item-content">
              <div class="movie-item-info">
                <p class="name"><a href="/films/1297" title="肖申克的救贖" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救贖</a></p>
                <p class="star"> 主演:蒂姆·羅賓斯,摩根·弗里曼,鮑勃·岡頓 </p>
                <p class="releasetime">上映時間:1994-10-14(美國)</p>
              </div>
              <div class="movie-item-number score-num">
                <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
              </div>
            </div>
          </div>
        </dd>
Next, look for a pattern in the page URLs. Each page of the board shows 10 movies and is addressed by an offset query parameter that grows by 10 per page: http://maoyan.com/board/4?offset=0, ?offset=10, ..., ?offset=90 for the full Top 100. So, following this pattern, the ten offsets can be generated as i * 10 for i in range(10).
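The offset pattern just described can be sketched as a one-line list comprehension (assuming the ?offset= query form):

```python
# Sketch of the URL pattern described above: ten pages, offsets 0, 10, ..., 90.
base = 'http://maoyan.com/board/4?offset='
urls = [base + str(i * 10) for i in range(10)]
for u in urls:
    print(u)
```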
Complete code:
import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?<img data-src="(.*?)"'
                         r'.*?alt="(.*?)".*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:  # turn each captured tuple into key-value form
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the leading "主演:"
            'time': item[4].strip()[5:],   # drop the leading "上映時間:"
            'score': item[5] + item[6],    # join integer and fraction parts
        }


def write_to_file(content):
    with open('maoyan.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
        # no explicit f.close() needed: the with block closes the file


def main(offset):
    # the offset is a query parameter, not part of the path
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])
    pool.close()
    pool.join()
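Since write_to_file() appends one JSON object per line, the results file can later be read back with a small helper (read_results is a hypothetical name, not part of the original script):

```python
import json

# Hypothetical helper: write_to_file() above appends one JSON object per
# line to maoyan.txt, so the results can be parsed back line by line.
def read_results(path='maoyan.txt'):
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

Note that with multiple worker processes appending to the same file, the order of lines in maoyan.txt is not guaranteed to follow the ranking; sort the returned dictionaries by their 'index' key if order matters.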