Scraping Maoyan Movies with Python requests

1. URL: http://maoyan.com/board/4?

2. Code:

import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_one_page_html(url):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?alt.*?src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)  # re.S lets . match newlines as well

    items = re.findall(pattern, html)
    # Each match is a tuple such as:
    # ('1', 'http://p1.meituan.net/movie/[email protected]_220h_1e_1c', '霸王別姬', '\n 主演:張國榮,張豐毅,鞏俐\n ', '上映時間:1993-01-01(中國香港)', '9.', '6')
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the leading "主演:" label
            'time': item[4].strip()[5:],   # drop the leading "上映時間:" label
            'score': item[5] + item[6]
        }


def write_to_file(content):
    # content is a dict: serialize it to a JSON string and append it with a
    # newline so that result.txt holds one record per line.
    # (IDE tip: Alt+Enter adds the missing import.)
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page_html(url)
    if html is None:  # request failed, nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        # Without encoding='utf-8' and ensure_ascii=False in write_to_file,
        # result.txt would contain \uXXXX escapes instead of Chinese text.
        write_to_file(item)


if __name__ == '__main__':
    # Sequential alternative:
    # for i in range(10):
    #     main(i * 10)

    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])
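To see what the regular expression in parse_one_page captures, it can be run against a single entry without touching the network. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup written to fit the pattern (the real Maoyan page may be structured differently), and parse_sample is a made-up helper name that mirrors parse_one_page.

```python
import re

# Hypothetical markup for one movie entry; an assumption made for this
# example, not a copy of the real page.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1203" title="霸王別姬">
    <img alt="霸王別姬" class="board-img" src="http://example.com/poster.jpg" />
  </a>
  <p class="name"><a href="/films/1203" title="霸王別姬">霸王別姬</a></p>
  <p class="star">
    主演:張國榮,張豐毅,鞏俐
  </p>
  <p class="releasetime">上映時間:1993-01-01(中國香港)</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

# Same pattern as in the article; each (...) group is one field.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?alt.*?src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,  # let . match newlines, since an entry spans several lines
)


def parse_sample(html):
    """Return a list of dicts, one per <dd> entry matched in html."""
    results = []
    for item in PATTERN.findall(html):
        results.append({
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # strip whitespace, drop "主演:"
            'time': item[4].strip()[5:],   # drop "上映時間:"
            'score': item[5] + item[6],    # integer part + fraction part
        })
    return results


if __name__ == '__main__':
    for movie in parse_sample(SAMPLE_HTML):
        print(movie)
```

The lazy quantifier .*? is what keeps each group from running past the end of its own tag; without re.S the pattern would not match at all here, because the entry spans multiple lines.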

3. Result:

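Each line of result.txt is one JSON object. As a small sketch of why the file shows readable Chinese, the snippet below serializes the same record with and without ensure_ascii=False; the record dict is an illustrative value, not actual scraped output.

```python
import json

# Illustrative record (an assumption for this example).
record = {'index': '1', 'title': '霸王別姬', 'score': '9.6'}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(record, ensure_ascii=False)

if __name__ == '__main__':
    print(escaped)
    print(readable)
```

Both strings decode back to the same dict with json.loads; only the on-disk representation differs. The file also has to be opened with encoding='utf-8' so that the readable form can be written regardless of the platform's default encoding.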

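Notes:

1. Read the regular expression carefully; each capture group corresponds to one field of a movie entry.

2. The parsed output is formatted as dicts: parse_one_page is a generator that yields one dict per movie.

3. When writing to the file, use encoding='utf-8' and ensure_ascii=False so the text is stored as readable Chinese rather than Unicode escapes.

4. The process pool (Pool) fetches the pages in parallel, so the whole crawl finishes in about a second.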
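The process pool from the last note can be tried in isolation. This is a minimal sketch, assuming a stand-in worker: fetch_offset and crawl_all are hypothetical names, and fetch_offset only builds the page URL instead of downloading it, so the example runs offline.

```python
from multiprocessing import Pool

# The board is paged in steps of 10: offset=0, 10, ..., 90.
OFFSETS = [i * 10 for i in range(10)]


def fetch_offset(offset):
    """Hypothetical stand-in for main(offset): build the URL for one page."""
    return 'http://maoyan.com/board/4?offset=' + str(offset)


def crawl_all(processes=4):
    """Run fetch_offset over all offsets in a pool of worker processes."""
    with Pool(processes) as pool:
        # map blocks until every task is done and returns results in
        # the same order as OFFSETS.
        return pool.map(fetch_offset, OFFSETS)


if __name__ == '__main__':
    for url in crawl_all():
        print(url)
```

Pool.map returns results in input order, but the workers themselves run concurrently. In the article's version each worker appends to result.txt directly, so the order of lines in the file is not guaranteed to follow the ranking.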
