Python Web Scraping: Scraping the Maoyan Movies TOP100

阿新 • Published: 2019-01-21

Platform: Windows

Python version: Python 3.7.0

IDE: Sublime Text

Browser: Chrome

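Approach:

1. Inspect the page source

2. Fetch a single page

3. Extract information with a regular expression

4. Write all Maoyan TOP100 information to a file

5. Multi-process scraping

1. Inspect the page source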

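Pressing F12 to inspect the page source shows that the information for each movie sits inside a <dd></dd> element.

Expanding one of these elements reveals the movie's rank, poster URL, title, cast, release date, and score.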
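To make the "one <dd> per movie" structure concrete, here is a minimal sketch that splits a page into its <dd> blocks. The HTML fragment is a simplified, hypothetical stand-in and not a copy of the real Maoyan markup; the helper name split_dd_blocks is also made up for this illustration.

```python
import re

# Hypothetical, heavily simplified stand-in for the ranking page.
SAMPLE_HTML = """
<dl class="board-wrapper">
<dd>
  <i class="board-index">1</i><p class="name">Movie A</p>
</dd>
<dd>
  <i class="board-index">2</i><p class="name">Movie B</p>
</dd>
</dl>
"""

def split_dd_blocks(html):
    """Return the inner HTML of every <dd>...</dd> element."""
    # re.S lets '.' match newlines, because a <dd> element spans several lines.
    return re.findall(r'<dd>(.*?)</dd>', html, re.S)

blocks = split_dd_blocks(SAMPLE_HTML)
print(len(blocks))
```

Each entry in blocks is the raw text of one movie, which is the unit the later regular expression works on.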

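2. Fetch a single page

Open the Maoyan Movies website in a browser, click "榜單" (Rankings), and then click "TOP100榜" (TOP100 board). This opens https://maoyan.com/board/4.

The following code then fetches the page source: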

#-*-coding:utf-8-*-
import requests
from requests.exceptions import RequestException

# Maoyan has anti-scraping measures; the page can be fetched once request headers are set
headers = {
	'Content-Type': 'text/plain; charset=UTF-8',
	'Origin':'https://maoyan.com',
	'Referer':'https://maoyan.com/board/4',
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
	}

# Fetch the page source
def get_one_page(url,headers):
	try:
		response =requests.get(url,headers =headers)
		if response.status_code == 200:
			return response.text
		return None
	# any requests error (connection failure, timeout, ...) counts as a failed fetch
	except RequestException:
		return None

def main():
	url = "https://maoyan.com/board/4"
	html = get_one_page(url,headers)
	print(html)

if __name__ == '__main__':
	main()

Running it prints the HTML source of the page.
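Because get_one_page returns None when the request fails or the status code is not 200, callers should be ready for a missing page. The sketch below shows one possible way to retry a few times before giving up. The helper fetch_with_retry and the stand-in flaky_fetch are hypothetical names introduced here so the example runs without network access; in real use the fetch argument would wrap get_one_page.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url) until it returns something other than None.

    fetch is any callable with the same contract as get_one_page:
    page text on success, None on failure.
    """
    for attempt in range(retries):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries - 1:
            time.sleep(delay)  # pause before the next attempt
    return None

# Hypothetical stand-in fetcher: fails twice, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return None if len(calls) < 3 else '<html>ok</html>'

result = fetch_with_retry(flaky_fetch, 'https://maoyan.com/board/4')
print(result)
```

With the real function, the call could look like fetch_with_retry(lambda u: get_one_page(u, headers), url, delay=1.0), and the caller would still check the result for None before parsing it.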

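3. Extract information with a regular expression

The fields to extract from each <dd> element are the rank, poster URL, title, cast, release date, and score. The code is as follows: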

#-*-coding:utf-8-*-
import requests
import re
from requests.exceptions import RequestException

# Maoyan has anti-scraping measures; the page can be fetched once request headers are set
headers = {
	'Content-Type': 'text/plain; charset=UTF-8',
	'Origin':'https://maoyan.com',
	'Referer':'https://maoyan.com/board/4',
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
	}

# Fetch the page source
def get_one_page(url,headers):
	try:
		response =requests.get(url,headers =headers)
		if response.status_code == 200:
			return response.text
		return None
	# any requests error (connection failure, timeout, ...) counts as a failed fetch
	except RequestException:
		return None

# Extract the fields with a regular expression
def parse_one_page(html):
	# raw strings keep \d from being treated as a string escape
	pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
		+r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',re.S)
	items = re.findall(pattern,html)
	for item in items:
		yield{
		'index':item[0],
		'image':item[1],
		'title':item[2],
		'actor':item[3].strip()[3:],	# drop the 3-character "主演：" label
		'time':item[4].strip()[5:],	# drop the 5-character "上映時間：" label
		'score':item[5]+item[6]	# integer part + fractional part
		}

def main():
	url = "https://maoyan.com/board/4"
	html = get_one_page(url,headers)
	for item in parse_one_page(html):
		print(item)

if __name__ == '__main__':
	main()

Running it prints one dictionary per movie.
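To see what each capture group picks up, the sketch below runs the same pattern against a single hand-written <dd> fragment. The fragment and all values in it (title, cast, date, score, URLs) are made up for illustration and only mimic the attributes the pattern relies on; the real page contains more markup.

```python
import re

# Same pattern as in the article, written as raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)

# Hypothetical fragment; every value here is invented.
SAMPLE_DD = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/0" title="Example Movie">Example Movie</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映時間：2000-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">0</i></p>
</dd>
'''

def parse(html):
    """Yield one dictionary per <dd> block matched by PATTERN."""
    for item in PATTERN.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # strip whitespace, drop "主演："
            'time': item[4].strip()[5:],   # drop "上映時間："
            'score': item[5] + item[6],    # "9." + "0"
        }

movies = list(parse(SAMPLE_DD))
print(movies[0])
```

The fixed slices [3:] and [5:] only work while the labels keep exactly those lengths, so a change in the page wording would silently cut the wrong characters.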

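4. Write all Maoyan TOP100 information to a file

The code above scrapes a single page. To collect all 100 movies, look at how the URL changes from page to page: each page appends '?offset=N' to the original URL, where N starts at 0 and increases in steps of 10 (0, 10, 20, and so on).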
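The paging rule can be sketched as a small helper that builds every page URL from its offset. The function name page_urls and its parameters are assumptions made for this illustration; the base URL and the step of 10 come from the description above.

```python
BASE_URL = 'https://maoyan.com/board/4'

def page_urls(pages=10, page_size=10):
    """Build the URL of every ranking page from its offset."""
    return [f'{BASE_URL}?offset={page * page_size}' for page in range(pages)]

urls = page_urls()
print(urls[0])
print(urls[-1])
```

Ten pages of ten movies each give offsets 0 through 90, which covers the 100 entries of the board.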

The full implementation is as follows:

#-*-coding:utf-8-*-
import requests
import re
import json
import os
from requests.exceptions import RequestException

# Maoyan has anti-scraping measures; the page can be fetched once request headers are set
headers = {
	'Content-Type': 'text/plain; charset=UTF-8',
	'Origin':'https://maoyan.com',
	'Referer':'https://maoyan.com/board/4',
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
	}

# Fetch the page source
def get_one_page(url,headers):
	try:
		response =requests.get(url,headers =headers)
		if response.status_code == 200:
			return response.text
		return None
	# any requests error (connection failure, timeout, ...) counts as a failed fetch
	except RequestException:
		return None

# Extract the fields with a regular expression
def parse_one_page(html):
	# raw strings keep \d from being treated as a string escape
	pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
		+r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',re.S)
	items = re.findall(pattern,html)
	for item in items:
		yield{
		'index':item[0],
		'image':item[1],
		'title':item[2],
		'actor':item[3].strip()[3:],	# drop the 3-character "主演：" label
		'time':item[4].strip()[5:],	# drop the 5-character "上映時間：" label
		'score':item[5]+item[6]	# integer part + fractional part
		}
# Write the Maoyan TOP100 information to a file
def write_to_file(content):
	# encoding='utf-8' and ensure_ascii=False keep Chinese text readable in the file
	with open('result.txt','a',encoding ='utf-8') as f:
		f.write(json.dumps(content,ensure_ascii =False)+'\n')
# Download a movie poster
def save_image_file(url,path):

	jd = requests.get(url)
	if jd.status_code == 200:
		# the with block closes the file, so no explicit close() is needed
		with open(path,'wb') as f:
			f.write(jd.content)

def main(offset):
	url = "https://maoyan.com/board/4?offset="+str(offset)
	html = get_one_page(url,headers)
	if not os.path.exists('covers'):
		os.mkdir('covers')	
	for item in parse_one_page(html):
		print(item)
		write_to_file(item)
		save_image_file(item['image'],'covers/'+item['title']+'.jpg')

if __name__ == '__main__':
	# Scrape every page in turn
	for i in range(10):
		main(i*10)

After the run, result.txt holds one JSON line per movie and the covers directory holds the poster images.
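The file format is one JSON object per line, which makes it easy to read the results back. The following is a minimal sketch of that round trip. It writes to a temporary directory instead of the real result.txt, and the helper names append_record and load_records as well as the two sample records are invented for this example.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one record as a single JSON line."""
    # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

def load_records(path):
    """Read every non-empty line back into a dictionary."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo in a temporary directory so no real result.txt is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
append_record(demo_path, {'index': '1', 'title': '示例電影'})
append_record(demo_path, {'index': '2', 'title': 'Example Movie'})

records = load_records(demo_path)
with open(demo_path, encoding='utf-8') as f:
    raw_text = f.read()
print(raw_text)
```

Because the file is opened in append mode, running the scraper twice adds a second copy of every record, so deleting result.txt before a fresh run avoids duplicates.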

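5. Multi-process scraping

Comparing run times shows that the concurrent version is noticeably faster. Note that the code uses multiprocessing.Pool, so the pages are spread across worker processes rather than threads.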
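The speed-up comes from waiting on several network requests at the same time. The sketch below illustrates the effect offline, under two assumptions: a sleep stands in for network latency, and a thread pool from concurrent.futures is used instead of the multiprocessing.Pool in the article's code. The function fake_scrape is a hypothetical stand-in for main(offset).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_scrape(offset):
    """Stand-in for main(offset): wait briefly, then return the offset."""
    time.sleep(0.05)  # simulated network latency
    return offset

offsets = [i * 10 for i in range(10)]

# One page after another.
start = time.perf_counter()
sequential = [fake_scrape(o) for o in offsets]
sequential_seconds = time.perf_counter() - start

# All pages at once; executor.map returns results in input order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    parallel = list(executor.map(fake_scrape, offsets))
parallel_seconds = time.perf_counter() - start

print(f'sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s')
```

The actual timings depend on the machine and, for a real scraper, on the network and the site's rate limits, so the numbers printed here say nothing about Maoyan itself.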

The complete code is below:

#-*-coding:utf-8-*-
import requests
import re
import json
import os
from requests.exceptions import RequestException
from multiprocessing import Pool
# Maoyan has anti-scraping measures; the page can be fetched once request headers are set
headers = {
	'Content-Type': 'text/plain; charset=UTF-8',
	'Origin':'https://maoyan.com',
	'Referer':'https://maoyan.com/board/4',
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
	}

# Fetch the page source
def get_one_page(url,headers):
	try:
		response =requests.get(url,headers =headers)
		if response.status_code == 200:
			return response.text
		return None
	# any requests error (connection failure, timeout, ...) counts as a failed fetch
	except RequestException:
		return None

# Extract the fields with a regular expression
def parse_one_page(html):
	# raw strings keep \d from being treated as a string escape
	pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
		+r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',re.S)
	items = re.findall(pattern,html)
	for item in items:
		yield{
		'index':item[0],
		'image':item[1],
		'title':item[2],
		'actor':item[3].strip()[3:],	# drop the 3-character "主演：" label
		'time':item[4].strip()[5:],	# drop the 5-character "上映時間：" label
		'score':item[5]+item[6]	# integer part + fractional part
		}
# Write the Maoyan TOP100 information to a file
def write_to_file(content):
	# encoding='utf-8' and ensure_ascii=False keep Chinese text readable in the file
	with open('result.txt','a',encoding ='utf-8') as f:
		f.write(json.dumps(content,ensure_ascii =False)+'\n')
# Download a movie poster
def save_image_file(url,path):

	jd = requests.get(url)
	if jd.status_code == 200:
		# the with block closes the file, so no explicit close() is needed
		with open(path,'wb') as f:
			f.write(jd.content)

def main(offset):
	url = "https://maoyan.com/board/4?offset="+str(offset)
	html = get_one_page(url,headers)
	if not os.path.exists('covers'):
		os.mkdir('covers')	
	for item in parse_one_page(html):
		print(item)
		write_to_file(item)
		save_image_file(item['image'],'covers/'+item['title']+'.jpg')

if __name__ == '__main__':
	# Scrape every page, spreading the pages across a pool of worker processes
	pool = Pool()
	pool.map(main,[i*10 for i in range(10)])
	pool.close()
	pool.join()
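One side effect of pool.map is that the worker processes append to result.txt in whatever order they finish, so the lines are not guaranteed to be in rank order. A possible post-processing step is to sort the records by their index field. The sketch below assumes the JSON-lines format written by write_to_file; the helper name sort_lines_by_rank and the sample lines are made up for illustration.

```python
import json

def sort_lines_by_rank(lines):
    """Parse JSON lines and sort the records by their numeric 'index'."""
    records = [json.loads(line) for line in lines if line.strip()]
    # 'index' is stored as a string, so convert it before comparing;
    # otherwise "11" would sort before "2".
    return sorted(records, key=lambda record: int(record['index']))

# Hypothetical lines in the order several workers might have written them.
shuffled = [
    '{"index": "11", "title": "K"}',
    '{"index": "2", "title": "B"}',
    '{"index": "1", "title": "A"}',
]

ordered = sort_lines_by_rank(shuffled)
print([record['index'] for record in ordered])
```

In practice the lines argument could simply be the open result.txt file object, since iterating over a text file yields its lines.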
