Python Web Scraping in Practice: Downloading Images from Toutiao (今日頭條)

阿新 • Published: 2019-01-29

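This article searches Toutiao (今日頭條) for the keyword "街拍路人" (street snaps of passers-by), extracts the images and titles from the results, creates a folder named after each title, and saves the images into the matching folder.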

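Importing libraries

from urllib.parse import urlencode --- joins the entries of a dictionary into a query string:

urlencode() accepts either a sequence of pairs, [(key1, value1), (key2, value2), ...], or a dictionary, {'key1': 'value1', 'key2': 'value2', ...}, and returns a string of the form key1=value1&key2=value2.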
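
As a quick illustration of the behaviour described above, here is a minimal sketch using only the standard library. The parameter values are placeholders chosen for this example, and parse_qs appears only to show that the encoded keyword can be decoded again.

```python
from urllib.parse import urlencode, parse_qs

# Both a dict and a list of (key, value) pairs are accepted.
from_dict = urlencode({'offset': 0, 'format': 'json'})
from_pairs = urlencode([('offset', 0), ('format', 'json')])
print(from_dict)   # offset=0&format=json
print(from_pairs)  # offset=0&format=json

# Non-ASCII values are percent-encoded as UTF-8 by default, so the
# Chinese keyword survives a round trip through parse_qs.
query = urlencode({'keyword': '街拍路人', 'offset': 20})
print(parse_qs(query))
```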

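from urllib.error import HTTPError --- exception handling (note that requests raises its own requests.exceptions.RequestException family, not this class)

from multiprocessing.pool import Pool --- process pool (multiprocessing.pool.Pool manages worker processes, not threads)

import requests --- HTTP requests

import os --- file and directory operations

import re --- regular expressions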

The data to scrape (web page)

This is part of the content to be extracted; the images and the titles are what we are after.

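Page analysis

The analysis shows that the data is loaded asynchronously via AJAX. Open the browser's Network panel, select XHR, and refresh the page to see the responses in JSON format. Click one of them and then click Preview: the data list contains what we want. Each entry's title is the title to scrape, and its image_list holds the links to the images. We only need to request this one URL each time, and the only part of it that changes is offset, so each request just passes in a new offset value.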

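Fetching the data and converting it to JSON

def get_page(offset):
	# Set the request headers
	headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
	# Set the request parameters
	data={
	'offset':offset,
	'format':'json',
	'keyword':'街拍路人',
	'autoload':'true',
	'count':'20',
	'cur_tab':'1',
	'from':'search_tab'
	}
	# Turn the data dictionary into a query string and append it to the base URL
	url='https://www.toutiao.com/search_content/?'+urlencode(data)
	try:
		response=requests.get(url,headers=headers)
		# Check the status code (200 means the request succeeded)
		if response.status_code == 200:
			# Decode the JSON body and return it
			return response.json()
		# Any other status code falls through and returns None implicitly
	except requests.RequestException:
		# requests raises its own exception classes, not urllib.error.HTTPError
		return None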

Parsing the data

def get_images(offset):
	# page is the decoded JSON, or None if the request failed
	page=get_page(offset)
	if page is not None:
		if page.get('data'):
			for item in page.get('data'):
				title=item.get('title')
				images=item.get('image_list')
				# Entries without an image list are skipped
				if images:
					for img in images:
						yield{
						'image':img.get('url'),
						'title':title
						}
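
To make the shape of the yielded items concrete, the sketch below runs the same extraction logic on a hand-written payload instead of a live response. The function name extract_images and the contents of sample_page (titles and the second URL) are hypothetical, modelled only on the fields described in this article (data, title, image_list, url).

```python
def extract_images(page):
    """Yield one {'image', 'title'} dict per image in a decoded page."""
    # Tolerate a failed request (None) and missing keys.
    for item in (page or {}).get('data') or []:
        title = item.get('title')
        for img in item.get('image_list') or []:
            yield {'image': img.get('url'), 'title': title}

# Hypothetical payload shaped like the JSON described in the analysis.
sample_page = {
    'data': [
        {
            'title': 'Street style A',
            'image_list': [
                {'url': '//p3.pstatp.com/list/pgc-image/15293938122252e5eb4198a'},
                {'url': '//p3.pstatp.com/list/pgc-image/example-2'},
            ],
        },
        {'title': 'Entry without images'},  # skipped: no image_list
    ]
}

results = list(extract_images(sample_page))
for entry in results:
    print(entry['title'], entry['image'])
```

Each image becomes its own item carrying the title of the entry it came from, which is what lets the saving step choose a folder per title.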

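Saving the images

First create a folder named after the extracted title, then build the request URL, and finally derive the file name from the URL: for a URL such as //p3.pstatp.com/list/pgc-image/15293938122252e5eb4198a, the trailing 15293938122252e5eb4198a is used as the image name.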
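
The file-name step can be sketched in isolation. The helper below is an alternative that simply keeps the last path segment; it is not the regular expression used in this article's code, and the name image_name_from_url is made up for the example. The second URL is a hypothetical variant without the pgc-image segment.

```python
def image_name_from_url(url):
    # Drop any trailing slash, then keep the text after the last '/'.
    return url.rstrip('/').rsplit('/', 1)[-1]

print(image_name_from_url('//p3.pstatp.com/list/pgc-image/15293938122252e5eb4198a'))
# 15293938122252e5eb4198a
print(image_name_from_url('//p3.pstatp.com/list/abc123'))
# abc123
```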

def save_images(item):
	# Create the folder for this title if it does not exist yet
	if not os.path.exists(item.get('title')):
		os.mkdir(item.get('title'))

	# Take the image link and download the image
	try:
		headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
		# The link is protocol-relative, so prepend the scheme
		response=requests.get('http:'+item.get('image'),headers=headers)
		if response.status_code == 200:
			# Capture the part of the URL after list/, e.g. pgc-image/15293938122252e5eb4198a in //p3.pstatp.com/list/pgc-image/15293938122252e5eb4198a
			image_name=re.match(r'//p[0-9].*?(list/|list/pgc-image/)(.*)',item.get('image')).group(2)
			if 'pgc-image/' in image_name:
				# Slice off the 10-character 'pgc-image/' prefix
				image_name=image_name[10:]
			elif re.search(r'[0-9]*/(.*)',image_name):
				image_name=re.search(r'[0-9]*/(.*)',image_name).group(1)
			# Folder path + file name
			file_path='{0}/{1}.{2}'.format(item.get('title'),image_name,'jpg')
			# Skip images that are already on disk
			if not os.path.exists(file_path):
				with open(file_path,'wb') as f:
					f.write(response.content)
			else:
				print('This image has already been downloaded!')
	except requests.RequestException:
		# requests raises its own exception classes, not urllib.error.HTTPError
		print('Failed to save the image')
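
Because the folder name comes straight from a scraped title, it may contain characters that a file system rejects. The sketch below shows one possible way to guard against that. The helpers safe_folder_name and ensure_folder are hypothetical additions, not part of the original script, and the set of replaced characters is an assumption based on what Windows disallows in file names.

```python
import os
import re
import tempfile

def safe_folder_name(title, fallback='untitled'):
    # Replace runs of characters that are invalid in Windows file names
    # (and '/' on POSIX) with an underscore, then trim spaces and dots.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', '_', title or '').strip(' .')
    return cleaned or fallback

def ensure_folder(base_dir, title):
    path = os.path.join(base_dir, safe_folder_name(title))
    # exist_ok=True makes a separate os.path.exists check unnecessary.
    os.makedirs(path, exist_ok=True)
    return path

# Try it in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    folder = ensure_folder(base, 'a/b:c?')
    print(os.path.basename(folder))  # a_b_c_
    folder_exists = os.path.isdir(folder)
```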

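Process pool

# main is not shown in this section; this minimal version is an assumption
# that ties together the two functions defined above
def main(offset):
	for item in get_images(offset):
		save_images(item)

if __name__=='__main__':
	# start and end are not defined in the original; these page numbers are placeholders
	start,end=0,5
	# Create the process pool
	pool=Pool()
	offset=[x*20 for x in range(start,end)]
	# The first argument is the function, the second an iterable whose items are passed to it one at a time
	pool.map(main,offset)
	# Stop accepting new tasks
	pool.close()
	# Block the main process until the worker processes exit
	pool.join()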
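
The map pattern above can be tried without any network access. The sketch below swaps in ThreadPool from the same multiprocessing.pool module, which offers the same map interface but runs workers as threads. That is a substitution for illustration, not what the original script uses. The names offsets_for, fake_main and run, and the pool size of 4, are hypothetical; fake_main stands in for the real download function.

```python
from multiprocessing.pool import ThreadPool

def offsets_for(start, end, page_size=20):
    # One offset per page: 0, 20, 40, ...
    return [page * page_size for page in range(start, end)]

def fake_main(offset):
    # Stand-in for the real per-page work (fetch, parse, save).
    return 'fetched offset {}'.format(offset)

def run(start, end):
    # map blocks until every task is done and keeps the input order.
    with ThreadPool(4) as pool:
        return pool.map(fake_main, offsets_for(start, end))

results = run(0, 3)
print(results)
# ['fetched offset 0', 'fetched offset 20', 'fetched offset 40']
```

Downloading images is mostly waiting on the network, so threads are a reasonable option here; a process pool as in the original works as well, provided the mapped function is defined at module level so it can be sent to the workers.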
