Crawling WeChat Articles with a Python Crawler

阿新 • Published: 2018-06-04

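A WeChat crawler for WeChat articles

Crawling official account articles

The Sogou WeChat search platform serves as the entry point: http://weixin.sogou.com/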

--------------------------------------------------------------

Search for the keyword "科技" (technology) and compare how the URL changes from page to page.

URL of the first results page: http://weixin.sogou.com/weixin?type=2&query=%E7%A7%91%E6%8A%80&ie=utf8&s_from=input&_sug_=y&_sug_type_=&w=01019900&sut=1942&sst0=1528078622302&lkt=1%2C1528078622200%2C1528078622200

URL of the second page: http://weixin.sogou.com/weixin?query=%E7%A7%91%E6%8A%80&_sug_type_=&sut=1942&lkt=1%2C1528078622200%2C1528078622200&s_from=input&_sug_=y&type=2&sst0=1528078622302&page=2&ie=utf8&w=01019900&dr=1

URL of the third page: http://weixin.sogou.com/weixin?query=%E7%A7%91%E6%8A%80&_sug_type_=&sut=1942&lkt=1%2C1528078622200%2C1528078622200&s_from=input&_sug_=y&type=2&sst0=1528078622302&page=3&ie=utf8&w=01019900&dr=1

-------------------------------------------------------------

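Comparing the URLs above shows which parameters matter: type=2, page=page number, query=keyword.

http://weixin.sogou.com/weixin?type=2&query=keyword&page=page-number

Test: http://weixin.sogou.com/weixin?type=2&query=科技&page=2 loads normally and shows the same content as the second-page URL above, so the simplified URL works.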
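
The simplified URL can also be assembled with the standard library instead of by hand. The following is a minimal sketch; the helper name build_search_url is hypothetical and not part of the original code. urllib.parse.urlencode percent-encodes the keyword as UTF-8, which gives the same %E7%A7%91%E6%8A%80 seen in the browser URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "http://weixin.sogou.com/weixin"

def build_search_url(keyword, page):
    """Return the simplified Sogou WeChat search URL for one results page.

    Hypothetical helper for illustration; only the three parameters
    identified above (type, query, page) are sent.
    """
    params = {"type": 2, "query": keyword, "page": page}
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("科技", 2))
# http://weixin.sogou.com/weixin?type=2&query=%E7%A7%91%E6%8A%80&page=2
```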

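Next, view the page source, analyze the article links in it, and test whether each one is the real link (that is, whether the link's encoding matches what the browser opens).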

<a data-z="art" target="_blank" id="sogou_vr_11002601_img_0" href="http://mp.weixin.qq.com/s?src=11&amp;timestamp=1528081388&amp;ver=917&amp;signature=qZMxpmvj4qn3CzMqGDpjC*GUmFni2E5rsy7kzRZmWSePCsFB5-EgQ4cZHzLaWrpHrtnz75U9mp85NkOS4VSzZwv-e8FZAGS1SoifHUdKstjiW7REiiv3ZKdk*-Q8xBZu&amp;new=1" uigs="article_image_0"><i></i><img src="http://img01.sogoucdn.com/net/a/04/link?appid=100520033&amp;url=http://mmbiz.qpic.cn/mmbiz_jpg/qV0mcHxAh1C5y2oDjZElicU9upoxrMsoBb8uzDspe62fMibIggVIUCLfwep8gc6IDCUuiab1XdmgiajxxtmMwrHOYg/0?wx_fmt=jpeg" onload="resizeImage(this,140,105)" onerror="errorImage(this)"></a>

Build the extraction rule: pat='<a data-z="art".*?(http://.*?)"'

The link extracted by this rule is:

http://mp.weixin.qq.com/s?src=11&amp;timestamp=1528081388&amp;ver=917&amp;signature=qZMxpmvj4qn3CzMqGDpjC*GUmFni2E5rsy7kzRZmWSePCsFB5-EgQ4cZHzLaWrpHrtnz75U9mp85NkOS4VSzZwv-e8FZAGS1SoifHUdKstjiW7REiiv3ZKdk*-Q8xBZu&amp;new=1
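
The extraction rule can be tried offline before running it against the live site. This is a small sketch that applies the same pattern to a shortened, hypothetical copy of the anchor tag shown above (the signature parameter is omitted for brevity). re.S lets ".*?" cross line breaks, which matters on the real page where the tag may span several lines.

```python
import re

# Shortened stand-in for one search result entry (hypothetical sample data).
sample = (
    '<a data-z="art" target="_blank" id="sogou_vr_11002601_img_0" '
    'href="http://mp.weixin.qq.com/s?src=11&amp;timestamp=1528081388&amp;new=1" '
    'uigs="article_image_0"><i></i></a>'
)

pat = '<a data-z="art".*?(http://.*?)"'

# The lazy ".*?" skips the attributes before href; the group stops at the
# first double quote after "http://".
links = re.compile(pat, re.S).findall(sample)
print(links)
# ['http://mp.weixin.qq.com/s?src=11&amp;timestamp=1528081388&amp;new=1']
```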

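Opening this link in a browser produces an error, so the extracted link cannot be used directly; it differs from the real link.

To see the real link, open the article directly from the search results page: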

https://mp.weixin.qq.com/s?src=11&timestamp=1528081388&ver=917&signature=qZMxpmvj4qn3CzMqGDpjC*GUmFni2E5rsy7kzRZmWSePCsFB5-EgQ4cZHzLaWrpHrtnz75U9mp85NkOS4VSzZwv-e8FZAGS1SoifHUdKstjiW7REiiv3ZKdk*-Q8xBZu&new=1

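Comparing the two links shows that they differ only in "amp;": the page source HTML-escapes every "&" as "&amp;". Removing "amp;" yields the real link.

url.replace("amp;","") strips the extra characters.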
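
As an alternative to the string replacement, the standard library's html.unescape reverses HTML escaping in general. This sketch is a swapped-in technique rather than what the original code does, and it uses a shortened, hypothetical link; it shows that both approaches give the same result for this kind of URL.

```python
import html

# Shortened version of the extracted link (hypothetical sample data).
raw = "http://mp.weixin.qq.com/s?src=11&amp;timestamp=1528081388&amp;ver=917&amp;new=1"

by_replace = raw.replace("amp;", "")   # approach used in this article
by_unescape = html.unescape(raw)       # general HTML unescaping

print(by_unescape)
# http://mp.weixin.qq.com/s?src=11&timestamp=1528081388&ver=917&new=1
print(by_replace == by_unescape)
# True
```

The replacement is enough here because "&amp;" is the only escape in these links; html.unescape may be the safer choice if other entities ever appear.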

On the article page, the title and the body text need to be extracted.

Analysis of the page source gives the following extraction rules:

titlepat='var msg_title = "(.*?)";'

contentpat='id="js_content">(.*?)id="js_sg_bar"'
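
The two rules can be checked against a small page fragment. The fragment below is a hypothetical, heavily simplified stand-in for a real article page; only the msg_title variable and the js_content and js_sg_bar ids are taken from the rules above.

```python
import re

# Hypothetical, simplified article page for illustration.
page = '''<script>var msg_title = "Sample title";</script>
<div class="rich_media_content" id="js_content">
<p>First paragraph.</p>
</div>
<div id="js_sg_bar"></div>'''

titlepat = 'var msg_title = "(.*?)";'
contentpat = 'id="js_content">(.*?)id="js_sg_bar"'

title = re.findall(titlepat, page)
# re.S is needed because the article body spans multiple lines.
content = re.findall(contentpat, page, re.S)

print(title)
# ['Sample title']
print(repr(content[0]))
# '\n<p>First paragraph.</p>\n</div>\n<div '
```

Note that the captured content ends with the opening of the next tag ("<div "), because the pattern stops at the id attribute rather than at a tag boundary. Browsers tolerate this when the fragment is written into a local HTML file.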

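The code follows. The IP proxy part is commented out: free proxies rarely work, since most of the addresses are unusable. To keep WeChat from blocking the local IP address, the code pauses between requests with time.sleep().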

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import urllib.request
import time
import urllib.error

## Install a browser-like User-Agent header for all requests
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
urllib.request.install_opener(opener)
## List that stores the article links, one sub-list per results page
listurl=[]

## Proxy helper function (disabled)
#def use_proxy(proxy_addr,url):
#	try:
#		import urllib.request
#		proxy=urllib.request.ProxyHandler({'http':proxy_addr})
#		opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
#		urllib.request.install_opener(opener)
#		data=urllib.request.urlopen(url).read().decode('utf-8')
#		data=str(data)
#		return data
#	except urllib.error.URLError as e:
#		if hasattr(e,"code"):
#			print(e.code)
#		if hasattr(e,"reason"):
#			print(e.reason)
#		time.sleep(10)
#	except Exception as e:
#		print("exception"+str(e))
#		time.sleep(1)
		
## Collect the article links from every results page
def getlisturl(key,pagestart,pageend):
	try:
		# Percent-encode the keyword for use in the URL
		keycode=urllib.request.quote(key)
		for page in range(pagestart,pageend+1):
			url="http://weixin.sogou.com/weixin?type=2&query="+keycode+"&page="+str(page)
			data1=urllib.request.urlopen(url).read().decode('utf-8')
			listurlpat='<a data-z="art".*?(http://.*?)"'
			# findall returns the links on this page; store one list per page
			listurl.append(re.compile(listurlpat,re.S).findall(data1))
			# Pause between requests to reduce the risk of being blocked
			time.sleep(2)
		print("Pages fetched: "+str(len(listurl)))
		# Report every page instead of indexing listurl[1], which fails for a single page
		for n,links in enumerate(listurl):
			print("Links on page "+str(pagestart+n)+": "+str(len(links)))
	except urllib.error.URLError as e:
		if hasattr(e,"code"):
			print(e.code)
		if hasattr(e,"reason"):
			print(e.reason)
		time.sleep(10)
	except Exception as e:
		print("exception"+str(e))
		time.sleep(1)
	# Return whatever was collected, even after an error, so the caller never gets None
	return listurl

## Fetch every article and write its title and content to a local HTML file
def getcontent(listurl):
	# Opening HTML for the local file
	html1 = '''
			<!DOCTYPE html>
			<html>
			<head>
			<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
			<title>WeChat articles</title>
			</head>
			<body>
			'''
	# Closing HTML for the local file
	html2='''</body>
	</html>
	'''
	titlepat='var msg_title = "(.*?)";'
	contentpat='id="js_content">(.*?)id="js_sg_bar"'
	# Open the file once; the with block closes it even if an error escapes the loop
	with open("/home/urllib/test/1.html","wb") as fh:
		fh.write(html1.encode("utf-8"))
		for i in range(0,len(listurl)):
			for j in range(0,len(listurl[i])):
				try:
					# Undo the HTML escaping of "&" to get the real article URL
					url=listurl[i][j].replace("amp;","")
					data=urllib.request.urlopen(url).read().decode('utf-8')
					title=re.compile(titlepat).findall(data)
					content=re.compile(contentpat,re.S).findall(data)
					# Defaults used when a pattern does not match
					thistitle = "not found"
					thiscontent= "not found"
					# A non-empty list means a match; take its first element
					if (title!=[]):
						thistitle = title[0]
					if (content!=[]):
						thiscontent = content[0]
					# Combine title and content into one HTML fragment
					dataall = "<p>Title: "+thistitle+"</p><p>Content: "+thiscontent+"</p><br>"
					fh.write(dataall.encode('utf-8'))
					print("Results page "+str(i)+", article "+str(j)+" processed")
					time.sleep(1)
				except urllib.error.URLError as e:
					if hasattr(e,"code"):
						print(e.code)
					if hasattr(e,"reason"):
						print(e.reason)
					time.sleep(10)
				except Exception as e:
					print("exception"+str(e))
					time.sleep(1)
		fh.write(html2.encode("utf-8"))
key="科技"
#proxy="122.114.31.177:808"
pagestart=1
pageend=3
listurl=getlisturl(key,pagestart,pageend)
getcontent(listurl)

The script runs normally.

It generates 1.html; opened in a browser, the file shows the content of the extracted articles.

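The code works as-is (as of 2018-06-03).

The page source tends to change over time, so the regular expressions will often need to be worked out again. Use the code above as a reference and adjust the expressions to extract the content.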
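
One way to make such adjustments easier is to keep the expressions in a single table, separate from the crawling logic. This is only a sketch of that idea; the names PATTERNS and extract_fields are hypothetical and do not appear in the code above.

```python
import re

# Hypothetical layout: field name -> (regex, flags). Only these entries need
# editing when the page source changes.
PATTERNS = {
    "title": ('var msg_title = "(.*?)";', 0),
    "content": ('id="js_content">(.*?)id="js_sg_bar"', re.S),
}

def extract_fields(page, patterns=PATTERNS, default="not found"):
    """Return the first match for each pattern, or `default` when none matches."""
    result = {}
    for name, (pattern, flags) in patterns.items():
        match = re.search(pattern, page, flags)
        result[name] = match.group(1) if match else default
    return result

print(extract_fields('var msg_title = "Hello";'))
# {'title': 'Hello', 'content': 'not found'}
```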
