A Renren Crawler in Python: Scraping All of a User's Status Updates

阿新 • Published: 2019-01-19

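I had barely used Python before and had never written a crawler. Over the past few days I found some time to learn, and wrote a Renren crawler for practice.

I used the BeautifulSoup4 package to parse HTML tags. Beautiful Soup is an HTML/XML parser written in Python that copes well with malformed markup and builds a parse tree from it, so it is commonly used to analyze the web documents a crawler fetches. For irregular HTML it also fills in much of the missing structure, which saves the developer time and effort. Using it feels similar to working with the DOM, and it is very convenient.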

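First, two tool recommendations: Fiddler on Windows, and Firebug, a Firefox extension, on Linux. I have not really used the former myself, but nearly everyone calls it an essential tool for development on Windows, so give it a try. Firebug is much like Chrome's Inspect Element, except that it runs in Firefox, and it is quite full-featured.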

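The approach: Renren data is not public, so to get at it we must first log in. How does a user log in? Anyone who has written web pages knows that logging in just means sending a POST request to a specific URL on the server. The request carries the user's login data, such as the username, the password, and possibly some other parameters. Once we know the target of that POST, we can simulate the login.

Go to Renren, open the browser's Inspect Element panel (or Firebug, or a similar tool), and switch to the Network tab. Enter the username and password, click Log in, and look through the captured requests for the ones of type POST. They show the request in a lot of detail.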

We also need something to store cookies. Python provides a very convenient module, cookielib, that stores cookies automatically.

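One more point: the reason for choosing 3g.renren.com rather than www.renren.com is that the desktop site generates the content of its "all statuses" page with Ajax, and the data it returns is so messy that it is awkward to handle with either BeautifulSoup or regular expressions. So we scrape the simpler mobile pages instead.

As for scraping Ajax pages, one option is PhantomJS plus selenium, which drives a headless browser and returns the page after it has finished loading. Another is to capture the traffic by hand, pick out the useful requests, and then issue GET requests to fetch the returned pages or JSON. I will not go into detail here.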

The login code is as follows:

try:
    self.cookie = cookielib.CookieJar()  # cookie jar for the login session
    self.cookieProc = urllib2.HTTPCookieProcessor(self.cookie)
except:
    raise
else:
    opener = urllib2.build_opener(self.cookieProc)
    # Masquerade as a browser
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
    urllib2.install_opener(opener)
url = 'http://3g.renren.com/login.do'  # Renren 3g login page
postdata = {
    'email':self.email,
    'password':self.password,
    }

req = urllib2.Request(url,urllib.urlencode(postdata))
index = urllib2.urlopen(req).read()
indexSoup = BeautifulSoup(index)  # BeautifulSoup object for the home page, used for tag extraction later
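
For readers on Python 3, where cookielib became http.cookiejar and urllib2 was folded into urllib.request and urllib.parse, the same setup might look roughly like the sketch below. The helper name build_login and the sample credentials are made up for illustration. The sketch only builds the opener and the request; the actual network call is left commented out.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://3g.renren.com/login.do"  # same login URL as in the article
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 "
              "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11")


def build_login(email, password):
    """Return (opener, request) for the login POST without sending anything.

    Hypothetical helper, not part of the original script.
    """
    jar = http.cookiejar.CookieJar()  # keeps session cookies between requests
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]  # look like a normal browser
    # Python 3 wants the POST body as bytes, hence the encode() call.
    body = urllib.parse.urlencode({"email": email, "password": password}).encode("utf-8")
    request = urllib.request.Request(LOGIN_URL, data=body)
    return opener, request


opener, request = build_login("user@example.com", "secret")
print(request.get_method())  # POST, because a request body is attached
# html = opener.open(request).read()  # the real network call, intentionally not run here
```

Attaching a body to the Request is what turns it into a POST; without data, the same call would issue a GET.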

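Next we fetch the statuses. With the cookie in place, each step only requires finding the link in the relevant node with BeautifulSoup, extracting it, and requesting it. The final target page is http://3g.renren.com/status/getdoing.do?&sid=YOUR_SID&id=YOUR_ID&curpage=PAGE

The id and sid are parameters that the GET request needs. Both values appear in the HTML of the home page returned after login, and we can pull them out with regular expressions.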
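
As a small illustration of that extraction, the sketch below applies the two regular expressions used by the full script to a sample href. The href is the profile link that appears in the status markup further down, so the id belongs to that sample rather than to the logged-in user, and status_page_url is a hypothetical helper that fills in the URL template given above.

```python
import re

# Sample href taken from the status markup shown later in the article.
href = "http://3g.renren.com/profile.do?id=600002874&sid=cnJq1dPr67pvYbAtOSCa9n&q1m99i"

user_id = re.findall(r"\d{6,}", href)[0]         # first run of six or more digits
sid = re.findall(r"(?<=sid=).*?(?=&)", href)[0]  # text between "sid=" and the next "&"


def status_page_url(sid, user_id, page):
    """Fill in the status-list URL template (hypothetical helper)."""
    template = "http://3g.renren.com/status/getdoing.do?&sid={}&id={}&curpage={}"
    return template.format(sid, user_id, page)


print(user_id, sid)
print(status_page_url(sid, user_id, 0))
```

Note that the sid pattern relies on another parameter following sid in the link; if sid were the last parameter, the lookahead for "&" would find nothing.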

To get a link, for example to follow the status link from the profile page, the following code does it:

url = profileSoup.select('.sec')[5].find_all('a')[3]['href']

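Here profileSoup is the BeautifulSoup object built from the profile page. .select('.sec') selects all tags with class sec and returns them as a list, find_all('a') picks out all the a tags, and ['href'] reads the link stored in the href attribute.

Much of the code gets its links through this kind of "exact positioning", so it is worth studying the page's HTML closely.

Once on the status page, we start collecting statuses. Analyzing the returned HTML shows that each page holds 10 statuses, each rendered as a div like this: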

<div>
    <a name="5222839593"></a>。。。真是獵奇啊。。轉自冷兔
    <p class="forward"><a href="http://3g.renren.com/profile.do?id=600002874&sid=cnJq1dPr67pvYbAtOSCa9n&q1m99i">冷兔</a>: 百度貼吧裡的獅子吧、老虎吧、獵豹吧甚至鬣狗吧互為天敵，而且時常爆發口炮大戰，激烈爭辯“究竟誰強”的話題，最後換取吧主批量封禁對方辯友帳號的結果，其中獅子吧和老虎吧更是老死不相往來，敵對程度遠勝中醫/反中醫、轉基因/反轉基因、甜黨/鹹黨等派別，而且戰火已經持續五年以上。（轉）歡迎新增冷兔微信id：lengtoo</p>
    <p class="time">7月17日 14:34 </p>
    <a href="http://3g.renren.com/status/replystatus.do?doingid=5222839593&id=303744791&ret=profile.do%3Fid%3D303744791%26amp%3Bhtf%3D2-n-%E6%88%91%E7%9A%84%E4%B8%AA%E4%BA%BA%E4%B8%BB%E9%A1%B5-n-0-u-status%2Fgetdoing.do%3F%26id%3D303744791%26htf%3D35%26sour%3Dprofile-n-%E6%88%91%E7%9A%84%E7%8A%B6%E6%80%81-n-0&sid=cnJq1dPr67pvYbAtOSCa9n&q1m99i&fr=l">回覆(3)</a> 
    <em>|</em> 
    <a href="http://3g.renren.com/status/forwardstatus.do?curpage=0&doingid=5222839593&fr=l&id=303744791&ret=profile.do%3Fid%3D303744791%26amp%3Bhtf%3D2-n-%E6%88%91%E7%9A%84%E4%B8%AA%E4%BA%BA%E4%B8%BB%E9%A1%B5-n-0-u-status%2Fgetdoing.do%3F%26id%3D303744791%26htf%3D35%26sour%3Dprofile-n-%E6%88%91%E7%9A%84%E7%8A%B6%E6%80%81-n-0&sid=cnJq1dPr67pvYbAtOSCa9n&q1m99i">分享</a> 
    <em>|</em> 
    <a href="http://3g.renren.com/status/wdelstatus.do?curpage=0&id=5222839593&ret=profile.do%3Fid%3D303744791%26amp%3Bhtf%3D2-n-%E6%88%91%E7%9A%84%E4%B8%AA%E4%BA%BA%E4%B8%BB%E9%A1%B5-n-0-u-status%2Fgetdoing.do%3F%26id%3D303744791%26htf%3D35%26sour%3Dprofile-n-%E6%88%91%E7%9A%84%E7%8A%B6%E6%80%81-n-0&sid=cnJq1dPr67pvYbAtOSCa9n&q1m99i">刪除</a>
   </div>

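Comparing this with the other tags, we find that the HTML for every status contains a time entry of the form <p class="time">X月X日 xx:xx</p> (month, day, and time). It can therefore serve as a marker that helps us find the 10 status divs on each page.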

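Looking further, we can observe the following patterns:

The text after the first a tag is the status content. If the status is a forward, it is the comment we added when forwarding.

For a forwarded status, the content after the child <a> node under <p class='forward'> is the original text being forwarded.

A status that is not a forward has no <p class='forward'> tag.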
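
These rules can be tried out on plain strings before involving any HTML parsing. The sketch below assumes the two pieces of text have already been pulled out of the div; split_status and the "N/A" placeholder are hypothetical names, and the regular expression is the same one the crawler uses to cut the text at "轉自" ("forwarded from").

```python
import re

FORWARD_MARK = "轉自"  # "forwarded from": separates our comment from the source name
PLACEHOLDER = "N/A"    # hypothetical stand-in for "nothing here"


def split_status(text_after_anchor, forwarded_text=None):
    """Return (own_text, original_text) for one status.

    text_after_anchor: the text following the first <a> in the status div.
    forwarded_text: the text after the <a> inside <p class="forward">,
    or None when the status is not a forward.
    """
    if forwarded_text is None:
        # Not a forward: the text is the whole status, and there is no original.
        return text_after_anchor, PLACEHOLDER
    # Forward: keep only what precedes the marker, i.e. our own comment.
    match = re.search(r"^.*?(?=" + FORWARD_MARK + ")", text_after_anchor)
    own = match.group(0) if match else PLACEHOLDER
    return own, forwarded_text


print(split_status("。。。真是獵奇啊。。轉自冷兔", ": 百度貼吧裡的獅子吧"))
print(split_status("an original status"))
```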

After a page has been scraped, we need the link to the next page. The parent of that <a> tag has the CSS class l (class="l"), which lets us locate it precisely.

totalPageHtml = statusSoup.select(".gray")[0].contents
totalPage = re.findall(r"(?<=/)\d+(?=[^\d])",str(totalPageHtml))  # regex match for the total number of pages
totalPage = int(totalPage[0])
print "Total pages: "+str(totalPage)

nowPage = 1
while (nowPage<=totalPage):  # loop until every page has been scraped
    print "Fetching statuses on page "+str(nowPage)
    statusList = statusSoup.select(".list")[0].children
    for child in statusList:
        if (child.select(".time")):  # Step 1: look for the timestamp to confirm this is a status div, not some other child div
            statusDate.append(child.select(".time")[0].string)
            if (child.select(".forward")):  # Step 2: look for class forward; these are forwarded statuses
                tempStr = str(child.a.next_element)
                # Drop everything from "轉自" ("forwarded from") onward, keeping only our own comment
                m = re.findall(r"^.*?(?=轉自)",tempStr)
                if m:
                    statusContent.append(m[0])
                else:
                    statusContent.append("N/A")

                originStatusContent.append(child.select(".forward")[0].a.next_element.next_element)
            else:  # Step 3: original statuses, stored as-is
                statusContent.append(child.a.next_element)
                originStatusContent.append("N/A")
    nowPage = nowPage+1
    if (nowPage>totalPage): break
    nextPageUrl = str(statusSoup.select(".l")[0].a['href'])  # find the next-page URL and follow it
    req = urllib2.Request(nextPageUrl)
    statusFile = urllib2.urlopen(req).read()
    statusSoup = BeautifulSoup(statusFile)
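
The page-count regular expression at the top of that loop can be checked in isolation. In the sketch below the pager text is made up, since the article does not show what the .gray element actually contains, and the helper total_pages with its fallback of 1 is likewise an assumption.

```python
import re

# Hypothetical pager text; the wording on the real page may differ.
pager_text = "(第1/23頁)"


def total_pages(text):
    """Return the number after the slash, or 1 if no pager is found (assumed fallback)."""
    # Digits preceded by "/" and followed by one non-digit character.
    found = re.findall(r"(?<=/)\d+(?=[^\d])", text)
    return int(found[0]) if found else 1


print(total_pages(pager_text))
```

One detail worth knowing: the lookahead demands a non-digit character after the number, so a string that ends right after the digits, such as "1/23", does not match at all.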

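One more recommendation: kiki, a regular-expression tool for Ubuntu. It is small and handy, and can be installed directly with apt-get install kiki.

That completes the scraping work. The full code is attached below. It reads the username and password from the console and, when scraping finishes, saves the results to the file UserData/<user id>.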

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import time
import sys
import urllib
import urllib2
import cookielib
import os
import re
from bs4 import BeautifulSoup
 
#import js2html
class renrenSpider:   
 
    def __init__(self,email,password):
        self.email = email
        self.password = password
        self.domain = 'renren.com'
        self.id = ''
        self.sid = ''
        try:
            self.cookie = cookielib.CookieJar()
            self.cookieProc = urllib2.HTTPCookieProcessor(self.cookie)
        except:
            raise
        else:
            opener = urllib2.build_opener(self.cookieProc)
            opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
            urllib2.install_opener(opener)
     
    def login(self):
        url = 'http://3g.renren.com/login.do'  # Renren 3g login page
        postdata = {
                    'email':self.email,
                    'password':self.password,
                    }

        req = urllib2.Request(url,urllib.urlencode(postdata))
        index = urllib2.urlopen(req).read()
        indexSoup = BeautifulSoup(index)  # soup for the home page

        # Save the home page to disk for inspection
        indexFile = open('index.html','w')
        indexFile.write(indexSoup.prettify())
        indexFile.close()
        tmp = indexSoup.select('.cur')[0]
        idHref = tmp.parent.contents[2]['href']
        m = re.findall(r"\d{6,}",str(idHref))  # regex match for the user's id
        self.id = m[0]
        print "User id: " + self.id

        m = re.findall(r"(?<=sid=).*?(?=&)",str(idHref))  # regex match for the user's sid
        self.sid = m[0]
        print "User sid: " + self.sid

 
    def getStatus(self):
        # Fetch the user's status pages

        statusDate = []  # status timestamps
        statusContent = []  # status text
        originStatusContent = []  # original text of forwarded statuses
        url = 'http://3g.renren.com/profile.do'  # go to the profile page
        profileGetData = {
                          'id':str(self.id),
                          'sid':self.sid
                         }
        req = urllib2.Request(url,urllib.urlencode(profileGetData))
        profile = urllib2.urlopen(req).read()
        profileSoup = BeautifulSoup(profile)
        # print profileSoup.prettify()
        url = profileSoup.select('.sec')[5].find_all('a')[3]['href']  # link to the status list

        req = urllib2.Request(url)
        statusFile = urllib2.urlopen(req).read()
        statusSoup = BeautifulSoup(statusFile)
        # Save the first status page to disk for inspection
        statusFile = open("status.html",'w')
        statusFile.write(statusSoup.prettify())
        statusFile.close()

        totalPageHtml = statusSoup.select(".gray")[0].contents
        totalPage = re.findall(r"(?<=/)\d+(?=[^\d])",str(totalPageHtml))
        totalPage = int(totalPage[0])
        print "Total pages: "+str(totalPage)

        nowPage = 1
        # totalPage = 15
        while (nowPage<=totalPage):
            print "Fetching statuses on page "+str(nowPage)
            statusList = statusSoup.select(".list")[0].children
            for child in statusList:
                #statusNum = statusNum+1
                if (child.select(".time")):  # Step 1: look for the timestamp to confirm this is a status
                    statusDate.append(child.select(".time")[0].string)
                    if (child.select(".forward")):  # Step 2: look for class forward; these are forwarded statuses
                        tempStr = str(child.a.next_element)
                        # Keep only the text before "轉自" ("forwarded from")
                        m = re.findall(r"^.*?(?=轉自)",tempStr)
                        if m:
                            statusContent.append(m[0])
                        else:
                            statusContent.append("N/A")

                        originStatusContent.append(child.select(".forward")[0].a.next_element.next_element)
                    else:  # Step 3: original statuses, stored as-is
                        statusContent.append(child.a.next_element)
                        originStatusContent.append("N/A")
            nowPage = nowPage+1
            if (nowPage>totalPage): break
            nextPageUrl = str(statusSoup.select(".l")[0].a['href'])  # find the next-page URL and follow it
            req = urllib2.Request(nextPageUrl)
            statusFile = urllib2.urlopen(req).read()
            statusSoup = BeautifulSoup(statusFile)
            # for state in statusList:
            #     print state.name

        if not os.path.isdir("UserData"):  # create the output directory if it is missing
            os.makedirs("UserData")
        finalFile = open("UserData/"+self.id,"w")
        for i in range(0,len(statusDate)):
            finalFile.write("No. "+str(i+1)+":"+"\n")
            finalFile.write("Time: "+str(statusDate[i])+"\n")
            finalFile.write("Status: "+str(statusContent[i])+"\n")
            finalFile.write("Forwarded original: "+str(originStatusContent[i])+"\n")
            finalFile.write("\n")
        finalFile.close()
 
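if __name__ == '__main__':
    email = raw_input("Enter your Renren account: ")
    password = raw_input("Enter your Renren password: ")
    # Python 2 workaround so that str() on Chinese text does not raise UnicodeEncodeError
    reload(sys)
    sys.setdefaultencoding('utf-8')
    renrenLogin = renrenSpider(email,password)
    renrenLogin.login()
    renrenLogin.getStatus()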
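
The save step at the end of getStatus could also be factored into a small helper. The Python 3 sketch below is one possible shape for it, under the assumption that the three lists are already filled; save_statuses, the English labels, and the sample record are made up for illustration, and the demo writes into a temporary directory rather than the working directory.

```python
import os
import tempfile


def save_statuses(directory, user_id, dates, contents, originals):
    """Write one record per status to directory/user_id and return the path.

    Hypothetical helper, not part of the original script.
    """
    os.makedirs(directory, exist_ok=True)  # create the output directory if needed
    path = os.path.join(directory, user_id)
    with open(path, "w", encoding="utf-8") as out:
        records = zip(dates, contents, originals)
        for number, (date, content, original) in enumerate(records, start=1):
            out.write("No. {}:\n".format(number))
            out.write("Time: {}\n".format(date))
            out.write("Status: {}\n".format(content))
            out.write("Forwarded original: {}\n\n".format(original))
    return path


demo_dir = os.path.join(tempfile.mkdtemp(), "UserData")
saved = save_statuses(demo_dir, "303744791",
                      ["7月17日 14:34"], ["。。。真是獵奇啊。。"], ["N/A"])
print(saved)
```

Opening the file with an explicit encoding avoids depending on the platform default when the statuses contain Chinese text.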
