A Summary of Scraping Zhihu with a Python Crawler

阿新 • Published: 2019-01-22

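I recently learned a bit about web crawling and used Python to implement a few features for scraping Zhihu; this post is a short summary. A web crawler is a program or script that automatically fetches information from the web according to a set of rules. Machine learning and data mining both start from large amounts of data and look for valuable patterns in it, and a crawler helps with the hard part of obtaining that data, so web crawling is a skill worth having.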

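Python has many open-source packages to draw on. Here I use requests, BeautifulSoup4 and json. The requests module handles the HTTP requests, while the bs4 and json modules extract the information we want from the responses. I will not go into the details of each module here. Below I walk through scraping Zhihu one feature at a time.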

Simulating login

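To scrape Zhihu we first have to simulate a login, because much of the information is inaccessible otherwise. The login function is shown below; I took it directly from Zhihu user fireling. Fill in your account and password in the function's data dictionary and run the function before crawling. If nothing goes wrong you are logged in and can go on to fetch the data you want. Note that the first time you use the function, the program asks you to type in the captcha by hand. After that, two new files appear in the current folder: cookiefile, which stores the cookies, and zhihucaptcha.gif, which stores the captcha image. On later runs the function reuses the saved cookies, so the captcha does not have to be entered again.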


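# Note: the listings in this post are Python 2 (print statements, raw_input).
import json
import os
import re
import sqlite3
import time

import requests
from bs4 import BeautifulSoup

# Headers used by the functions below. header_info is never defined in the
# listings, so this definition is an assumption that mirrors login()'s headers.
header_info = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:41.0) Gecko/20100101 Firefox/41.0',
    "Referer": "http://www.zhihu.com/",
    'Host': 'www.zhihu.com',
}


def login():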
    url = 'http://www.zhihu.com'
    loginURL = 'http://www.zhihu.com/login/email'

    headers = {
        "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:41.0) Gecko/20100101 Firefox/41.0',
        "Referer": "http://www.zhihu.com/",
        'Host': 'www.zhihu.com',
    }

    data = {
        'email': 'your_email@example.com',  # placeholder: fill in your own account
        'password': '**************',
        'rememberme': "true",
    }
    global s
    s = requests.session()
    global xsrf
    if os.path.exists('cookiefile'):
        with open('cookiefile') as f:
            cookie = json.load(f)
        s.cookies.update(cookie)
        req1 = s.get(url, headers=headers)
        soup = BeautifulSoup(req1.text, "html.parser")
        xsrf = soup.find('input', {'name': '_xsrf', 'type': 'hidden'}).get('value')
        # Write zhihu.html so we can check whether the login succeeded
        with open('zhihu.html', 'w') as f:
            f.write(req1.content)
    else:
        req = s.get(url, headers=headers)
        print req

        soup = BeautifulSoup(req.text, "html.parser")
        xsrf = soup.find('input', {'name': '_xsrf', 'type': 'hidden'}).get('value')

        data['_xsrf'] = xsrf

        timestamp = int(time.time() * 1000)
        captchaURL = 'http://www.zhihu.com/captcha.gif?=' + str(timestamp)
        print captchaURL

        with open('zhihucaptcha.gif', 'wb') as f:
            captchaREQ = s.get(captchaURL, headers=headers)
            f.write(captchaREQ.content)
        loginCaptcha = raw_input('input captcha:\n').strip()
        data['captcha'] = loginCaptcha
        print data
        loginREQ = s.post(loginURL, headers=headers, data=data)
        if not loginREQ.json()['r']:
            print s.cookies.get_dict()
            with open('cookiefile', 'wb') as f:
                json.dump(s.cookies.get_dict(), f)
        else:
            print 'login fail'

Note that login() creates a global variable s = requests.session(). We use this global to access Zhihu, and the object keeps our simulated login alive for the whole crawl.
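
The cookie round trip that login() relies on can be looked at in isolation. The following is a minimal sketch, written for Python 3 and using only the standard library rather than the Python 2 style of the listing. The helper names save_cookies and load_cookies are hypothetical, and the plain dict stands in for what s.cookies.get_dict() returns. One detail worth noting: under Python 3, json.dump writes text, so the file is opened with 'w' instead of the 'wb' used in login().

```python
import json
import os
import tempfile


def save_cookies(path, cookies):
    """Write a name -> value cookie dict to path as JSON."""
    with open(path, 'w') as f:
        json.dump(cookies, f)


def load_cookies(path):
    """Return the saved cookie dict, or an empty dict if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


if __name__ == '__main__':
    # Hypothetical cookie names and values, for illustration only.
    demo_path = os.path.join(tempfile.mkdtemp(), 'cookiefile')
    save_cookies(demo_path, {'z_c0': 'token-value', '_xsrf': 'abc123'})
    print(load_cookies(demo_path))
```

With requests, the loaded dict can then be passed to s.cookies.update(...), which is what login() does when cookiefile already exists.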

Getting a user's basic information

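Every Zhihu user has a unique ID. Mine, for example, is marcovaldong, so my profile page is at https://www.zhihu.com/people/marcovaldong . A profile page shows the user's location, industry, gender, education, upvotes and thanks received, who they follow, who follows them, and so on. So I start with how to fetch a single user's information. The function get_userInfo(userID) below scrapes one user's profile: pass it a user ID and it returns a tuple of 19 values: nickname, ID, location, industry, gender, company, position, school, major, upvotes, thanks, questions asked, answers, articles, collections, public edits, number of people followed, number of followers, and the number of people who have viewed the profile.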


def get_userInfo(userID):
    user_url = 'https://www.zhihu.com/people/' + userID
    response = s.get(user_url, headers=header_info)
    # print response
    soup = BeautifulSoup(response.content, 'lxml')
    name = soup.find_all('span', {'class': 'name'})[1].string
    # print 'name: %s' % name
    ID = userID
    # print 'ID: %s' % ID
    location = soup.find('span', {'class': 'location item'})
    if location == None:
        location = 'None'
    else:
        location = location.string
    # print 'location: %s' % location
    business = soup.find('span', {'class': 'business item'})
    if business == None:
        business = 'None'
    else:
        business = business.string
    # print 'business: %s' % business
    gender = soup.find('input', {'checked': 'checked'})
    if gender == None:
        gender = 'None'
    else:
        gender = gender['class'][0]
    # print 'gender: %s' % gender
    employment = soup.find('span', {'class': 'employment item'})
    if employment == None:
        employment = 'None'
    else:
        employment = employment.string
    # print 'employment: %s' % employment
    position = soup.find('span', {'class': 'position item'})
    if position == None:
        position = 'None'
    else:
        position = position.string
    # print 'position: %s' % position
    education = soup.find('span', {'class': 'education item'})
    if education == None:
        education = 'None'
    else:
        education = education.string
    # print 'education: %s' % education
    major = soup.find('span', {'class': 'education-extra item'})
    if major == None:
        major = 'None'
    else:
        major = major.string
    # print 'major: %s' % major

    agree = int(soup.find('span', {'class': 'zm-profile-header-user-agree'}).strong.string)
    # print 'agree: %d' % agree
    thanks = int(soup.find('span', {'class': 'zm-profile-header-user-thanks'}).strong.string)
    # print 'thanks: %d' % thanks
    infolist = soup.find_all('a', {'class': 'item'})
    asks = int(infolist[1].span.string)
    # print 'asks: %d' % asks
    answers = int(infolist[2].span.string)
    # print 'answers: %d' % answers
    posts = int(infolist[3].span.string)
    # print 'posts: %d' % posts
    collections = int(infolist[4].span.string)
    # print 'collections: %d' % collections
    logs = int(infolist[5].span.string)
    # print 'logs: %d' % logs
    followees = int(infolist[-2].strong.string)
    # print 'followees: %d' % followees
    followers = int(infolist[-1].strong.string)
    # print 'followers: %d' % followers
    scantime = int(soup.find_all('span', {'class': 'zg-gray-normal'})[-1].strong.string)
    # print 'scantime: %d' % scantime

    info = (name, ID, location, business, gender, employment, position,
            education, major, agree, thanks, asks, answers, posts,
            collections, logs, followees, followers, scantime)
    return info

if __name__ == '__main__':
    login()
    userID = 'marcovaldong'
    info = get_userInfo(userID)
    print 'The information of ' + userID + ' is: '
    for i in range(len(info)):
        print info[i]

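The first screenshot shows part of my profile page, where these 19 values are visible, and the second shows the same 19 values printed in the terminal, so the two can be compared to check that everything was captured. I spent a long time debugging this function, because different users fill in their profiles to different degrees. If you find an error while using it, please let me know.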
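
Most of the debugging effort in get_userInfo goes into fields that some users leave empty, where soup.find() returns None. As an illustration only, the repeated if/else blocks could be folded into one small helper. The sketch below is Python 3; text_or_default and int_or_zero are hypothetical names, and SimpleNamespace objects stand in for parsed tags so that the example runs without BeautifulSoup. It relies only on the .string attribute that the listing already uses.

```python
from types import SimpleNamespace


def text_or_default(tag, default='None'):
    """Return the stripped text of tag, or default if the tag is missing.

    tag is anything with a .string attribute (such as a BeautifulSoup Tag),
    or None, which is what soup.find() returns when nothing matches.
    """
    if tag is None or tag.string is None:
        return default
    return tag.string.strip()


def int_or_zero(tag):
    """Parse the tag's text as a non-negative integer, falling back to 0."""
    text = text_or_default(tag, default='')
    return int(text) if text.isdigit() else 0


if __name__ == '__main__':
    # SimpleNamespace objects stand in for parsed tags in this demo.
    found = SimpleNamespace(string=' Beijing ')
    print(text_or_default(found))                      # Beijing
    print(text_or_default(None))                       # None
    print(int_or_zero(SimpleNamespace(string='42')))   # 42
```

In get_userInfo this would turn each five-line block into a single call such as location = text_or_default(soup.find('span', {'class': 'location item'})).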

Getting the full list of upvoters for an answer

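Let's first go over the overall flow. Every question on Zhihu has a unique ID, which is visible in its URL. For example, the question "2015 年有哪些書你讀過以後覺得名不符實？" is at https://www.zhihu.com/question/38808048 , and 38808048 is its ID. Every answer under a question also has a unique ID: the top-voted answer to that question, by 餘悅, is at https://www.zhihu.com/question/38808048/answer/81388411 , and the trailing 81388411 is that answer's ID within the question. Neither of these is the ID we need here, though. Fetching the upvoter list uses a different unique ID, obtained as follows. Suppose we want the upvoters of 老編輯's answer to "如何評價《人間正道是滄桑》這部電視劇？". Open Firebug and click "5321 人贊同" (5321 people upvoted). Firebug captures a "GET voters_profile" request; hovering over it reveals the link https://www.zhihu.com/answer/5430533/voters_profile , and 5430533 is the unique ID used to fetch the upvoter list. Note that this ID exists only after the answer has been upvoted. (A quick plug for the TV series 《人間正道是滄桑》: following the loves and feuds of Yang Liqing and his two siblings from the Great Revolution period to the War of Liberation, it gives a fairly comprehensive and objective picture of the ideological struggle between the Kuomintang and the Communist Party, and every viewing brings new insights.)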

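With the unique ID in hand, we use requests to GET the data Zhihu returns, which contains a JSON document holding the upvoters' information. When browsing the upvoter list on the web page, only 20 entries are shown at a time, and another 20 load each time you scroll to the bottom of the list; the request URL for that next batch is also contained in the JSON. So we need to extract both the upvoter information and the next request URL from the JSON. The page shows each upvoter's nickname, avatar, upvotes and thanks received, and number of questions asked and answered. Here I extract each upvoter's nickname, profile URL (that is, the user ID), upvotes, thanks, questions and answers. Avatar extraction is handled by a separate function later on.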

After extracting the upvoter list, I save the information to a txt file named after the unique ID. The function is implemented below.


Zhihu = 'http://www.zhihu.com'
def get_voters(ans_id):
    # Pass in the answer ID (found by watching the network traffic when clicking the "N 人贊同" link); the upvoters are saved to a .txt file named after this ID
    login()
    file_name = str(ans_id) + '.txt'
    f = open(file_name, 'w')
    source_url = Zhihu + '/answer/' +str(ans_id) +'/voters_profile'
    source = s.get(source_url, headers=header_info)
    print source
    content = source.content
    print content    # the JSON text
    data = json.loads(content)   # holds the total upvote count, one batch of upvoters, and the link to the next batch
    # Print the total number of upvotes
    print 'Total upvotes:'
    total = data['paging']['total']   # total number of upvotes
    print total
    # Each batch holds 10 upvoters (the last batch may hold fewer), so we loop over the batches
    nextsource_url = source_url     # 從第0組點贊者開始解析
    num = 0
    while nextsource_url!=Zhihu:
        try:
            nextsource = s.get(nextsource_url, headers=header_info)
        except:
            time.sleep(2)
            nextsource = s.get(nextsource_url, headers=header_info)
        # Parse this batch of upvoters
        nextcontent = nextsource.content
        nextdata = json.loads(nextcontent)
        # Extract each upvoter's basic information
        for each in nextdata['payload']:
            num += 1
            print num
            try:
                soup = BeautifulSoup(each, 'lxml')
                tag = soup.a
                title = tag['title']    # upvoter's username
                href = 'http://www.zhihu.com' + str(tag['href'])    # upvoter's profile URL
                # The upvoter's statistics
                items = soup.find_all('li')
                votes = items[0].string  # upvotes received
                tks = items[1].string  # thanks received
                ques = items[2].string  # questions asked
                ans = items[3].string  # questions answered
                # Save and print the upvoter's information
                string = title + '  ' + href + '  ' + votes + tks + ques + ans
                f.write(string + '\n')
                print string
            except:
                # Written when an upvoter's information is missing
                txt3 = '有點贊者的資訊缺失'
                f.write(txt3.decode('utf-8') + '\n')
                print txt3.decode('utf-8')
                continue
        # Get the link to the next batch of upvoters
        nextsource_url = Zhihu + nextdata['paging']['next']
    f.close()

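Note that the upvoter list can include anonymous users and users whose accounts have been deleted. Their information cannot be fetched, so in those cases I write the line "有點贊者的資訊缺失" ("an upvoter's information is missing") to the txt file.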
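
The paging logic of get_voters can be separated from the network and parsing code. The sketch below is a Python 3 illustration, not the author's code: iter_voters is a hypothetical name, fetch is any callable that returns the response body for a URL, and the canned responses (including the '?offset=2' query string) are made-up stand-ins for what the server sends. It keeps the same stopping rule as the listing: an empty paging.next leaves only the base URL, which ends the loop.

```python
import json


def iter_voters(fetch, first_url, base_url='http://www.zhihu.com'):
    """Yield every payload item, following paging.next until it is empty.

    fetch(url) must return the response body as a JSON string.
    """
    url = first_url
    while url != base_url:
        data = json.loads(fetch(url))
        for item in data['payload']:
            yield item
        # An empty 'next' leaves only base_url, which ends the loop,
        # mirroring the `while nextsource_url != Zhihu` test in get_voters.
        url = base_url + data['paging']['next']


if __name__ == '__main__':
    # Canned responses standing in for the server (hypothetical data).
    first = 'http://www.zhihu.com/answer/1/voters_profile'
    pages = {
        first: json.dumps({
            'paging': {'total': 3, 'next': '/answer/1/voters_profile?offset=2'},
            'payload': ['<a>A</a>', '<a>B</a>']}),
        first + '?offset=2': json.dumps({
            'paging': {'total': 3, 'next': ''},
            'payload': ['<a>C</a>']}),
    }
    print(list(iter_voters(pages.get, first)))
```

Passing fetch in as a parameter keeps the loop testable offline; in the real crawler it would wrap s.get(url, headers=header_info).content.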

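The same method can be used to fetch a user's follower list and followee list; the function for the followee list is listed below. The follower version has a problem, though: whenever I use it on a big-V (high-profile) user, the program fails on reaching the 10020th follower, as if Zhihu enforced an access limit. I have not found a fix for this yet, so if you have a solution, please let me know. Since I have not come across anyone who follows more than 10020 people, the followee function has not shown this error so far.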


def get_followees(username):
    # Pass in a username; the followees are saved to a .txt file named after the username
    followers_url = 'http://www.zhihu.com/people/' + username + '/followees'
    file_name = username + '.txt'
    f = open(file_name, 'w')
    data = s.get(followers_url, headers=header_info)
    print data  # prints <Response [200]> when the request succeeds
    content = data.content  # the raw HTML
    soup = BeautifulSoup(content, "lxml")   # parse the HTML
    # Get the number of followees
    totalsen = soup.select('span[class*="zm-profile-section-name"]')
    total = int(str(totalsen[0]).split(' ')[4])     # total number of followees
    print 'Total number of followees:'
    print total
    follist = soup.select('div[class*="zm-profile-card"]')  # list of followee cards
    num = 0 # counts which followee is being processed
    for follower in follist:
        tag =follower.a
        title = tag['title']    # username
        href = 'http://www.zhihu.com' + str(tag['href'])    # profile URL
        # Get the user's statistics
        num +=1
        print '%d   %f' % (num, num / float(total))
        # Alist = follower.find_all(has_attrs)
        Alist = follower.find_all('a', {'target': '_blank'})
        votes = Alist[0].string  # upvotes received
        tks = Alist[1].string  # thanks received
        ques = Alist[2].string  # questions asked
        ans = Alist[3].string  # questions answered
        # Print the followee's information
        string = title + '  ' + href + '  ' + votes + tks + ques + ans
        try:
            print string.decode('utf-8')
        except:
            print string.encode('gbk', 'ignore')
        f.write(string + '\n')

    # Number of additional pages to load
    n = total/20-1 if total/20.0-total/20 == 0 else total/20
    for i in range(1, n+1, 1):
        # if num%30 == 0:
          #   time.sleep(1)
        # if num%50 == 0:
          #   time.sleep(2)
        raw_hash_id = re.findall('hash_id(.*)', content)
        hash_id = raw_hash_id[0][14:46]
        _xsrf = xsrf
        offset = 20*i
        params = json.dumps({"offset": offset, "order_by": "created", "hash_id": hash_id})
        payload = {"method":"next", "params": params, "_xsrf": _xsrf}
        click_url = 'http://www.zhihu.com/node/ProfileFolloweesListV2'
        data = s.post(click_url, data=payload, headers=header_info)
        # print data
        source = json.loads(data.content)
        for follower in source['msg']:
            soup1 = BeautifulSoup(follower, 'lxml')
            tag =soup1.a
            title = tag['title']    # username
            href = 'http://www.zhihu.com' + str(tag['href'])    # profile URL
            # Get the user's statistics
            num +=1
            print '%d   %f' % (num, num/float(total))
            # Alist = soup1.find_all(has_attrs)
            Alist = soup1.find_all('a', {'target': '_blank'})
            votes = Alist[0].string  # upvotes received
            tks = Alist[1].string  # thanks received
            ques = Alist[2].string  # questions asked
            ans = Alist[3].string  # questions answered
            # Print the followee's information
            string = title + '  ' + href + '  ' + votes + tks + ques + ans
            try:
                print string.decode('utf-8')
            except:
                print string.encode('gbk', 'ignore')
            f.write(string + '\n')
    f.close()
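
The page-count expression n = total/20-1 if total/20.0-total/20 == 0 else total/20 depends on Python 2 integer division and is easy to misread. As a hedged sketch in Python 3, the same offsets can be generated directly with range, or the count can be written as a ceiling division. The function names next_offsets and pages_after_first are hypothetical; the page size of 20 is taken from the listing.

```python
def next_offsets(total, page_size=20):
    """Offsets to request after the first page, which already holds page_size items."""
    return list(range(page_size, total, page_size))


def pages_after_first(total, page_size=20):
    """Number of extra requests, computed with integer ceiling division."""
    return max(0, -(-total // page_size) - 1)


if __name__ == '__main__':
    for total in (5, 20, 40, 45):
        print(total, pages_after_first(total), next_offsets(total))
```

For 45 followees this gives two extra requests at offsets 20 and 40, and for exactly 40 it gives one request at offset 20, matching what the original expression produces under Python 2.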

Fetching user avatars

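Next comes fetching user avatars. Given a user's unique ID, the function below parses that user's profile page, extracts the avatar URL, downloads the image and saves it to a local file named after the user's unique ID.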

def get_avatar(userId):
    url = 'https://www.zhihu.com/people/' + userId
    response = s.get(url, headers=header_info)
    response = response.content
    soup = BeautifulSoup(response, 'lxml')
    name = soup.find_all('span', {'class': 'name'})[1].string
    # print name
    temp = soup.find('img', {'alt': name})
    avatar_url = temp['src'][0:-6] + temp['src'][-4:]
    filename = 'pics/' + userId + temp['src'][-4:]
    f = open(filename, 'wb')
    f.write(requests.get(avatar_url).content)
    f.close()

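Combined with the other functions, this lets us fetch the avatars of everyone who upvoted a given answer, of all followers of a given big-V user, and so on.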
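
The slicing in get_avatar, temp['src'][0:-6] + temp['src'][-4:], drops the two characters in front of a four-character extension, and the function assumes that the pics folder already exists. The sketch below (Python 3) shows one more defensive way to express both steps. It assumes that the removed characters are an underscore plus a size letter such as '_l'; that URL shape, the example URL, and the helper names are assumptions made for illustration and are not confirmed by the original post.

```python
import os
import re
import tempfile


def full_size_avatar_url(src):
    """Drop a size suffix such as '_l' or '_m' that sits just before the extension."""
    return re.sub(r'_[a-z](\.\w+)$', r'\1', src)


def avatar_path(user_id, src, folder='pics'):
    """Build the local file path for the avatar, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    ext = os.path.splitext(src)[1]
    return os.path.join(folder, user_id + ext)


if __name__ == '__main__':
    src = 'https://pic1.example.com/abc123_l.jpg'   # hypothetical avatar URL
    print(full_size_avatar_url(src))
    print(avatar_path('marcovaldong', src, folder=tempfile.mkdtemp()))
```

Unlike fixed slicing, the regular expression leaves a URL unchanged when it has no size suffix, and os.path.splitext copes with extensions of any length.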

Fetching all answers to a question

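Given a question's unique ID, the function below fetches all answers under that question. Only the text of each answer is kept and images are skipped. Each answer is saved to its own txt file, named after the answer's author.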


def get_answer(questionID):
    url = 'http://www.zhihu.com/question/' + str(questionID)
    data = s.get(url, headers=header_info)
    soup = BeautifulSoup(data.content, 'lxml')
    # print str(soup).encode('gbk', 'ignore')
    title = soup.title.string.split('\n')[2]    # question title
    path = title
    if not os.path.isdir(path):
        os.mkdir(path)
    description = soup.find('div', {'class': 'zm-editable-content'}).strings    # question description, possibly several lines
    file_name = path + '/description.txt'
    fw = open(file_name, 'w')
    for each in description:
        each = each + '\n'
        fw.write(each)
    fw.close()
    # description = soup.find('div', {'class': 'zm-editable-content'}).get_text() # question description
        # .string returns None here (probably because of the line breaks); get_text() returns the text but loses the line breaks
    answer_num = int(soup.find('h3', {'id': 'zh-question-answer-num'}).string.split(' ')[0]) # number of answers
    num = 1
    index = soup.find_all('div', {'tabindex': '-1'})
    for i in range(len(index)):
        print ('Scraping the ' + str(num) + 'th answer......').encode('gbk', 'ignore')
        try:
            a = index[i].find('a', {'class': 'author-link'})
            title = str(num) + '__' + a.string
            href = 'http://www.zhihu.com' + a['href']
        except:
            title = str(num) + '__匿名使用者'    # "anonymous user"
        answer_file_name = path + '/' + title + '__.txt'
        fr = open(answer_file_name, 'w')
        try:
            answer_content = index[i].find('div', {'class': 'zm-editable-content clearfix'}).strings
        except:
            # Zhihu's notice for an answer that is hidden until its author revises it
            answer_content = ['作者修改內容通過後，回答會重新顯示。如果一週內未得到有效修改，回答會自動摺疊。']
        for content in answer_content:
            fr.write(content + '\n')
        fr.close()
        num += 1

    _xsrf = xsrf
    url_token = re.findall('url_token(.*)', data.content)[0][8:16]
    # Number of additional pages to load
    n = answer_num/10-1 if answer_num/10.0-answer_num/10 == 0 else answer_num/10
    for i in range(1, n+1, 1):
        # _xsrf = xsrf
        # url_token = re.findall('url_token(.*)', data.content)[0][8:16]
        offset = 10*i
        params = json.dumps({"url_token": url_token, "pagesize": 10, "offset": offset})
        payload = {"method":"next", "params": params, "_xsrf": _xsrf}
        click_url = 'https://www.zhihu.com/node/QuestionAnswerListV2'
        data = s.post(click_url, data=payload, headers=header_info)
        data = json.loads(data.content)
        for answer in data['msg']:
            print ('Scraping the ' + str(num) + 'th answer......').encode('gbk', 'ignore')
            soup1 = BeautifulSoup(answer, 'lxml')
            try:
                a = soup1.find('a', {'class': 'author-link'})
                title = str(num) + '__' + a.string
                href = 'http://www.zhihu.com' + a['href']
            except:
                title = str(num) + '__匿名使用者'    # "anonymous user"
            answer_file_name = path + '/' + title + '__.txt'
            fr = open(answer_file_name, 'w')
            try:
                answer_content = soup1.find('div', {'class': 'zm-editable-content clearfix'}).strings
            except:
                # Zhihu's notice for an answer that is hidden until its author revises it
                answer_content = ['作者修改內容通過後，回答會重新顯示。如果一週內未得到有效修改，回答會自動摺疊。']
            for content in answer_content:
                fr.write(content + '\n')
            fr.close()
            num += 1
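
get_answer uses the question title as a folder name and the author's name as a file name, so a title or name containing a character such as '/' or '?' could make the open() call fail. The following Python 3 sketch shows one possible way to guard against that and to make sure each file is closed. safe_name and save_answer are hypothetical helpers, and the set of replaced characters is an assumption based on common file-system restrictions.

```python
import os
import re
import tempfile


def safe_name(name, replacement='_'):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', replacement, name).strip()
    return cleaned or 'untitled'


def save_answer(folder, index, author, lines):
    """Write one answer to '<index>__<author>__.txt' and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, '%d__%s__.txt' % (index, safe_name(author)))
    # The with-block closes the file even if a write fails.
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + '\n')
    return path


if __name__ == '__main__':
    folder = os.path.join(tempfile.mkdtemp(), safe_name('What is a/b testing?'))
    print(save_answer(folder, 1, 'some/author', ['first line', 'second line']))
```

The file naming pattern follows the listing (index, two underscores, author, two underscores); only the sanitising step and the with-block are additions.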

Storing data in a database

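With the features above in place, the next step is to store user information in a database so that it is easy to read back and use. I have only just started with sqlite3, so for now I only store the user information in a table.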


def get_followeesInfo_toDB(userID):
    # Prepare the sqlite3 database; rows are added to the table as data is fetched
    conn = sqlite3.connect("Zhihu.db")
    curs = conn.cursor()
    curs.execute("create table if not exists userinfo(name TEXT, ID TEXT PRIMARY KEY, location TEXT, business TEXT, "
                 "gender TEXT, employment TEXT, position TEXT, education TEXT, major TEXT, "
                 "agree INTEGER, thanks INTEGER, asks INTEGER, answers INTEGER, posts INTEGER, "
                 "collections INTEGER, logs INTEGER, followees INTEGER, followers INTEGER, "
                 "scantime INTEGER)")
    followees_url = 'http://www.zhihu.com/people/' + userID + '/followees'
    file_name = userID + '.txt'
    f = open(file_name, 'w')
    data = s.get(followees_url, headers=header_info)
    print data  # prints <Response [200]> when the request succeeds
    content = data.content  # the raw HTML
    soup = BeautifulSoup(content, "lxml")  # parse the HTML
    # Get the number of followees
    totalsen = soup.select('span[class*="zm-profile-section-name"]')
    total = int(str(totalsen[0]).split(' ')[4])  # total number of followees
    print 'Total number of followees:'
    print total
    follist = soup.select('div[class*="zm-profile-card"]')  # list of followee cards
    num = 0  # counts which followee is being processed
    for follower in follist:
        tag = follower.a
        title = tag['title']  # username
        href = 'http://www.zhihu.com' + str(tag['href'])  # profile URL
        # Get the user's statistics
        num += 1
        print '%d   %f' % (num, num / float(total))
        # Alist = follower.find_all(has_attrs)
        Alist = follower.find_all('a', {'target': '_blank'})
        votes = Alist[0].string  # upvotes received
        tks = Alist[1].string  # thanks received
        ques = Alist[2].string  # questions asked
        ans = Alist[3].string  # questions answered
        # Print the followee's information
        string = title + '  ' + href + '  ' + votes + tks + ques + ans
        try:
            print string.decode('utf-8')
        except:
            print string.encode('gbk', 'ignore')
        f.write(string + '\n')
        if title != '[已重置]':  # "[已重置]" ("reset") is the name shown for a disabled account
            # Fetch this followee's basic information and store it in the table
            print 'Analysing the data of this user...'
            ID = href[28:]
            try:
                curs.execute("insert or ignore into userinfo values (?, ?, ?, ?, ?, ?, ?, "
                             "?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", get_userInfo(ID))
            except:
                print "This user account's state is abnormal..."
        else:
            print 'This user account has been disabled...'
        # print get_userInfo(ID)

    # Number of additional pages to load
    n = total / 20 - 1 if total / 20.0 - total / 20 == 0 else total / 20
    for i in range(1, n + 1, 1):
        # if num%30 == 0:
        #   time.sleep(1)
        # if num%50 == 0:
        #   time.sleep(2)
        raw_hash_id = re.findall('hash_id(.*)', content)
        hash_id = raw_hash_id[0][14:46]
        _xsrf = xsrf
        offset = 20 * i
        params = json.dumps({"offset": offset, "order_by": "created", "hash_id": hash_id})
        payload = {"method": "next", "params": params, "_xsrf": _xsrf}
        click_url = 'http://www.zhihu.com/node/ProfileFolloweesListV2'
        data = s.post(click_url, data=payload, headers=header_info)
        # print data
        source = json.loads(data.content)
        for follower in source['msg']:
            soup1 = BeautifulSoup(follower, 'lxml')
            tag = soup1.a
            title = tag['title']  # username
            href = 'http://www.zhihu.com' + str(tag['href'])  # profile URL
            # Get the user's statistics
            num += 1
            print '%d   %f' % (num, num / float(total))
            # Alist = soup1.find_all(has_attrs)
            Alist = soup1.find_all('a', {'target': '_blank'})
            votes = Alist[0].string  # upvotes received
            tks = Alist[1].string  # thanks received
            ques = Alist[2].string  # questions asked
            ans = Alist[3].string  # questions answered
            # Print the followee's information
            string = title + '  ' + href + '  ' + votes + tks + ques + ans
            try:
                print string.decode('utf-8')
            except:
                print string.encode('gbk', 'ignore')
            f.write(string + '\n')
            if title != '[已重置]':  # "[已重置]" ("reset") is the name shown for a disabled account
                # Fetch this followee's basic information and store it in the table
                print 'Analysing the data of this user...'
                ID = href[28:]
                try:
                    curs.execute("insert or ignore into userinfo values (?, ?, ?, ?, ?, ?, ?, "
                             "?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", get_userInfo(ID))
                except:
                    print "This user account's state is abnormal..."
            else:
                print 'This user account has been disabled...'
            # print get_userInfo(ID)
    f.close()
    conn.commit()
    conn.close()
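
The database part of this function can be tried on its own. Below is a minimal Python 3 sketch using an in-memory SQLite database and a three-column table instead of the 19 columns above; the function name save_users and the sample rows are made up for illustration. It shows the two ideas the listing relies on: a PRIMARY KEY on ID, and insert or ignore, so that a user reached through several followee lists is stored only once.

```python
import sqlite3


def save_users(conn, rows):
    """Insert (name, ID, agree) rows, skipping IDs that are already stored."""
    conn.execute('create table if not exists userinfo('
                 'name TEXT, ID TEXT PRIMARY KEY, agree INTEGER)')
    # The ? placeholders let sqlite3 do the quoting of each value.
    conn.executemany('insert or ignore into userinfo values (?, ?, ?)', rows)
    conn.commit()


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    save_users(conn, [('Alice', 'alice', 10), ('Bob', 'bob', 3)])
    save_users(conn, [('Alice again', 'alice', 99)])   # ignored: duplicate ID
    query = 'select name, ID, agree from userinfo order by agree desc'
    for row in conn.execute(query):
        print(row)
    conn.close()
```

Reading the data back is then an ordinary select, for example ordered by upvotes as in the demo.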

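Once I am more familiar with sqlite3, my next step is to collect a large amount of user information and follow relationships, and to try visualizing the follow relationships among big-V users. After that, the plan is to learn Scrapy, Python's crawling framework, and to scrape Weibo.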

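Also, while writing this post I re-tested the functions above. When I then visited Zhihu in Firefox, the site asked for a captcha "because this account has been accessing too frequently", so it seems Zhihu has started to restrict crawlers. That means we will need some anti-anti-crawler techniques, such as limiting the request rate. I will add more on this once I understand the topic more systematically.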
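
Limiting the request rate is only mentioned here as future work, so the following is not the author's solution. It is a small Python 3 sketch of one common approach, under the assumption that spacing requests out and retrying with a growing delay is enough: Throttle enforces a minimum interval between calls, and with_retries repeats a failing call with a doubling delay. Both names are hypothetical, and the clock and sleep functions are parameters so the behaviour can be checked without actually waiting.

```python
import time


class Throttle(object):
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay, then twice that, and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


if __name__ == '__main__':
    throttle = Throttle(0.2)
    for i in range(3):
        throttle.wait()   # a request such as s.get(...) would follow here
        print('request', i)
```

In the crawler, throttle.wait() would be called before every s.get or s.post, replacing the commented-out time.sleep lines in the listings.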
