Scraping the First Post on Each User's Home Page from Sina Weibo Mobile with Scrapy

阿新 • Published: 2019-05-12

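Hi everyone, this is my first update this month.

I recently started an internship that involves web scraping and needs fairly large amounts of data. It quickly became clear that crawlers built from my own hand-written functions were too slow, so I went back to Scrapy. I had studied it a little before, and this was a good chance to practice by scraping Weibo and then turning part of the data into word clouds.

This time the target is the mobile version of Sina Weibo (https://m.weibo.cn/). The data collected is the first post on each user's Weibo home page (see the figure below): the text, the repost count, the comment count, the like count and the publication time, plus the user's name and region (which later makes it possible to analyze which hot topics Weibo users in different regions care about).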

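1. Analyzing the pages

Getting the entry URL for user posts

Browsing the site shows that the pages are rendered with Ajax, and the post data (https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_5088_-_ctg1_5088&openApp=0&since_id=1) is served as JSON. So the plan is to get each user's URL from the post data first (see the figure below) and then crawl the rest from there.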
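The ID lookup described above can be sketched in a few lines. This is only a minimal example: it assumes the feed JSON has the data → cards → mblog → user → id layout that the spider later in this post relies on, and both the sample payload and the helper name extract_user_ids are made up for illustration.

```python
import json

# Hypothetical, heavily trimmed sample of the feed JSON.
# Real responses carry many more fields per card.
SAMPLE_FEED = '''
{"data": {"cards": [
    {"mblog": {"user": {"id": 1111111111, "screen_name": "user_a"}}},
    {"card_type": 11},
    {"mblog": {"user": {"id": 2222222222, "screen_name": "user_b"}}}
]}}
'''


def extract_user_ids(feed_text):
    """Return the user IDs (as strings) of every card that contains a post."""
    payload = json.loads(feed_text)
    ids = []
    for card in payload.get("data", {}).get("cards", []):
        user = card.get("mblog", {}).get("user", {})
        if "id" in user:  # cards without a post are skipped
            ids.append(str(user["id"]))
    return ids


print(extract_user_ids(SAMPLE_FEED))
```

Cards that carry no post (the second one in the sample) are simply skipped, which is the same situation the try/except blocks in the spider guard against.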

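Getting the first post's data

This page is also rendered with Ajax, so as before it is just a matter of finding the entry URL. The request URL is as follows:

At first glance the URL shows no pattern at all, but after trimming it down, https://m.weibo.cn/api/container/getIndex?containerid=1076032554757470 turns out to be enough. The pattern is containerid=107603(***), where the digits in the parentheses are exactly the user's ID, so the URL can be built from the ID.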
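As a small sketch of that rule, the post-list URL can be assembled from a user ID like this. The function name timeline_url and the constant API_BASE are names chosen for this example; only the 107603 prefix and the base URL come from the analysis above.

```python
from urllib.parse import urlencode

API_BASE = "https://m.weibo.cn/api/container/getIndex"


def timeline_url(user_id):
    """Build the post-list URL: containerid is '107603' followed by the user ID."""
    query = urlencode({"containerid": "107603" + str(user_id)})
    return API_BASE + "?" + query


print(timeline_url(2554757470))
```

Called with the ID from the example above, this reproduces the simplified URL shown in the text.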

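Getting the user's region

The user's location is in their basic profile, as shown in the figure below.

The address is:

Simplifying it the same way gives https://m.weibo.cn/api/container/getIndex?containerid=230283(***)_-_INFO, where the parentheses again stand for the user ID.

From the analysis above, getting the user ID is the key to this crawl: once the ID is used to build the URLs, the rest is relatively simple. Next comes the coding part.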
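The profile URL follows the same idea, and once the JSON is loaded the location can be looked up by key instead of by regular expression. The sketch below rests on assumptions: the field names item_name and item_content are my guess at the layout of each card_group entry, the sample data is invented, and profile_info_url and find_location are hypothetical helper names.

```python
def profile_info_url(user_id):
    """Build the profile URL: containerid is '230283' + user ID + '_-_INFO'."""
    return ("https://m.weibo.cn/api/container/getIndex"
            "?containerid=230283{}_-_INFO".format(user_id))


# Invented sample of a card_group list; the key names are assumptions.
SAMPLE_CARD_GROUP = [
    {"card_type": 41, "item_name": "暱稱", "item_content": "user_a"},
    {"card_type": 41, "item_name": "所在地", "item_content": "四川 成都"},
]


def find_location(card_group):
    """Return the location with spaces removed, or None if it is not listed."""
    for entry in card_group:
        if entry.get("item_name") == "所在地":
            return entry.get("item_content", "").replace(" ", "")
    return None


print(profile_info_url(2554757470))
print(find_location(SAMPLE_CARD_GROUP))
```

If the real response uses different key names, only the two string literals inside find_location need to change.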

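2. Writing the crawler

Note: if you repost the code, please credit the source.

First, create the Scrapy project and the spider from the command line.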

scrapy startproject sinaweibo

scrapy genspider xxx(spider name) xxx(domain)

Define the item fields in items.py

import scrapy


class SinaweiboItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()        # user name
    first_news = scrapy.Field()  # first post
    dates = scrapy.Field()       # publication time
    zhuanzai = scrapy.Field()    # repost count
    comment = scrapy.Field()     # comment count
    agree = scrapy.Field()       # like count
    city = scrapy.Field()        # region

Write the spider code

# -*- coding: utf-8 -*-
import json
import re

import scrapy

from sinaweibo.items import SinaweiboItem


class WeibodiyuSpider(scrapy.Spider):
    name = 'weibodiyu'  # spider name
    allowed_domains = ['m.weibo.cn']  # only crawl within this domain
    start_urls = [
        'https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_4188_-_ctg1_4188&openApp=0&since_id=1'
    ]

    def parse1(self, response):
        """Extract the first post from a user's post list."""
        infos = json.loads(response.body)  # parse the JSON body
        item = response.meta['item']  # item handed over through meta
        try:
            mblog = infos["data"]["cards"][0]["mblog"]  # first card = first post
            item['name'] = mblog["user"]["screen_name"]  # user name
            # Keep only runs of Chinese characters, which drops HTML markup and other noise
            item['first_news'] = re.findall('([\u4e00-\u9fa5]+)', str(mblog["text"]), re.S)
            item['dates'] = mblog["created_at"]  # publication time
            item['zhuanzai'] = mblog["reposts_count"]  # repost count
            item['comment'] = mblog["comments_count"]  # comment count
            item['agree'] = mblog["attitudes_count"]  # like count
            item['city'] = response.meta['city']  # region passed in from parse2
            return item
        except (IndexError, KeyError):  # a tuple is required to catch both exception types
            pass

    def parse2(self, response):
        """Extract the user's region from the profile page, then request the post list."""
        infos = json.loads(response.body)
        try:
            item = response.meta['item']  # item handed over through meta
            city_cont = str(infos["data"]["cards"][1]["card_group"])
            city = re.findall('card_type.*?所在地.*?item.*?:(.*?)}]', city_cont, re.S)[0].replace('\'', '').replace(
                ' ', '')  # region
            ids = response.meta['ids']  # user ID passed in from parse
            n_url1 = 'https://m.weibo.cn/api/container/getIndex?containerid=107603' + ids
            yield scrapy.Request(n_url1, meta={'item': item, 'city': city}, callback=self.parse1)  # continue in parse1
        except (IndexError, KeyError):
            pass

    def parse(self, response):
        """Collect user IDs from a feed page and queue further feed pages."""
        datas = json.loads(response.body)
        for i in range(0, 20):
            try:
                ids = str(datas["data"]["cards"][i]["mblog"]["user"]["id"])  # user ID
            except (IndexError, KeyError):
                continue
            item = SinaweiboItem()  # a fresh item per user; one shared item would be overwritten
            n_url2 = 'https://m.weibo.cn/api/container/getIndex?containerid=230283{}_-_INFO'.format(ids)
            yield scrapy.Request(n_url2, meta={'item': item, 'ids': ids}, callback=self.parse2)  # continue in parse2
        social_urls = [
            'https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_4188_-_ctg1_4188&openApp=0&since_id={}'.format(
                i) for i in range(2, 100)]
        celebritys_urls = [
            'https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_4288_-_ctg1_4288&openApp=0&since_id={}'.format(
                j) for j in range(1, 100)]
        hots_urls = ['https://m.weibo.cn/api/container/getIndex?containerid=102803&openApp=0&since_id={}'.format(t)
                     for t in range(1, 100)]
        urls = celebritys_urls + social_urls + hots_urls  # entry URLs
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
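The regular expression used in parse1 is worth a quick look on its own. The sketch below applies the same character range, [\u4e00-\u9fa5], to an invented post text; the helper name keep_chinese and the sample string are made up for this example.

```python
import re

# Same character range as in parse1: one or more consecutive Chinese characters.
HAN_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Return every run of Chinese characters found in text, in order."""
    return HAN_RUN.findall(text)


# Invented sample resembling a post body with embedded HTML.
sample = '今天天氣不錯<br/><a href="/n/abc">@abc</a>一起去爬山吧!'
print(keep_chinese(sample))
```

The result is a list of fragments rather than one string, and anything outside the range (digits, Latin letters, punctuation, emoji) is dropped along with the HTML, so the stored text is a cleaned approximation of the post, not a faithful copy.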

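Pay attention here to the meta argument of scrapy.Request. It is a dictionary used to pass values between callbacks: as the code above shows, to use the user ID scraped in parse() inside parse2(), the ID has to be put into meta. I won't explain much more, since my own understanding is still at the feel-my-way-across stage; readers who want to dig deeper can look it up themselves.

Configure the crawler in settings.py

This time I only exported the data to a CSV file, which is convenient for filtering later when making the word clouds. If you scrape more data, you can store it in a database instead.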
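One detail about meta that is easy to miss: the dictionary carries a reference to whatever object is put in it, not a copy. The sketch below uses plain Python dictionaries rather than Scrapy itself (fake_request is a stand-in invented for this example) to show why one item object shared by several pending requests ends up overwritten, and why a separate item per request avoids that.

```python
import copy


def fake_request(url, meta):
    """Stand-in for a queued request: it only remembers its URL and meta dict."""
    return {"url": url, "meta": meta}


regions = ("北京", "上海")

# Case 1: one item shared by every request.
shared_item = {"city": None}
shared = [fake_request("user/%d" % n, {"item": shared_item}) for n in range(2)]
for req, region in zip(shared, regions):
    req["meta"]["item"]["city"] = region  # each callback writes to the same object
shared_cities = [req["meta"]["item"]["city"] for req in shared]

# Case 2: every request gets its own copy of the item.
template = {"city": None}
separate = [fake_request("user/%d" % n, {"item": copy.deepcopy(template)}) for n in range(2)]
for req, region in zip(separate, regions):
    req["meta"]["item"]["city"] = region
separate_cities = [req["meta"]["item"]["city"] for req in separate]

print(shared_cities)    # both entries show the last value written
print(separate_cities)  # each entry keeps its own value
```

Copying plain strings such as the user ID is harmless but unnecessary, since strings are immutable; it is the mutable item that needs to be unique per request.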

BOT_NAME = 'sinaweibo'

SPIDER_MODULES = ['sinaweibo.spiders']
NEWSPIDER_MODULE = 'sinaweibo.spiders'

# Browser user agent sent with each request (assigned with "=", not ":")
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
DOWNLOAD_DELAY = 0.5  # wait 0.5 s between requests
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sinaweibo (+http://www.yourdomain.com)'
FEED_URI = 'file:C:/Users/lenovo/Desktop/weibo.csv'  # output file location
FEED_FORMAT = 'csv'  # output format
ITEM_PIPELINES = {'sinaweibo.pipelines.SinaweiboPipeline': 300}  # pipeline settings
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
FEED_EXPORT_ENCODING = 'UTF8'  # output encoding
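A note for readers on newer Scrapy releases: to my knowledge, FEED_URI and FEED_FORMAT were deprecated from version 2.1 onward in favour of a single FEEDS dictionary. A rough equivalent of the export settings above might look like the fragment below; the relative path weibo.csv is a placeholder, so substitute your own output location.

```python
# settings.py fragment for newer Scrapy versions (assumed 2.1+).
# 'weibo.csv' is a placeholder path, relative to where the crawl is started.
FEEDS = {
    'weibo.csv': {
        'format': 'csv',     # replaces FEED_FORMAT
        'encoding': 'utf8',  # per-feed counterpart of FEED_EXPORT_ENCODING
    },
}
```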

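No images or other content were downloaded this time, so pipelines.py was left as generated. Part of the scraped data is shown below:

That is the end of the crawling part; what was scraped this time is fairly simple. Next, part of the data is used to generate word clouds.

Making the word clouds

I created a new file named weibo_analysis.py in the project and used the jieba library for word segmentation. Before that, the required data has to be extracted, and pandas is enough for that.

This part of the program is simple, so without further ado, here is the code: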

import jieba.analyse
import pandas as pd


def get_ciyun(city):
    """Extract the top 100 keywords and print them as 'word<TAB>weight' lines."""
    tags = jieba.analyse.extract_tags(str(city), topK=100, withWeight=True)
    for item in tags:
        print(item[0] + '\t' + str(int(item[1] * 1000)))


need_citys = ['北京', '上海', '湖南', '四川', '廣東']
beijing = []
shanghai = []
hunan = []
sichuan = []
gd = []
pd.set_option('expand_frame_repr', True)  # allow wrapped display
pd.set_option('display.max_rows', None)  # show all rows
pd.set_option('display.max_columns', None)  # show all columns
# Raw string, so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'C:\Users\lenovo\Desktop\weibo.csv')  # load the file into a DataFrame

contents = df['first_news']  # post text
city = df['city']  # region
for i in range(len(city)):
    region = str(city[i])  # str() guards against empty (NaN) cells
    if need_citys[0] in region:  # sort each post into its region's list
        beijing.append(contents[i])
    elif need_citys[1] in region:
        shanghai.append(contents[i])
    elif need_citys[2] in region:
        hunan.append(contents[i])
    elif need_citys[3] in region:
        sichuan.append(contents[i])
    elif need_citys[4] in region:
        gd.append(contents[i])

# Output
get_ciyun(beijing)
print('-' * 20)
get_ciyun(shanghai)
print('-' * 20)
get_ciyun(hunan)
print('-' * 20)
get_ciyun(sichuan)
print('-' * 20)
get_ciyun(gd)

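The word clouds were made on the Tagul website: import the word frequencies printed above, choose the cloud's shape, font (if Chinese is not supported, you can import a Chinese font yourself), colors and so on, then click Visualize to generate it. Very convenient.

Below are the word cloud images I generated this time:

Using Scrapy really does make crawling far more efficient. Along the way I also ran into plenty of problems: I do not know the framework deeply enough, there are many methods I cannot use or have never heard of, and my grasp of object-oriented Python is not solid either, so I need to keep learning. It also shows that I am still just a rookie.

If you have questions or suggestions, corrections are welcome.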
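For reference, the word-frequency lines that get pasted into the site are just a word, a tab and an integer weight. The sketch below builds that text from (word, weight) pairs of the kind jieba's extract_tags returns with withWeight=True; the function name format_frequencies and the sample pairs are invented for this example, and the exact import format the site expects is an assumption on my part.

```python
def format_frequencies(tags, scale=1000):
    """Turn (word, weight) pairs into 'word<TAB>integer' lines, one per word."""
    lines = []
    for word, weight in tags:
        # Scale the fractional weight up and truncate it to an integer.
        lines.append('{}\t{}'.format(word, int(weight * scale)))
    return '\n'.join(lines)


# Invented sample weights; real values come from the keyword extraction step.
sample_tags = [('微博', 0.5), ('爬蟲', 0.25)]
print(format_frequencies(sample_tags))
```

Returning a string instead of printing directly makes it easy to write the result to a text file and import that file in one go.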
