Installing Scrapy and scraping data
-
Installing Scrapy
pip install Scrapy
If the installer complains about needing VS C++, it is probably because of the Twisted dependency. You can download a prebuilt wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/ , then open a command prompt in the download directory (type cmd in the address bar) and run pip install Twisted-18.7.0-cp37-cp37m-win_amd64.whl to install it.
The error "No module named 'win32api'" can be fixed with pip install pypiwin32.
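When the install steps above have all gone through, it can be handy to confirm the tricky Windows dependencies actually import before running Scrapy. A minimal sketch (the helper name check_scrapy_deps is my own, not part of Scrapy):

```python
def check_scrapy_deps():
    """Return the names of Scrapy's Windows-specific dependencies
    that cannot be imported, so you know which pip install is missing."""
    missing = []
    for mod in ("twisted", "win32api"):
        try:
            __import__(mod)  # attempt the import without otherwise using the module
        except ImportError:
            missing.append(mod)
    return missing

# e.g. ['win32api'] on a machine that still needs pypiwin32
print(check_scrapy_deps())
```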
-
Creating a new Scrapy project
In a local folder, type cmd in the address bar to open a command prompt, then run:
scrapy startproject XXX (where XXX is the project name)
The project folder (xxx) will contain the following files/directories:
scrapy.cfg        : the project's configuration file
xxx/              : the project's Python module; you will add your code in here
xxx/items.py      : the project's item definitions
xxx/pipelines.py  : the project's pipelines file
xxx/settings.py   : the project's settings file
xxx/spiders/      : the directory where spider code goes
You can then import the project into PyCharm.
-
Example: scraping jianshu.com
1. Define the item in items.py; it is roughly the equivalent of a Java entity class. At first I did not know it had to go in this file and wrote one elsewhere, and then it could not be found at runtime.
This file already contains a sample class; to make the differences easy to see, I am pasting the whole file. JsItem is the class we added.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class JsItem(scrapy.Item):
    # category
    leibie = scrapy.Field()
    # title
    biaoti = scrapy.Field()
    # body text
    zhengwen = scrapy.Field()
    # word count
    zishu = scrapy.Field()
    # view count
    yuedu = scrapy.Field()
    # comment count
    pinglun = scrapy.Field()
    # like count
    dianzan = scrapy.Field()
    # last edited time
    shijian = scrapy.Field()
    # author
    zuozhe = scrapy.Field()
    # custom id
    zid = scrapy.Field()
    # original article URL
    yuanwen = scrapy.Field()
I never managed to scrape the view count, comment count, and like count fields in the end, so they can be ignored.
2. Write the spider file under the spiders folder
# -*- coding:utf-8 -*-
import uuid

import scrapy

from quotes.items import JsItem
# from . import jianshuDic


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'jianshu'
    start_urls = [
        'https://www.jianshu.com/c/V2CqjW',
        # 'https://www.jianshu.com/c/fcd7a62be697',
        # 'https://www.jianshu.com/c/8c92f845cd4d', 'https://www.jianshu.com/c/yD9GAd',
        # 'https://www.jianshu.com/c/1hjajt', 'https://www.jianshu.com/c/cc7808b775b4',
        # 'https://www.jianshu.com/c/7b2be866f564', 'https://www.jianshu.com/c/5AUzod',
        # 'https://www.jianshu.com/c/742422443ad3', 'https://www.jianshu.com/c/vHz3Uc',
        # 'https://www.jianshu.com/c/70b8514fb442', 'https://www.jianshu.com/c/NEt52a',
        # 'https://www.jianshu.com/c/bd38bd199ec6', 'https://www.jianshu.com/c/accb04610749',
        # 'https://www.jianshu.com/c/dqfRwQ', 'https://www.jianshu.com/c/qqfxgN', 'https://www.jianshu.com/c/xYuZYD',
        # 'https://www.jianshu.com/c/263e0ef8c3c3', 'https://www.jianshu.com/c/6fba5273f339',
        # 'https://www.jianshu.com/c/ad41ba5abc09', 'https://www.jianshu.com/c/f6b4ca4bb891',
        # 'https://www.jianshu.com/c/e50258a6a44b', 'https://www.jianshu.com/c/Jgq3Wc', 'https://www.jianshu.com/c/LLCyGH'
    ]
    # https://blog.csdn.net/u014271114/article/details/53082676/
    # https://www.tuicool.com/articles/jyQF32V
    # https://www.jianshu.com/p/acdf9740ec79
    # saving to a database: https://www.jianshu.com/p/acdf9740ec79

    def parse(self, response):
        for d in response.xpath('//ul[@class="note-list"]/li'):
            # get the article's link
            pageurl = d.xpath('a/@href').extract_first()
            if pageurl is not None:
                link = 'http://www.jianshu.com' + pageurl
                # item = self.load_item(response)
                item = JsItem()
                item['leibie'] = response.xpath('//a[@class="name"]/text()').extract_first()
                item['yuanwen'] = link
                # print(item)
                yield scrapy.Request(link, meta={'item': item}, callback=self.parse_item)

    def parse_item(self, response):
        item = response.meta['item']
        item['biaoti'] = response.xpath('//h1[@class="title"]/text()').extract_first()
        item['shijian'] = response.xpath('//span[@class="publish-time"]/text()').extract_first()
        item['yuedu'] = response.xpath('//span[@class="views-count"]/text()').extract_first()
        item['zishu'] = response.xpath('//span[@class="wordage"]/text()').extract_first()
        item['pinglun'] = response.xpath('//span[@class="comments-count"]/text()').extract_first()
        item['dianzan'] = response.xpath('//span[@class="likes-count"]/text()').extract_first()
        item['zuozhe'] = response.xpath('//span[@class="name"]/a/text()').extract_first()
        # custom id: https://www.cnblogs.com/dkblog/archive/2011/10/10/2205200.html
        item['zid'] = str(uuid.uuid1()).replace('-', '')
        # body text
        zw = ''
        # zws = response.xpath('//div[@class="show-content-free"]/*').extract()
        zws = response.xpath('//div[@class="show-content-free"]/descendant::p/descendant::text()').extract()
        for i in zws:
            zw += i + '\n'
        item['zhengwen'] = zw
        return item
I commented out most of the URLs here; the data volume was too large for testing, so I kept only one. When you scroll to the bottom, the site loads the next page; I have not handled that yet. The commented-out URLs above are some references I consulted and are worth a look. Note the difference between yield and return.
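The yield/return distinction mentioned above can be seen with a plain Python sketch, no Scrapy needed: a function that uses yield becomes a generator and can hand back many values over time, which is why parse can emit one Request per list item, while return ends the call with a single value.

```python
def with_return(items):
    # return hands back one value and ends the call immediately
    return items[0]

def with_yield(items):
    # yield turns the function into a generator: each value is handed
    # back lazily and the loop resumes afterwards -- this is how
    # parse() can emit one Request per <li> in the listing
    for i in items:
        yield i

print(with_return([1, 2, 3]))       # 1
print(list(with_yield([1, 2, 3])))  # [1, 2, 3]
```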
In the parse method we get the link to the article detail page and then make a further request. As https://www.tuicool.com/articles/jyQF32V describes (e.g. a blog or forum where the current page has the title, summary, and URL, while the detail page has the full content), this is exactly that situation: at the end of parse we issue a Request for the detail page, and via meta and the callback, parse_item receives the response of this second request and fills in the item. (It took me a long while to wrap my head around this, and then it inexplicably just worked, so my explanation may be a bit convoluted.)
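As for the next-page loading that I have not handled yet, one common approach is to request the listing URL again with an incrementing page parameter. A rough sketch, under the unverified assumption that jianshu's collection pages accept a ?page=N query parameter (next_page_url is a hypothetical helper of mine, not a Scrapy API):

```python
def next_page_url(base_url, page):
    # Build the URL of page N of a collection listing.
    # Assumption (unverified): the site paginates via a ?page=N parameter.
    return '%s?page=%d' % (base_url, page)

# Inside parse() one could then yield a follow-up request, e.g.:
#   yield scrapy.Request(next_page_url(response.url, page + 1), callback=self.parse)
print(next_page_url('https://www.jianshu.com/c/V2CqjW', 2))
# https://www.jianshu.com/c/V2CqjW?page=2
```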
How to process the body text depends on how the data will be used later, which determines whether to keep the tags, images, and so on.
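The custom id assigned in parse_item is just a time-based UUID with the dashes stripped; a quick standalone check of what that expression produces:

```python
import uuid

# Same expression as in the spider: a time-based UUID with dashes removed
zid = str(uuid.uuid1()).replace('-', '')
print(len(zid))  # 32 -- always 32 hex characters, unique per call

# uuid1 embeds the clock and the machine's MAC address; uuid4 is purely
# random and avoids leaking the MAC, which may be preferable
alt = uuid.uuid4().hex  # .hex is a built-in shortcut for the same dash-free form
```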
3. Configure settings
This essentially makes our requests look like they come from a browser.
Add the request headers:
# change this to False
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'accept': 'image/webp,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.8',
    'referer': 'https://www.jianshu.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
}
4. Writing to the database
Create the table in the database first; nothing special there. I used a local MySQL instance.
Set up the database logic in pipelines.py; for reference see
https://www.jianshu.com/p/acdf9740ec79
Again, pay attention to the class name, or Scrapy will not find it.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


def dbHandle():
    conn = pymysql.connect(
        host="localhost",
        user="root",
        passwd="root",
        charset="utf8",
        use_unicode=False
    )
    return conn


class QuotesPipeline(object):
    def process_item(self, item, spider):
        return item


class jianshuDB(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        cursor.execute("USE sale")
        sql = "INSERT INTO t_jianshu(zid,leibie,biaoti,zhengwen,zishu,shijian,zuozhe,yuanwen) VALUES(%s,%s,%s,%s,%s,%s,%s,%s)"
        try:
            cursor.execute(sql, (item['zid'], item['leibie'], item['biaoti'], item['zhengwen'],
                                 item['zishu'], item['shijian'], item['zuozhe'], item['yuanwen']))
            cursor.connection.commit()
        except BaseException as e:
            print("error is here>>>>>>>>>>>>>", e, "<<<<<<<<<<<<<error is here")
            dbObject.rollback()
        return item
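The %s placeholders above are pymysql's parameterized-query syntax: the driver escapes the values, so quotes in a scraped title or body cannot break the SQL. The same pattern can be tried without a MySQL server by using Python's built-in sqlite3 as a stand-in (sqlite3 uses ? instead of %s); the table and column names below just mirror the ones used above:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE t_jianshu (zid TEXT, biaoti TEXT, zhengwen TEXT)")

# ? placeholders play the role of pymysql's %s; the driver escapes the
# values, so an apostrophe in the scraped text cannot break the statement
item = {'zid': 'abc123', 'biaoti': "a title with 'quotes'", 'zhengwen': 'body text'}
cur.execute("INSERT INTO t_jianshu(zid,biaoti,zhengwen) VALUES(?,?,?)",
            (item['zid'], item['biaoti'], item['zhengwen']))
conn.commit()

print(cur.execute("SELECT biaoti FROM t_jianshu").fetchone()[0])
# a title with 'quotes'
```

One further note: the pipeline above opens a new connection for every item, which works but is wasteful; Scrapy pipelines can open the connection once in open_spider and close it in close_spider instead.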
Then register it in settings.py:
# output to the database
ITEM_PIPELINES = {
    'quotes.pipelines.jianshuDB': 300,
}
5. I almost forgot to explain how to run it in PyCharm
You can actually run it without the database output, too. The scrapy.cfg file sits in the project root; create a new .py file in that root directory, named anything you like, say start.py, with the following content, and just run that file.
# -*- coding:utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl jianshu".split())
Here 'scrapy crawl' is fixed; 'jianshu' corresponds to the name defined in the spider file from step 2.
There are still some issues I have not handled properly; I am recording things here for now.