Crawling Jobbole Articles (4): Saving the Crawl Results to MySQL

阿新 • Published: 2018-11-12

Item Pipeline

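After an Item has been collected in the Spider, it is passed to the Item Pipeline, whose components process the Item in the order in which they are defined.

Each Item Pipeline component is a Python class that implements a simple method, for example one that decides whether an Item is dropped or stored. Typical uses of an item pipeline include:

Validating the scraped data (checking that the item contains certain fields, such as a name field)
Checking for duplicates (and dropping them)
Saving the crawl results to a file or a database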
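
The first two uses above can be sketched in one small pipeline. This is a minimal illustration, not code from the project: the class name ValidateAndDedupPipeline and the choice of title as the duplicate key are assumptions, and the fallback DropItem class exists only so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch runs without Scrapy installed (assumption).
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class ValidateAndDedupPipeline(object):
    """Hypothetical pipeline: drops items with no title or a repeated title."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get("title")
        if not title:
            # Validation: the required field is missing or empty.
            raise DropItem("missing title")
        if title in self.seen_titles:
            # Duplicate check: this title was already processed.
            raise DropItem("duplicate title: %s" % title)
        self.seen_titles.add(title)
        return item
```

Raising DropItem stops the item from reaching later pipelines, while returning the item hands it on to the next one.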

Writing the item

Define it in items.py

import scrapy


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    praise_num = scrapy.Field()
    collect_num = scrapy.Field()
    comment_num = scrapy.Field()
    front_image_url = scrapy.Field()

Once it is defined, instantiate it in the article-extraction logic

    def parse_detail(self, response):
        # Note: this method uses re, so the spider module needs "import re"
        print("Currently crawling URL: " + response.url)

        # Article extraction logic
        article_item = JobBoleArticleItem()
        front_image_url = response.meta.get("front_image_url", "")
        # Get the article title
        title = response.css('.entry-header h1::text').extract()[0]
        # Get the publication date
        create_date = response.css('.entry-meta .entry-meta-hide-on-mobile::text').extract()[0].strip().replace("·", "").strip()
        # Get the number of likes
        praise_num = response.css('.vote-post-up h10::text').extract()[0]
        # Get the number of bookmarks
        collect_num = response.css('.post-adds .bookmark-btn::text').extract()[0].split(" ")[1]
        collect_match_re = re.match(r'.*?(\d+).*', collect_num)
        if collect_match_re:
            collect_num = int(collect_match_re.group(1))
        else:
            collect_num = 0
        # Get the number of comments
        comment_num = response.css('.post-adds .hide-on-480::text').extract()[0]
        comment_match_re = re.match(r'.*?(\d+).*', comment_num)
        if comment_match_re:
            comment_num = int(comment_match_re.group(1))
        else:
            comment_num = 0

        content = response.css('div.entry').extract()[0]

        article_item["title"] = title
        article_item["create_date"] = create_date
        article_item["praise_num"] = praise_num
        article_item["collect_num"] = collect_num
        article_item["comment_num"] = comment_num
        article_item["front_image_url"] = front_image_url

        yield article_item

Finally, once yield article_item is called, article_item is passed on to pipelines.py
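
The bookmark and comment counts in parse_detail are parsed with the same regular expression twice. As a possible tidy-up, that logic could be factored into one helper. The function name extract_first_int is hypothetical; the pattern is the one used in the spider above.

```python
import re


def extract_first_int(text, default=0):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.match(r".*?(\d+).*", text)
    if match:
        return int(match.group(1))
    return default


print(extract_first_int(" 2 收藏"))
print(extract_first_int(" 收藏"))
```

With such a helper, each count in the spider reduces to one call on the extracted text, and text without any digits falls back to 0 just as in the original if/else branches.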

Writing the pipelines

The pipelines.py file already contains a ready-made template, but for it to take effect you need to edit settings.py and uncomment ITEM_PIPELINES

Set a breakpoint in pipelines.py and debug to check whether article_item is actually passed in

How to save images locally

Next, keep working on the item. Scrapy provides built-in helpers that speed up development; edit settings.py as follows

import os

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, "images")

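'scrapy.pipelines.images.ImagesPipeline': 1: enables Scrapy's built-in image-saving pipeline. The number sets the order in which items flow through the pipelines; lower numbers run first.

IMAGES_URLS_FIELD = "front_image_url": names the item field from which the image URLs are read. The setting name IMAGES_URLS_FIELD itself is fixed.

project_dir = os.path.abspath(os.path.dirname(__file__)): gets the directory that contains settings.py, i.e. the project directory

IMAGES_STORE = os.path.join(project_dir, "images"): sets the directory where images are stored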
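
To make the ordering rule concrete, the snippet below sorts the same ITEM_PIPELINES dictionary by its numbers. It only illustrates the rule that lower values run first; it is not how Scrapy builds its pipeline chain internally.

```python
ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Lower values run first, so sort the entries by their priority number.
run_order = [path for path, priority
             in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

for path in run_order:
    print(path)
```

Here the image pipeline (1) comes before the project's own pipeline (300), so by the time the item reaches the latter the image download has already been handled.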

Now run our main script to see whether the images are saved

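It fails because the PIL module is missing. PIL is an image-handling library that we have not installed yet, hence the error. Install it in the virtual environment (the package to install is Pillow)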

(scrapyenv) E:\Python\Envs>pip install -i https://pypi.douban.com/simple pillow

After installing it, run the program again; this time a different error appears

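This happens because, when the item reaches the pipeline, the field configured below, front_image_url, is treated as a list, whereas our extraction logic stored a single value in it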

IMAGES_URLS_FIELD = "front_image_url"
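
The effect can be reproduced without Scrapy. Assuming the image pipeline loops over the field to create one download request per element, a bare string gets split into single characters, none of which is a valid URL. The URL below is a made-up placeholder.

```python
front_image_url = "http://example.com/cover.png"  # hypothetical URL

# Iterating over a bare string yields single characters, not URLs.
as_string = [x for x in front_image_url]

# Wrapping the value in a list yields the whole URL exactly once.
as_list = [x for x in [front_image_url]]

print(as_string[:4])
print(as_list)
```

This is why the fix is simply to wrap the value in a list before assigning it to the item.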

Modify the extraction logic

        article_item["title"] = title
        article_item["create_date"] = create_date
        article_item["praise_num"] = praise_num
        article_item["collect_num"] = collect_num
        article_item["comment_num"] = comment_num
        article_item["front_image_url"] = [front_image_url]

        yield article_item

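After the change, run the program again; the scraped images are now saved under the images folder

Now that the images are saved locally, can we get their paths and bind the local path to front_image_url in the item? To do that we define our own pipeline and override the item_completed method of ImagesPipeline.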

from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        pass

Then modify settings.py again so that it uses our custom image pipeline

ITEM_PIPELINES = {
    'EnterpriseSpider.pipelines.EnterprisespiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'EnterpriseSpider.pipelines.ArticleImagePipeline': 1,
}

Set a breakpoint and debug

Override the item_completed method

class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (ok, value) tuples, one per image URL;
        # for a successful download, value is a dict that includes "path"
        for ok, value in results:
            if ok:
                item["front_image_url"] = value["path"]
        return item
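
The shape of results is easier to see with sample data. The sketch below pulls the first saved path out of such a list; the helper name first_saved_path and the sample values are made up for illustration, while the (success, info) tuple layout with url, path and checksum keys follows what Scrapy's media pipelines pass to item_completed.

```python
def first_saved_path(results, default=None):
    """Return the local path of the first successfully downloaded image."""
    for ok, value in results:
        if ok:
            # On success, value is a dict describing the stored file.
            return value["path"]
    return default


# Hypothetical sample of what item_completed might receive.
sample_results = [
    (True, {"url": "http://example.com/cover.png",
            "path": "full/0a1b2c.jpg",
            "checksum": "abc123"}),
]

print(first_saved_path(sample_results))
```

Checking the ok flag matters because a failed download carries an error object instead of a dict, so indexing it with "path" would raise.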

Saving to JSON

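import codecs
import json


class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line, keeping non-ASCII text readable
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        # close_spider is the hook Scrapy calls on a pipeline when the spider closes
        self.file.close()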
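
The ensure_ascii=False argument is what keeps Chinese titles readable in article.json. A small standalone comparison, using a made-up item, shows the difference:

```python
import json

item = {"title": "伯樂在線", "praise_num": 3}  # hypothetical item contents

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the original characters are kept as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data; the option only changes how the file looks to a human reader, which is also why the file is opened with an explicit utf-8 encoding.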

Saving to MySQL

Synchronous save

import MySQLdb


class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('127.0.0.1', 'root', '123456', 'article', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = '''
        insert into jobbole(title,create_date,front_image_url,praise_num,collect_num,comment_num,url,url_object_id)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        '''

        self.cursor.execute(insert_sql, (item['title'], item['create_date'], item["front_image_url"],
                                         item["praise_num"], item["collect_num"], item["comment_num"],
                                         item["url"], item["url_object_id"]))
        self.conn.commit()
        # Return the item so that later pipelines still receive it
        return item

Asynchronous save

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted to turn the MySQL insert into an asynchronous insert
        query = self.dbpool.runInteraction(self.db_insert, item)
        query.addErrback(self.handler_error, item, spider)
        # Return the item so that later pipelines still receive it
        return item

    def handler_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def db_insert(self, cursor, item):
        insert_sql = '''
                    insert into jobbole(title,create_date,front_image_url,praise_num,collect_num,comment_num,url,url_object_id)
                    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
                '''

        cursor.execute(insert_sql, (item['title'], item['create_date'], item["front_image_url"],
                                         item["praise_num"], item["collect_num"], item["comment_num"],
                                         item["url"], item["url_object_id"]))
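
from_settings reads four keys from settings.py that are not shown above. A minimal sketch of those entries follows; the values simply mirror the connection parameters of the synchronous example and are assumptions to be replaced with your own.

```python
# settings.py: connection settings read by MysqlTwistedPipeline.from_settings
# (values copied from the synchronous example; adjust to your environment)
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article"
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"
```

The pipeline also has to be listed in ITEM_PIPELINES, in the same way as the other pipelines above, before Scrapy will call it.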
