Part 8: Writing a spider to crawl all jobbole articles

阿新 • Published: 2017-10-03


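With Scrapy's Request and parse callbacks, crawling the article information on every list page is straightforward.

PS: parse.urljoin(response.url, post_url) has a handy property: if post_url is already a full URL including the domain, it is returned as is and nothing from response.url is prepended; if it is only a partial path, it is joined onto response.url.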
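The behaviour described above can be checked without running a crawl. The following is a minimal sketch using only the standard library; the base URL is the list page from this article, and the three href values are illustrative inputs, not values taken from the live site.

```python
from urllib import parse

# Base URL of the list page (the start URL used in this article).
base = "http://blog.jobbole.com/tag/linux/"

# An absolute href is returned unchanged.
absolute = parse.urljoin(base, "http://blog.jobbole.com/112535/")

# A root-relative href replaces the path of the base URL.
root_relative = parse.urljoin(base, "/112535/")

# A bare relative href is resolved against the directory of the base URL.
relative = parse.urljoin(base, "112535/")

print(absolute)
print(root_relative)
print(relative)
```

Note that a bare relative href ends up under the list page's own path, which may not be the article URL you expect, so it is worth checking which form the site actually uses.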

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
# Python 3 name; in Python 2 this was a top-level module: import urlparse
from urllib import parse

class JobboleSpider(scrapy.Spider):
    # Spider name
    name = 'jobbole'
    # Domains the spider is allowed to crawl
    allowed_domains = ['blog.jobbole.com']
    # URL(s) the crawl starts from
    start_urls = ['http://blog.jobbole.com/tag/linux/']

    def parse(self, response):
        """
        Collect the article URLs from a list page and follow the pagination.
        :param response:
        :return:
        """
        blog_url = response.css(".floated-thumb .post-meta .read-more a::attr(href)").extract()
        for post_url in blog_url:
            # On the jobbole list pages the href is usually a full URL such as
            # "http://blog.jobbole.com/112535/", but it may also hold only a
            # partial path; urljoin from urllib.parse handles both cases.
            print(post_url)
            # Request takes the URL to fetch and the callback that parses the response.
            yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)

        # Next-page URL: looked up once per list page, outside the loop above.
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        """
        Parse an article detail page.
        :param response:
        :return:
        """
        ret_str = response.xpath('//*[@class="dht_dl_date_content"]')
        title = response.css("div.entry-header h1::text").extract_first()
        # extract_first("") avoids an AttributeError when the selector matches nothing
        create_date = response.css("p.entry-meta-hide-on-mobile::text").extract_first("").strip().replace("·", "").strip()
        # Note: this XPath is tied to a single post id (post-112239)
        content = response.xpath("//*[@id='post-112239']/div[3]/div[3]/p[1]")

Items

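The main goal of scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to mistype a field name or return inconsistent data, especially in a larger project with many spiders.

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

Various Scrapy components use the extra information provided by Items: exporters look at the declared fields to work out which columns to export, serialization can be customized using Item field metadata, trackref tracks Item instances to help find memory leaks (see "Debugging memory leaks with trackref"), and so on.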

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class JoBoleArticleItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Creation date
    create_date = scrapy.Field()
    # Article URL
    url = scrapy.Field()
    # URLs vary in length, so store a fixed-length MD5 digest as well
    url_object_id = scrapy.Field()
    # Cover image URL
    front_image_url = scrapy.Field()
    # Local path of the downloaded cover image
    front_image_path = scrapy.Field()
    # Number of likes
    praise_num = scrapy.Field()
    # Number of comments
    comment_num = scrapy.Field()
    # Number of bookmarks
    fav_num = scrapy.Field()
    # Tags
    tags = scrapy.Field()
    # Content
    content = scrapy.Field()
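To see why declared fields help, here is a small standard-library sketch. StrictItem and ArticleSketch are hypothetical stand-ins written for this illustration; they are not Scrapy's implementation, they only mimic the one behaviour that matters here, namely that assigning to an undeclared field raises KeyError.

```python
class StrictItem(dict):
    """Hypothetical stand-in for scrapy.Item: only declared fields can be assigned."""

    fields = ()  # subclasses list their allowed field names here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class ArticleSketch(StrictItem):
    fields = ("title", "url", "url_object_id")


item = ArticleSketch()
item["title"] = "An example title"

try:
    item["titel"] = "typo in the field name"  # misspelled on purpose
    typo_rejected = False
except KeyError:
    typo_rejected = True

plain = {}
plain["titel"] = "a plain dict accepts the typo silently"

print(typo_rejected, dict(item), plain)
```

The sketch only guards item[key] = value; dict.update and the constructor bypass the check, which is acceptable for an illustration but is one reason to use the real Item class in a project.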

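Scrapy ships with built-in support for downloading files, images and so on. You can see what is available by browsing the Scrapy source:

PS: in your own project, the code that stores scraped data lives in pipelines.py under the project directory; the built-in download classes are likewise found in the pipelines directory of the Scrapy source, as the screenshot below shows: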

[Screenshot: the pipelines directory in the Scrapy source]

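Next, register the built-in image pipeline class in settings.py by adding it to the ITEM_PIPELINES dict. Scrapy generates this dict in settings.py, commented out by default. The number after each entry is its priority: the smaller the value, the earlier the pipeline is called. Then configure which Item field holds the image URL:

IMAGES_URLS_FIELD = 'front_image_url'

IMAGES_URLS_FIELD is a fixed setting name; front_image_url is the name of the Item field.

IMAGES_STORE sets the directory where the images are saved.

PS: to save images, Python needs one extra library installed first: pillow.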
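Put together, the relevant part of settings.py might look like the sketch below. The priority value 1 and the "images" directory name are assumptions chosen for illustration; the setting names and the pipeline class path are Scrapy's own.

```python
import os

# Enable Scrapy's built-in image pipeline; lower numbers run earlier.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Item field that holds the image URLs. ImagesPipeline iterates over this
# field, so the spider should store a list, e.g. [front_image_url].
IMAGES_URLS_FIELD = "front_image_url"

# Assumption: images go into an "images" directory under the directory
# the crawl is started from.
IMAGES_STORE = os.path.abspath("images")
```

Many projects build IMAGES_STORE from the location of settings.py instead, so that the path does not depend on where the crawl is launched.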


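After the images are saved, you will notice that Scrapy names the files automatically. If you do not want these names, for example if you would rather use the article's path, you can customize this in pipelines.py.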

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline

class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item

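# Customize image downloading by subclassing Scrapy's built-in ImagesPipeline and overriding some of its methods
class ArticleImagePipeline(ImagesPipeline):
    # Per the ImagesPipeline source, this method receives the download results, including the stored file names
    def item_completed(self, results, item, info):
        pass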

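The item_completed method above comes from ImagesPipeline, and we need to override it. To find out what its arguments contain, inspect them in the PyCharm debugger:

[Screenshot: PyCharm debugger showing the arguments of item_completed]

results holds 2-tuples: the first element of each tuple is the download status, and the second is a nested dict whose path entry is the value we want.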

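# Customize image downloading by subclassing Scrapy's built-in ImagesPipeline and overriding some of its methods
class ArticleImagePipeline(ImagesPipeline):
    # Called once all image requests for an item have finished
    def item_completed(self, results, item, info):
        for ok, value in results:
            # value is a dict only for successful downloads, so skip failures
            if ok:
                # Store the local path in front_image_path so the original URL is not overwritten
                item['front_image_path'] = value['path']
        return item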
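The loop logic can be tried out without running a crawl by feeding it a hand-built list. In the sketch below every value in sample_results is made up for illustration, successful_paths is a hypothetical helper, and a plain Exception stands in for the failure object Scrapy would pass for an unsuccessful download.

```python
# Hand-written sample shaped like the `results` argument of item_completed.
# All values are invented for illustration.
sample_results = [
    (True, {"url": "http://example.com/a.jpg", "path": "full/aaa.jpg", "checksum": "0" * 32}),
    (False, Exception("download failed")),  # stand-in for a failed download
]


def successful_paths(results):
    """Return the stored file path of every successful download."""
    return [info["path"] for ok, info in results if ok]


paths = successful_paths(sample_results)
print(paths)
```

Checking the status flag first matters because the second element is not a dict when a download fails, so indexing it with "path" would raise an error.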

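Next, hash the URL with MD5, which turns it into a fixed-length value that is, for practical purposes, unique.

You can create a separate directory in the project to hold helper functions like this one:

# -*- coding: utf-8 -*-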

import hashlib
def get_md5(url):
    if isinstance(url,str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()

if __name__ == "__main__":
    print(get_md5("www.baidu.com"))
Output:
dab19e82e1f9a681ee73346d3e7a575e

Then call this function and store the result in the item:

article_item["url_object_id"] = get_md5(response.url)
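The fixed-length property is easy to confirm with the standard library alone. url_fingerprint below is a hypothetical helper that does the same job as get_md5 above, and the second URL is an invented example of a longer address.

```python
import hashlib


def url_fingerprint(url):
    """Hypothetical helper: hex MD5 digest of a URL string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


short = url_fingerprint("http://blog.jobbole.com/112535/")
long_ = url_fingerprint("http://blog.jobbole.com/tag/linux/page/2/?utm_source=example")

# Both digests have the same length, whatever the length of the input URL.
print(len(short), len(long_))
```

Because the digest is deterministic, the same URL always maps to the same value, which makes url_object_id a convenient fixed-width key when the items are stored later.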
