
Scraping the novel Sheng Xu from the 6mao novel site with Python's Scrapy

With some free time on my hands I wanted to read a novel and have it on my computer. After searching for a while I couldn't find a site that offered downloads, so I decided to scrape the chapters myself and save them locally.

Sheng Xu, Chapter 1 "The Other-Shore Flower in the Desert" - Chen Dong - 6mao novel site:  http://www.6mao.com/html/40/40184/12601161.html

This is the page to scrape. Inspecting its structure shows the chapter title, the body text, and a "next chapter" link we can follow from page to page.
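The pager selector can be sanity-checked before writing the spider. As a rough stdlib-only sketch (the sample HTML below is my assumption about the site's pager markup, not copied from the real page), the "next chapter" URL is the third anchor inside the `s_page` div:

```python
from html.parser import HTMLParser

# Hypothetical snippet mimicking the pager: prev / index / next links
SAMPLE = '''
<div class="s_page">
  <a href="/html/40/40184/12601160.html">prev</a>
  <a href="/html/40/40184/">index</a>
  <a href="/html/40/40184/12601162.html">next</a>
</div>
'''

class PagerParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_pager = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # collect hrefs only while inside <div class="s_page">
        if tag == 'div' and attrs.get('class') == 's_page':
            self.in_pager = True
        elif tag == 'a' and self.in_pager:
            self.hrefs.append(attrs.get('href'))

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_pager = False

parser = PagerParser()
parser.feed(SAMPLE)
print(parser.hrefs[2])  # the third link is the "next chapter" URL
```

This is why the spider below indexes `nextPageURL[2]`.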

Then create the Scrapy project:
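A sketch of the scaffolding commands, assuming Scrapy is installed and using the names that appear throughout this post:

```shell
# generate the project skeleton, then a spider stub inside it
scrapy startproject sixmao
cd sixmao
scrapy genspider sixmaospider 6mao.com
```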

The spider, sixmaospider.py:

# -*- coding: utf-8 -*-
import scrapy
from ..items import SixmaoItem


class SixmaospiderSpider(scrapy.Spider):
    name = 'sixmaospider'
    # allowed_domains = ['www.6mao.com']
    start_urls = ['http://www.6mao.com/html/40/40184/12601161.html']  # Sheng Xu, chapter 1

    def parse(self, response):
        # chapter title and body text
        novel_biaoti = response.xpath('//div[@id="content"]/h1/text()').extract()
        novel_neirong = response.xpath('//div[@id="neirong"]/text()').extract()

        novelitem = SixmaoItem()
        novelitem['novel_biaoti'] = novel_biaoti[0]
        print(novelitem['novel_biaoti'])
        # the body text nodes alternate content and blank lines, so keep every second one
        for i in range(0, len(novel_neirong), 2):
            novelitem['novel_neirong'] = novel_neirong[i]
            yield novelitem

        # next chapter
        nextPageURL = response.xpath('//div[@class="s_page"]/a/@href').extract()  # pager links
        if len(nextPageURL) > 2:
            nexturl = 'http://www.6mao.com' + nextPageURL[2]
            print('next chapter', nexturl)
            url = response.urljoin(nexturl)
            # request the next page and keep parsing it with parse()
            yield scrapy.Request(url, self.parse, dont_filter=False)
        else:
            print('done')
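The `range(0, len(novel_neirong), 2)` loop relies on the extracted text nodes alternating real paragraphs with whitespace-only nodes. A minimal sketch of that assumption (the sample list is made up, not real page data):

```python
# Assumption: text() on the #neirong div yields paragraph text
# interleaved with whitespace-only nodes, so the spider keeps
# every second element.
novel_neirong = ['first paragraph', '\r\n',
                 'second paragraph', '\r\n',
                 'third paragraph', '\r\n']
paragraphs = [novel_neirong[i] for i in range(0, len(novel_neirong), 2)]
print(paragraphs)  # ['first paragraph', 'second paragraph', 'third paragraph']
```

If the real page interleaves nodes differently, the step of 2 would need adjusting.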

pipelinesio.py, which saves the content to a local file:

import os
print(os.getcwd())


class SixmaoPipeline(object):
    def process_item(self, item, spider):
        # the with-block flushes and closes the file automatically
        with open('./data/聖墟.txt', 'a', encoding='utf-8') as fp:
            fp.write(item['novel_neirong'])
        print('wrote item to file')
        return item
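One caveat with this pipeline: it reopens the file for every item, and it fails if the ./data directory does not exist yet. A sketch of a variant (class name and file name are my own, hypothetical choices) using the standard Scrapy pipeline hooks open_spider/close_spider to create the directory and open the file once per crawl:

```python
import os

class SixmaoFilePipeline(object):
    """Open the output file once per crawl instead of once per item."""

    def open_spider(self, spider):
        # make sure the output directory exists before opening the file
        os.makedirs('./data', exist_ok=True)
        self.fp = open('./data/shengxu.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['novel_neirong'])
        return item

    def close_spider(self, spider):
        self.fp.close()
```

Registering it in ITEM_PIPELINES works the same way as for SixmaoPipeline.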

items.py:

import scrapy


class SixmaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    novel_biaoti=scrapy.Field()
    novel_neirong=scrapy.Field()
    pass

startsixmao.py; right-click this file and run it, and the crawl starts:

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'sixmaospider'])

settings.py:

LOG_LEVEL = 'INFO'     # enable logging at INFO level
LOG_FILE = 'novel.log'

DOWNLOADER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default
    'sixmao.rotate_useragent.RotateUserAgentMiddleware': 400,  # this line enables user-agent rotation
}


ITEM_PIPELINES = {
    # 'sixmao.pipelines.SixmaoPipeline': 300,
    'sixmao.pipelinesio.SixmaoPipeline': 300,
}  # register the output pipeline here

SPIDER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoSpiderMiddleware': 543,
}  # enable the spider middleware; nothing else should need changing

rotate_useragent.py rotates the User-Agent header so the server is less likely to block the crawler:

# import the random module
import random
# import the UserAgentMiddleware class from the useragent downloader middleware module
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# RotateUserAgentMiddleware subclasses UserAgentMiddleware.
# Purpose: keep a list of User-Agent strings, pick one at random for each
# request, and attach it to the request headers so every request the
# crawler sends looks like it comes from an ordinary browser.

# Anti-anti-crawler note: many sites detect and block requests that look
# like bots, so disguising the crawler with a rotating browser User-Agent
# helps it keep access.
class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # rotate the user-agent at random for every request
        ua = random.choice(self.user_agent_list)
        if ua:
            # log the user-agent that was picked
            print(ua)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape
    # more user-agent strings are listed at http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
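One pitfall to watch for in a list this long: Python silently concatenates adjacent string literals, so a single missing comma fuses two user agents into one entry and the middleware will occasionally send that broken double string. A quick illustration:

```python
# A missing comma fuses two adjacent string literals into ONE list entry.
agents = [
    "UA-one "
    "UA-two",   # no comma after "UA-one " above, so these two merge
    "UA-three",
]
print(len(agents))  # 2, not 3
print(agents[0])    # UA-one UA-two
```

Double-checking the commas after pasting a long user-agent list avoids this quietly wrong behavior.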

The final run output:

And there you have it: a small Scrapy project.