(3). Recursively fetching all page numbers

阿新 • Published: 2018-07-03


# -*- coding: utf-8 -*-
import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Search all descendants for every div with id="dig_lcpage",
        # find every a tag inside those divs,
        # and take the href attribute of each a tag.
        # extract() returns the matches as a list of strings.
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()

        for url in res:
            print(url)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/2
        '''
        # There is a duplicate here. We start on page 1 and the pager shows
        # ten pages at a time, so the "next page" link points to page 2.
        # That is why the href for page 2 appears twice.

        # We can use a set to detect the duplicates.
        urls = set()

        for url in res:
            if url in urls:
                print(f"{url}--this URL already exists")
            else:
                urls.add(url)
                print(url)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/2--this URL already exists

        '''
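
The set-based check above does not need Scrapy to try out. Here is a minimal standalone sketch; the helper name `unique_in_order` and the sample hrefs are only illustrative and are not part of the spider.

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates dropped, keeping first-seen order."""
    seen = set()
    unique = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            unique.append(href)
    return unique


# Sample hrefs shaped like the output above: page 2 appears twice.
hrefs = [f"/all/hot/recent/{n}" for n in range(2, 11)] + ["/all/hot/recent/2"]
print(unique_in_order(hrefs))
```

A set gives constant-time membership checks on average, which is why it is used here instead of searching a list.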

# -*- coding: utf-8 -*-
import hashlib

import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Above we compared the URLs directly, but usually we do not do that.
        # URLs may be kept in a cache or in a database, and long URLs take up
        # space, so we hash each URL and compare the fixed-length digests.
        # (MD5 is a hash function, not encryption.)
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()

        # We can also select the a tags we want directly.
        # Find the a tags whose href starts with "/all/hot/recent/":
        # res = response.xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract()
        #
        # Or match the href with a regular expression; re:test is the fixed
        # syntax for this (note the raw string, because of \d):
        # res = response.xpath(r'//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()

        md5_urls = set()
        for url in res:
            md5_url = self.md5(url)
            if md5_url in md5_urls:
                print(f"{url}--this URL already exists")
            else:
                md5_urls.add(md5_url)
                print(url)

    def md5(self, url):
        # Return the hex MD5 digest of the URL.
        m = hashlib.md5()
        m.update(url.encode("utf-8"))
        return m.hexdigest()
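
To see why hashing helps with long URLs, here is a small standalone sketch using only the standard library. The function name `url_fingerprint` and the long query string are made up for illustration. Whatever the length of the input, the hex digest is always 32 characters.

```python
import hashlib


def url_fingerprint(url: str) -> str:
    """Return the hex MD5 digest of a URL (always 32 hex characters)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


short = url_fingerprint("/all/hot/recent/2")
# A deliberately long, made-up URL to show that the digest size stays fixed.
long_ = url_fingerprint("/all/hot/recent/2?" + "x=1&" * 500)
print(len(short), len(long_))
```

MD5 is used here only as a compact fingerprint for deduplication; it should not be relied on for anything security related.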

# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # parse runs again for every page during the recursive crawl,
    # so md5_urls must not be defined inside parse.
    md5_urls = set()

    def parse(self, response):

        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()

        for url in res:
            md5_url = self.md5(url)
            if md5_url not in self.md5_urls:
                print(url)
                self.md5_urls.add(md5_url)
                # Hand the new URL to the scheduler.
                url = "https://dig.chouti.com%s" % url
                yield Request(url=url, callback=self.parse)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/1
        /all/hot/recent/11
        /all/hot/recent/12
        ........
        ........
        ........
        /all/hot/recent/115
        /all/hot/recent/116
        /all/hot/recent/117
        /all/hot/recent/118
        /all/hot/recent/119
        /all/hot/recent/120

        '''

    def md5(self, url):
        # Return the hex MD5 digest of the URL.
        m = hashlib.md5()
        m.update(url.encode("utf-8"))
        return m.hexdigest()
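
The recursion can be tried without a network connection. The sketch below is only a simulation under stated assumptions: `pager_links` is a hypothetical model of the site's pager (pages within 4 of the current one, and at least pages 1 to 10), `LAST_PAGE = 120` is taken from the output above, and a plain queue stands in for the Scrapy scheduler. What it shows is that one shared `seen` set is enough to make the repeated "parse" step terminate.

```python
from collections import deque

LAST_PAGE = 120  # highest page number in the output above


def pager_links(page):
    """Hypothetical pager: pages within 4 of the current one, at least 1-10."""
    low = max(1, page - 4)
    high = min(LAST_PAGE, max(10, page + 4))
    return [n for n in range(low, high + 1) if n != page]


def discover_pages(start=1):
    seen = set()            # shared by every step, like the class-level md5_urls
    order = []              # pages in the order they were first discovered
    queue = deque([start])  # stands in for the Scrapy scheduler
    while queue:
        page = queue.popleft()
        for link in pager_links(page):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order


order = discover_pages()
print(len(order), order[:12])
```

If `seen` were created inside the loop body for each page, every page would queue its neighbours again and the loop would never finish, which is the same reason `md5_urls` lives on the class rather than inside `parse`.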

As you can see, the spider found every page number. I do not want it to find all of them, though, so we can limit the crawl depth.


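Add DEPTH_LIMIT = 2 to settings. This limits the crawl to two levels of depth, that is, after the current ten pages are done the spider goes two levels further.

If DEPTH_LIMIT < 0, only one level is crawled. If it equals 0 (the default), there is no limit and everything is crawled. If it is greater than 0, the crawl goes down to the given depth.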
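
A minimal sketch of the setting, assuming a standard Scrapy project layout in which `settings.py` is the project's settings module:

```python
# settings.py
# Maximum crawl depth; 0 (the default) means no limit.
DEPTH_LIMIT = 2
```

The same key can also be set for a single spider through its `custom_settings` class attribute, which is handy when other spiders in the project should stay unlimited.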

# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # parse runs again for every page during the recursive crawl,
    # so md5_urls must not be defined inside parse.
    md5_urls = set()

    def parse(self, response):

        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()

        for url in res:
            md5_url = self.md5(url)
            if md5_url not in self.md5_urls:
                print(url)
                self.md5_urls.add(md5_url)
                # Hand the new URL to the scheduler.
                url = "https://dig.chouti.com%s" % url
                yield Request(url=url, callback=self.parse)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/1
        /all/hot/recent/11
        /all/hot/recent/12
        /all/hot/recent/13
        /all/hot/recent/14
        /all/hot/recent/15
        /all/hot/recent/16
        /all/hot/recent/17
        /all/hot/recent/18

        '''

    def md5(self, url):
        # Return the hex MD5 digest of the URL.
        m = hashlib.md5()
        m.update(url.encode("utf-8"))
        return m.hexdigest()


So once the current ten pages have been crawled, going one level deeper reaches page 14, and one level deeper again reaches page 18.
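
One way to see where 14 and 18 come from is to simulate the depth rule offline. This is only a sketch under assumptions: `pager_links` is a hypothetical model of the pager (pages within 4 of the current one, and at least pages 1 to 10), and a plain list stands in for the scheduler. It models the rule that a request's depth is its parent's depth plus one, with the start page at depth 0, and that requests deeper than the limit are dropped. Links are recorded before the depth check because the spider prints each URL before yielding the request.

```python
def pager_links(page, last_page=120):
    """Hypothetical pager: pages within 4 of the current one, at least 1-10."""
    low = max(1, page - 4)
    high = min(last_page, max(10, page + 4))
    return [n for n in range(low, high + 1) if n != page]


def crawl_with_depth_limit(depth_limit, start=1):
    seen = set()
    printed = []             # links printed by parse, before the depth check
    fetched = [start]        # pages that are actually requested
    frontier = [(start, 0)]  # (page, depth); the start page has depth 0
    while frontier:
        page, depth = frontier.pop(0)
        for link in pager_links(page):
            if link in seen:
                continue
            seen.add(link)
            printed.append(link)
            if depth + 1 <= depth_limit:  # deeper requests are dropped
                fetched.append(link)
                frontier.append((link, depth + 1))
    return printed, fetched


printed, fetched = crawl_with_depth_limit(2)
print(max(fetched), max(printed))
```

Under these assumptions the links printed run up to page 18, matching the output above, while the highest page actually requested is 14, since the links found on pages 11 to 14 would be at depth 3.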
