利用爬蟲爬取看看豆網站站的資料資訊

阿新 • • 發佈：2019-02-10

其實很早我就開始關注爬蟲技術，這兩天特別學習了一下，並且做了一個簡單的demo。爬取了看看豆網站的資料資訊。總共11751本書，爬取了不到3個小時，基本每秒爬取1條。速度慢的原因主要是單執行緒，使用mysql資料庫。想要提高速度的話可以使用多執行緒和redis。但是對於初學者來說只要能爬取下來就很不錯了。在這裡我使用了一個爬蟲框架---phpspider。

爬取完成後，我把資料從資料庫中導成.csv格式。由於我的資料比較少，所以我是直接excle中做的分析。按道理來說應該使用JS的ECharts做資料分析的。但是我的Node.JS還沒有學習過。所以先簡單的進行測試。另，我已經把自己的demo上傳到了github。

https://github.com/XiaoTommy/phpspider

附:

demo原始碼如下：

<?php
ini_set("memory_limit", "1024M");
require dirname(__FILE__).'/../core/init.php';

/* Do NOT delete this comment */
/* 不要刪除這段註釋 */

$configs = array(
    'name' => '看豆豆',
//    'log_show' => false,
    'tasknum' => 1,
    //'save_running_state' => true,
    'domains' => array(
        'kankandou.com',
        'www.kankandou.com'
    ),
    'scan_urls' => array(
        'https://kankandou.com/'
    ),
    'list_url_regexes' => array(
        "https://kankandou.com/book/page/\d+"
    ),
    'content_url_regexes' => array(
        "https://kankandou.com/book/view/\d+.html",
    ),
    'max_try' => 5,
    //'export' => array(
    //'type' => 'csv',
    //'file' => PATH_DATA.'/qiushibaike.csv',
    //),
    //'export' => array(
    //'type'  => 'sql',
    //'file'  => PATH_DATA.'/qiushibaike.sql',
    //'table' => 'content',
    //),
    'export' => array(
        'type' => 'db',
        'table' => 'kankandou',
    ),
    'fields' => array(
        array(
            'name' => "book_name",
            'selector' => "//h1[contains(@class,'title')]/text()",
            'required' => true,
        ),
        array(
            'name' => "book_content",
            'selector' => "//div[contains(@class,'content')]/text()",
            'required' => true,
        ),
        array(
            'name' => "book_author",
            'selector' => "//p[contains(@class,'author')]/a",
            'required' => true,
        ),
        array(
            'name' => "book_img",
            'selector' => "//div[contains(@class,'img')]/a/img",
            'required' => true,
        ),
        array(
            'name' => "book_format",
            'selector' => "//p[contains(@class,'ext')]",
            'required' => true,
        ),
        array(
            'name' => "book_class",
            'selector' => "//p[contains(@class,'cate')]/a",
            'required' => true,
        ),
        array(
            'name' => "click_num",
            'selector' => "//i[contains(@class,'dc')]",
            'required' => true,
        ),
        array(
            'name' => "download_num",
            'selector' => "//i[contains(@class,'vc')]",
            'required' => true,
        ),
    ),
);

$spider = new phpspider($configs);

$spider->on_handle_img = function($fieldname, $img)
{
    $regex = '/src="(https?:\/\/.*?)"/i';
    preg_match($regex, $img, $rs);
    if (!$rs)
    {
        return $img;
    }

    $url = $rs[1];
    $img = $url;

    //$pathinfo = pathinfo($url);
    //$fileext = $pathinfo['extension'];
    //if (strtolower($fileext) == 'jpeg')
    //{
    //$fileext = 'jpg';
    //}
    //// 以納秒為單位生成隨機數
    //$filename = uniqid().".".$fileext;
    //// 在data目錄下生成圖片
    //$filepath = PATH_ROOT."/images/{$filename}";
    //// 用系統自帶的下載器wget下載
    //exec("wget -q {$url} -O {$filepath}");

    //// 替換成真是圖片url
    //$img = str_replace($url, $filename, $img);
    return $img;
};


$spider->start();

分析資料：我只是簡單的統計了一下，重點不是在這裡。其中點選次數和下載次數有點小錯誤，暫時還在排查錯誤中。

其中有任何不對的地方，望請指正。

利用爬蟲爬取看看豆網站站的資料資訊

利用爬蟲爬取看看豆網站站的資料資訊

Python爬蟲爬取美劇網站

python爬蟲爬取拉勾網站內容

福利向---Scrapy爬蟲爬取多級圖片網站

Python爬蟲爬取古詩文網站專案分享

Python爬蟲爬取51job招聘網站

利用python爬取實習僧網站上的資料

利用爬蟲爬取LOL官網上面板圖片

43.scrapy爬取鏈家網站二手房資訊-1

44.scrapy爬取鏈家網站二手房資訊-2

python爬蟲爬取京東店鋪商品價格資料(更新版)

python爬蟲爬取淘寶搜尋頁面商品資訊資料

python網路爬蟲爬取汽車之家的最新資訊和照片

Python爬蟲-爬取騰訊QQ招聘崗位資訊（Beautiful Soup）

Scrapy專案(鬥魚直播)---利用Spider爬取顏值下的美女資訊

利用python爬取IP地址歸屬地等資訊！

（python）如何利用python深入爬取自己想要的資料資訊

scrapy爬取愛上租網站的房源資訊（一）

利用Jsoup爬取天貓列表頁資料

利用python爬取我愛我家租賃房源資訊

利用爬蟲爬取看看豆網站站的資料資訊

相關推薦