Crawling Baidu Zhidao question-and-answer pairs with WebMagic and storing them in a database

阿新 • Published: 2019-01-20

(1) Define the entity that holds the crawled title stored in the database:

package shuju;

public class baidu {

    // Despite its name, this field holds the question title taken from the page.
    private String author;

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    @Override
    public String toString() {
        return "shuju [author=" + author + "]";
    }
}
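The post never shows the table that this entity is written to. The following is a minimal sketch, assuming a MySQL table named `test`.`shuju` with an `author` column (the names used by the INSERT statement in the DAO in section 2). The `id` column, the column size, the character set, and the class name `ShujuSchema` are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of the original post.
final class ShujuSchema {

    // Assumed table definition: only the table name and the `author`
    // column come from the post; everything else is a guess.
    static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS `test`.`shuju` ("
            + "`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY, "
            + "`author` VARCHAR(512)"
            + ") DEFAULT CHARSET=utf8";

    // Runs the DDL on an already opened connection.
    static void createIfMissing(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(CREATE_TABLE);
        }
    }
}
```

Calling `ShujuSchema.createIfMissing(conn)` once before the crawl starts would ensure that the inserts have a table to write to.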

(2) Connect to the database (the DAO file)

package shuju;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class baiduDao {

    private Connection conn = null;

    public baiduDao() {
        try {
            // Driver class of MySQL Connector/J 5.x; Connector/J 8+ names it com.mysql.cj.jdbc.Driver.
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/test?"
                    + "user=root&password=******&useUnicode=true&characterEncoding=UTF8";

            //String url = "jdbc:mysql://127.0.0.1:3306/test?user=root&password=******";

            conn = DriverManager.getConnection(url);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

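    // Inserts one record; returns the number of affected rows, or -1 on failure.
    public int add(baidu shuju) {
        if (conn == null) {
            return -1; // the constructor could not open a connection
        }
        String sql = "INSERT INTO `test`.`shuju` (`author`) VALUES (?)";
        // try-with-resources closes the statement after each insert.
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, shuju.getAuthor());
            return ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return -1;
    }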
}
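The DAO above embeds the user name and password in the JDBC URL. One alternative is to keep the URL free of credentials and pass them separately through `DriverManager.getConnection(String, Properties)`. The following is a minimal sketch of that idea; the class name `JdbcConfig` and its method names are hypothetical and do not appear in the original post.

```java
import java.util.Objects;
import java.util.Properties;

// Hypothetical helper, not part of the original post.
final class JdbcConfig {

    // Builds the same MySQL URL as the DAO, but without user and password.
    static String url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF8";
    }

    // Wraps the credentials in the Properties object that
    // DriverManager.getConnection(String, Properties) accepts.
    static Properties credentials(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", Objects.requireNonNull(user, "user"));
        props.setProperty("password", Objects.requireNonNull(password, "password"));
        return props;
    }
}
```

The constructor could then call `DriverManager.getConnection(JdbcConfig.url("localhost", 3306, "test"), JdbcConfig.credentials("root", password))`, with `password` read from wherever you keep secrets, for example an environment variable of your choosing.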

(3) The core of the custom crawler

package shuju;

import java.util.List;

import javax.management.JMException;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.monitor.SpiderMonitor;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
/**
 * The core of the custom crawler.
 */
public class baiduzhengti implements PageProcessor {
    // Pause between requests, in milliseconds.
    private Site site = Site.me().setSleepTime(1);

    public Site getSite() {
        return site;
    }

    int temp = 1;

    // Regular expression for the starting pages: the question list of one subject category.
    public static final String URL_LIST =
            "http://zhidao\\.baidu\\.com/list\\?"
            + "cid=110106\\&tag=\\w+";

    // Regular expression for the individual question pages linked from a list.
    public static final String URL_AIRICLE =
            "http://zhidao\\.baidu\\.com/question"
            + "/\\d+\\.html\\?fr=qlquick\\&entry"
            + "=qb_list_default";
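    // One DAO, and therefore one database connection, shared by all pages.
    private final baiduDao dao = new baiduDao();

    public void process(Page page) {
        // On a list page, collect the link of every entry in the question list
        // and add it to the crawl queue.
        List<String> pages = page.getHtml().xpath("//*[@class='question-list-item']").links().all();
        page.addTargetRequests(pages);

        // On a question page, the element with class 'ask-title' holds the title.
        String title = page.getHtml().xpath("//*[@class='ask-title']/text()").get();
        if (title == null) {
            // List pages have no title: skip them so that no empty row or file is written.
            page.setSkip(true);
            return;
        }
        page.putField("title", title);

        // Store the title in the database.
        baidu shuju = new baidu();
        shuju.setAuthor(title);
        dao.add(shuju);
        // Print the object to the console.
        System.out.println(shuju);
        // System.out.println(pages);
    }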

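    // Run this main method to see the crawl results on the console. The pause between
    // requests is the value passed to setSleepTime above.
    public static void main(String[] args) {
        // Spider is the entry class of the crawler; addUrl sets the start URL.
        // A Pipeline handles result output and persistence; the JsonFilePipeline used
        // here saves each result as a JSON file under the given directory.
        //
        // Class                      Description                                        Note
        // ConsolePipeline            Prints results to the console                      Results need a toString method
        // FilePipeline               Saves results to files                             Results need a toString method
        // JsonFilePipeline           Saves results to files as JSON
        // ConsolePageModelPipeline   (annotation mode) Prints results to the console
        // FilePageModelPipeline      (annotation mode) Saves results to files
        // JsonFilePageModelPipeline  (annotation mode) Saves results to files as JSON   Persisted fields need getters
        Spider oschinaSpider = Spider.create(new baiduzhengti())
                .addUrl("http://zhidao.baidu.com/list?cid=110106&tag=JSP")
                .addPipeline(new JsonFilePipeline("F:/data"));
        // The files saved locally can then be parsed further with jsoup.
        try {
            // Register the spider for JMX monitoring.
            SpiderMonitor.instance().register(oschinaSpider);
            // Set the number of threads.
            //oschinaSpider.thread(5);
            oschinaSpider.run();
        } catch (JMException e) {
            e.printStackTrace();
        }
    }
}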
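The crawler class declares the regular expressions `URL_LIST` and `URL_AIRICLE` but never applies them. The following is a minimal sketch, using only `java.util.regex`, of how the two patterns behave on sample URLs. The class name `UrlPatternCheck`, its method names, and the question id `12345` are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class UrlPatternCheck {

    // Same expression as URL_LIST in the crawler.
    static final Pattern LIST = Pattern.compile(
            "http://zhidao\\.baidu\\.com/list\\?cid=110106\\&tag=\\w+");

    // Same expression as URL_AIRICLE in the crawler.
    static final Pattern QUESTION = Pattern.compile(
            "http://zhidao\\.baidu\\.com/question/\\d+\\.html"
            + "\\?fr=qlquick\\&entry=qb_list_default");

    // True when the whole URL matches the list-page pattern.
    static boolean isListPage(String url) {
        return LIST.matcher(url).matches();
    }

    // True when the whole URL matches the question-page pattern.
    static boolean isQuestionPage(String url) {
        return QUESTION.matcher(url).matches();
    }

    public static void main(String[] args) {
        // The start URL used by the crawler.
        System.out.println(isListPage("http://zhidao.baidu.com/list?cid=110106&tag=JSP"));
        // A question URL with a made-up id and the expected query string.
        System.out.println(isQuestionPage(
                "http://zhidao.baidu.com/question/12345.html?fr=qlquick&entry=qb_list_default"));
        // The same question URL without the query string.
        System.out.println(isQuestionPage("http://zhidao.baidu.com/question/12345.html"));
    }
}
```

Because both patterns spell out the `http` scheme and the exact query string, a question link without `fr=qlquick&entry=qb_list_default` or an `https` link does not match. That is worth keeping in mind if you decide to filter target requests with these expressions.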
