A Simple Web Crawler Built on Jsoup

阿新 • Published: 2019-01-03

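I had never written a crawler before, but a new project needed a large amount of data scraped from web pages, so I spent a day learning how. The core of it turned out to be quite simple.
* Since I am writing this up as a blog post, I want to be thorough, out of responsibility to both myself and my readers.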

What is a crawler?

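As I understand it, a crawler obtains URLs from the internet and follows them, visiting pages one by one.
A page contains many links. For each link we can decide whether it is one we want, and then process it, visit it, and so on.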

for each link in all links on the current page
{
        if (this link is one we want && this link has not been visited yet)
        {
                process this link
                mark this link as visited
        }
}

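For a crawler:
1. First, you need to supply a seed link.
2. Look for the child links you want in the seed page.
3. Visit other pages through those child links and extract the content you need from them.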

The pieces involved are:

HTTP
A content parser

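The main process looks like this:

1. Take a seed URL, for example www.oschina.net.
2. Request the page with HttpClient. (The page is bound to contain other URLs, which can serve as seeds for the next round.)
3. Parse out what you want with regular expressions or jsoup. (You can extract the other URLs, all images on the page, and so on.)
4. Take the next seed URL obtained in step 3 and repeat from step 1.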
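The loop described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code used in the project: the class name `CrawlLoop`, the `fetchLinks` function (standing in for "request the page and parse its links") and the `maxPages` limit are all hypothetical names introduced here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper class; the name and signature are illustrative only.
class CrawlLoop {

    /**
     * Breadth-first crawl starting from a seed URL.
     *
     * @param seed       the first URL to visit
     * @param fetchLinks returns the links found on a page (fetch + parse)
     * @param maxPages   upper bound on the number of pages to visit
     * @return the URLs in the order they were visited
     */
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              int maxPages) {
        List<String> visitedOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);

        while (!queue.isEmpty() && visitedOrder.size() < maxPages) {
            String url = queue.poll();
            visitedOrder.add(url);
            for (String link : fetchLinks.apply(url)) {
                // Set.add returns false for URLs already queued or visited.
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visitedOrder;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> web = Map.of(
                "seed", List.of("a", "b"),
                "a", List.of("b", "seed"),
                "b", List.of("c"));
        // prints [seed, a, b, c]
        System.out.println(crawl("seed", u -> web.getOrDefault(u, List.of()), 10));
    }
}
```

In a real crawler, `fetchLinks` would wrap the HTTP request and the parsing step, and the `seen` set is what keeps the loop from revisiting pages that link back to each other.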

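Let's first look at how to fetch a whole page with HttpClient.
Before using HttpClient, you need to add the HttpClient jar to your project.

Since the HttpClient jar was updated, the instances you work with are CloseableHttpClient, so that is what we use here.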

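    // Imports used by get() and readResponse() (HttpClient 4.x)
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpStatus;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public static String get(String url) {
        String result = "";
        // try-with-resources closes the response first, then the client,
        // even when execute() throws
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             // Build the GET request and execute it to obtain the response
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            // Parse the body only when the request succeeded (HTTP 200)
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                // Read the result from the input stream
                result = readResponse(entity, "utf-8");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }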

    /**
     * Reads the response body from the entity's stream.
     * Note that line breaks are dropped: lines are concatenated as they are read.
     * @param resEntity the response entity; may be null
     * @param charset   the character set used to decode the stream
     * @return the body as a string ("" if there is no entity; on failure,
     *         whatever was read before the error)
     */
    private static String readResponse(HttpEntity resEntity, String charset) {
        if (resEntity == null) {
            return "";
        }
        StringBuilder res = new StringBuilder();
        // try-with-resources closes the reader on every path
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                resEntity.getContent(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                res.append(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return res.toString();
    }
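Because readResponse appends each line without its line break, text from adjacent lines ends up glued together. If the original layout matters, one option is to read characters in chunks instead of lines. The sketch below shows that idea using only the standard library; `StreamText` and `readAll` are hypothetical names, and the in-memory stream in `main` merely stands in for the stream an HTTP entity would provide.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
class StreamText {

    /** Reads the whole stream as text, keeping line breaks as they are. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder out = new StringBuilder();
        // try-with-resources closes the reader (and the stream) on every path.
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Surface the failure instead of returning a partial page.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<p>first</p>\n<p>second</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

With HttpClient 4.x on the classpath, `EntityUtils.toString(entity, "utf-8")` from `org.apache.http.util` does a similar job in one call.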

What this returns is the full page source, including the HTML (www.baidu.com):

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta http-equiv="content-type" content="text/html;charset=utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=Edge">
  <meta content="always" name="referrer">
  <link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css">
  <title>百度一下，你就知道</title>
 </head> 
 <body link="#0000cc"> 
  <div id="wrapper"> 
   <div id="head"> 
    <div class="head_wrapper"> 
     <div class="s_form"> 
      <div class="s_form_wrapper"> 
       <div id="lg"> 
        <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129"> 
       </div> 
       <form id="form" name="f" action="//www.baidu.com/s" class="fm"> 
        <input type="hidden" name="bdorz_come" value="1"> 
        <input type="hidden" name="ie" value="utf-8"> 
        <input type="hidden" name="f" value="8"> 
        <input type="hidden" name="rsv_bp" value="1"> 
        <input type="hidden" name="rsv_idx" value="1"> 
        <input type="hidden" name="tn" value="baidu">
        <span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off" autofocus></span>
        <span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn"></span> 
       </form> 
      </div> 
     </div> 
     <div id="u1"> 
      <a href="http://news.baidu.com" name="tj_trnews" class="mnav">新聞</a> 
      <a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a> 
      <a href="http://map.baidu.com" name="tj_trmap" class="mnav">地圖</a> 
      <a href="http://v.baidu.com" name="tj_trvideo" class="mnav">視訊</a> 
      <a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">貼吧</a> 
      <noscript> 
       <a href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login" class="lb">登入</a> 
      </noscript> 
      <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登入</a>');</script> 
      <a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多產品</a> 
     </div> 
    </div> 
   </div> 
   <div id="ftCon"> 
    <div id="ftConw"> 
     <p id="lh"> <a href="http://home.baidu.com">關於百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> 
     <p id="cp">©2017&nbsp;Baidu&nbsp;<a href="http://www.baidu.com/duty/">使用百度前必讀</a>&nbsp; <a href="http://jianyi.baidu.com/" class="cp-feedback">意見反饋</a>&nbsp;京ICP證030173號&nbsp; <img src="//www.baidu.com/img/gs.gif"> </p> 
    </div> 
   </div> 
  </div>  
 </body>
</html>

It contains child links, text, and images all mixed together. That is not what we want, so it needs to be filtered.

Filtering with regular expressions

Everyone is probably familiar with regular expressions. In Java, the two classes Pattern and Matcher (in java.util.regex) exist to serve them.

        String regexStr = "[abdh]e";
        String targetStr = "hello world";
        // Compile the regular expression into a Pattern
        Pattern pattern = Pattern.compile(regexStr);
        // Create a Matcher that applies the pattern to the target string
        Matcher matcher = pattern.matcher(targetStr);
        if (matcher.find()) {
            System.out.println(matcher.group()); // prints "he"
        }
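Applied to HTML, the same two classes could pull links out of a page roughly as follows. This is only a sketch to show what regex-based filtering involves: `RegexLinks`, `extract` and the pattern itself are assumptions made for illustration, and the pattern deliberately handles only double-quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class RegexLinks {
    // Matches href="..." inside an <a ...> tag; double-quoted values only.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchors the pattern recognises. */
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the quoted URL
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.baidu.com\" class=\"mnav\">news</a>"
                + "<a class='x' href='/single'>single-quoted</a>"
                + "<a name=\"n\" href=\"http://map.baidu.com\">map</a>";
        // The single-quoted link is silently missed by this pattern.
        System.out.println(extract(html));
    }
}
```

Every variation in how a page writes its tags (single quotes, unquoted values, attributes split across lines) needs another tweak to the pattern, which is the kind of upkeep an HTML parser avoids.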

I mention it only so you know it exists; it is not pleasant to use for this job. What we will actually use is Jsoup.

Jsoup

To use Jsoup, you first need to add its jar to your project.
Jsoup already wraps the methods needed to fetch the content of a web page:

        String url = "http://www.baidu.com";
        org.jsoup.nodes.Document doc = Jsoup.connect(url).get();

Convenient, isn't it? Everything is in the Document.
Jsoup also defines query methods for traversing the content of the document.

For example, if what you need is the child links:

        String url = "http://www.baidu.com";
        org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        System.out.println(links.size());
        for (org.jsoup.nodes.Element link : links) {
            System.out.println(link.attr("abs:href") + " " + link.text());
        }

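That is all it takes: links now holds every child link that matched the selector, and the loop prints them (this output comes from a different site I crawled):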


http://www.haodou.com/recipe/839235/  豬蹄花生煲
http://www.haodou.com/recipe/287246/  銀耳陳皮生薑燉梨
http://www.haodou.com/recipe/160157/  心太軟
http://www.haodou.com/recipe/260392/  韓式蜂蜜大棗茶
http://www.haodou.com/recipe/296090/  紅棗蓮子銀耳羹
http://www.haodou.com/recipe/3717/  蜂蜜柚子茶
http://www.haodou.com/recipe/271602/  姜棗桂圓湯
http://www.haodou.com/recipe/273075/  紅棗當歸粥
http://www.haodou.com/recipe/267272/  米酒南瓜紅棗湯
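The `abs:href` key used in the loop asks jsoup for the link resolved against the page's base URI, which is why every printed URL is absolute even when the page writes it as `/recipe/...` or `//host/...`. What that resolution amounts to can be illustrated with the standard library's `java.net.URI`. This is an illustration of the idea, not jsoup's own code; `AbsoluteLinks`, `resolve` and the page URL in `main` are assumptions.

```java
import java.net.URI;

// Hypothetical helper for illustration only.
class AbsoluteLinks {

    /** Resolves an href as written in the page against the page's own URL. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed page URL; the hrefs are in the forms seen in the Baidu page source.
        String page = "http://www.baidu.com/index.html";
        System.out.println(resolve(page, "//www.baidu.com/more/")); // scheme-relative
        System.out.println(resolve(page, "/duty/"));                // host-relative
        System.out.println(resolve(page, "http://news.baidu.com")); // already absolute
    }
}
```

Storing absolute URLs matters for a crawler because the next request has to be made without the context of the page the link came from.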

The examples above come from my Android course project. From all the links on the seed page, we filter down to the ones we need:

public static boolean retrieveLink(org.jsoup.nodes.Element link) {
        String url = "http://www.haodou.com/recipe/";
        // Resolve the absolute link once instead of on every check
        String href = link.attr("abs:href");
        // Keep recipe links, but drop anchors and the listing/category pages
        return href.contains(url)
                && !href.contains("#")
                && !href.contains("album")
                && !href.contains("knowledge")
                && !href.contains("all")
                && !href.contains("create")
                && !href.contains("food")
                && !href.contains("category")
                && !href.contains("top")
                && !href.contains("expert");
    }
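The same filter can also be written in a table-driven way, which makes the exclusion list easier to extend. The sketch below works on plain URL strings so it runs without jsoup; `RecipeLinkFilter` and `isRecipeLink` are hypothetical names, while the prefix and the excluded keywords are the ones from retrieveLink.

```java
import java.util.List;

// Hypothetical variant of retrieveLink that operates on plain strings.
class RecipeLinkFilter {
    private static final String PREFIX = "http://www.haodou.com/recipe/";

    // Substrings that mark anchors, listings and other non-recipe pages.
    private static final List<String> EXCLUDED = List.of(
            "#", "album", "knowledge", "all", "create",
            "food", "category", "top", "expert");

    /** True when the absolute URL points at a recipe page we want to crawl. */
    static boolean isRecipeLink(String absoluteHref) {
        return absoluteHref.contains(PREFIX)
                && EXCLUDED.stream().noneMatch(absoluteHref::contains);
    }

    public static void main(String[] args) {
        System.out.println(isRecipeLink("http://www.haodou.com/recipe/839235/"));    // true
        System.out.println(isRecipeLink("http://www.haodou.com/recipe/album/123/")); // false
    }
}
```

With jsoup in place, the argument would be `link.attr("abs:href")`.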

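The content we need lives in the page behind each child link, so for those pages we define a separate selector for each piece we want to extract: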

public static void getAll(String url) {
        try {
            // Sample links:
            //   http://www.haodou.com/recipe/869836/
            //   http://www.haodou.com/recipe/326177/
            org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
            // Photo of the finished dish
            Elements elements = doc.select(".recipe_cover");
            for (Element element : elements)
                System.out.println(element.attr("abs:src"));
            // Step photos
            elements = doc.select(".imit_m > img");
            for (Element element : elements)
                System.out.println(element.attr("abs:src"));
            // Tips
            elements = doc.select(".data");
            for (Element element : elements)
                System.out.println(element.text());
            // Cooking steps (first() returns null when nothing matches)
            Element prompt = doc.select(".prompt span").first();
            if (prompt != null)
                System.out.println(prompt.text());
            // Name of the dish
            Element title = doc.select("h1").first();
            if (title != null)
                System.out.println(title.text());
            // Short description of the dish
            Element intro = doc.select("[data]").first();
            if (intro != null)
                System.out.println(intro.text());
            // Ingredients
            Elements links = doc.select(".ingtbur");
            for (Element element : links)
                System.out.println(element.text());

            // Steps
            links = doc.select(".sstep");
            for (Element element : links)
                System.out.println(element.text());
            // Tags
            links = doc.select("p > a[href]");
            for (Element element : links) {
                if (element.attr("abs:href").contains("http://www.haodou.com/recipe/all")) {
                    System.out.println(element.toString());
                }
            }
            elements = doc.select(".quantity");
            for (Element element : elements)
                System.out.println(element.text());
            elements = doc.select(".pop_tags a");
            for (Element element : elements)
                System.out.println(element.text());
            // last() also returns null when nothing matches
            Element last = doc.select(".des span").last();
            if (last != null)
                System.out.println(last.text());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

The result:

869836,炸茄盒炸茄盒外焦裡嫩，入口不膩，咬上一口，淡淡的肉香和茄子的清香同時浮現，讓人心動不已。雞蛋1個姜適量蒜子適量蔥適量鹽適量白糖適量料酒適量醬油適量老抽適量胡椒粉適量麻油適量麵粉適量生粉適量1.首先我們將豬肉剁成肉泥、姜切成姜米、蔥切蔥花、蒜子切成蒜末、茄子去皮，然後在每一小段中間切一刀，但不要切斷，做成茄盒。2.然後我們來製作肉餡：將豬肉泥放入盤中，加入姜米、蒜末、蔥花、少許鹽、少許白糖、適量料酒、少量醬油、少量老抽、少許胡椒粉、淋入食用油、麻油，抓勻，接著再加入生粉，抓打至肉餡變得粘稠。3.將茄夾中間抹上生粉，用肉餡填滿茄夾。4.填滿肉餡後，再逐一將整個茄盒抹上生粉。5.接著我們來調麵糊：準備一個碗，在碗中放入適量麵粉、適量生粉、半勺鹽、一個雞蛋拌勻，再加少許清水，拌至粘稠狀，備用。6.鍋中放入半鍋油，燒至8成熱後，將茄盒放入麵糊中裹上面糊，再逐個下油鍋中，炸至茄盒表面呈金黃色後撈出瀝油。這道菜就完成了。菜譜大全健脾開胃兒童減肥宴請家常菜小吃炸白領私房菜聚會

This is just one sample. In total I crawled 6000+ dishes, including image and link resources.

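Of course, if you want other content from a page, there are many more options.
My advice is to fetch the whole page first, then pick the parts you want based on its HTML.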

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a elements that have an href attribute
Elements pngs = doc.select("img[src$=.png]"); // images whose src ends with .png

Element masthead = doc.select("div.masthead").first(); // the first div with class masthead

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class r

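Notes (this is only part of it; for more detail, see the jsoup reference documentation):

jsoup elements support a selector syntax similar to CSS (or jQuery), which enables very powerful and flexible lookups.

The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter within a chosen element or chain selections.

select returns an Elements collection, which provides a set of methods for extracting and processing the results.

Selector overview
tagname: find elements by tag, e.g. a
ns|tag: find elements by tag within a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: find elements that have the attribute, e.g. [href]
[^attr]: find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: find elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: find elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements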
