Implementing a Web Crawler in Java

阿新 • Published: 2019-01-27

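I have more or less finished learning how to set up J2EE.

Next I want to learn about crawlers. To study a technology, you first need to understand how it works.

So how does a web crawler work?

A web crawler is a program that fetches web pages automatically. It downloads pages from the World Wide Web for a search engine and is an important part of one. A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them. While crawling, it keeps extracting new URLs from the current page and adding them to a queue until the system's stop condition is met.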
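The loop described above can be sketched roughly as follows. This is a minimal illustration, not the crawler built later in this post: the class name `FrontierSketch`, the `maxPages` stop condition, and the in-memory `links` map that stands in for real page downloads are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FrontierSketch {
    /**
     * Breadth-first crawl over a pretend web. The links map gives, for each
     * page URL, the URLs found on that page (a stand-in for download + parse).
     */
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();          // URLs already queued or fetched
        List<String> fetched = new ArrayList<>();    // fetch order, for inspection

        frontier.add(seed);
        seen.add(seed);
        // Hypothetical stop condition: the queue is empty or maxPages were fetched
        while (!frontier.isEmpty() && fetched.size() < maxPages) {
            String url = frontier.poll();
            fetched.add(url);
            for (String link : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(link)) { // add() returns false for a URL seen before
                    frontier.add(link);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c", "d"));
        System.out.println(crawl(links, "a", 3));
    }
}
```

The `seen` set is what keeps a page that is linked from several places from being queued more than once.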

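As I study how crawlers are implemented, I will record the problems I run into and how I solved them. Here we go ^_^!!

I found a demo online to start with. The following Java program extracts the links on the Sina home page and stores them in a file: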

package testspider;

/**
 * Descriptions
 *
 * @version 2017-03-31
 * @since JDK1.6
 *
 */
import java.io.BufferedReader;  
import java.io.FileWriter;  
import java.io.IOException;  
import java.io.InputStreamReader;  
import java.io.PrintWriter;  
import java.net.MalformedURLException;  
import java.net.URL;  
import java.net.URLConnection;  
import java.util.regex.Matcher;  
import java.util.regex.Pattern;  
  
public class WebSpider {  
    public static void main(String[] args) {  
        URL url = null;  
        URLConnection urlconn = null;  
        BufferedReader br = null;  
        PrintWriter pw = null;  
        String regex = "http://[\\w+\\.?/?]+\\.[A-Za-z]+";  
        Pattern p = Pattern.compile(regex);  
        try {  
            url = new URL("http://www.sina.com.cn/");  
            urlconn = url.openConnection();  
            pw = new PrintWriter(new FileWriter("f:/url.txt"), true); // the collected links are saved to url.txt on the F: drive
            br = new BufferedReader(new InputStreamReader(  
                    urlconn.getInputStream()));  
            String buf = null;  
            while ((buf = br.readLine()) != null) {  
                Matcher buf_m = p.matcher(buf);  
                while (buf_m.find()) {  
                    pw.println(buf_m.group());  
                }  
            }  
            System.out.println("Links fetched successfully!");
        } catch (MalformedURLException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
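        } finally {
            // br and pw stay null if the connection or the file could not be opened
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            if (pw != null) {
                pw.close();
            }
        }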
    }  
}

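Create a Java project, paste the code in, and run it. It collects the http:// URLs found on the Sina home page and stores them in F:/url.txt.

Pick any of the URLs and open it, for example http://beacon.sina.com.cn/a.gif

The image loads. This is only a simple crawler; next I will look more closely at how one is implemented.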
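To see what the demo's regular expression actually picks up, here is a small sketch that runs the same expression over an in-memory string instead of a live page. The class name `LinkExtractSketch`, the de-duplication step, and the sample HTML (including the `/news/index.html` path) are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractSketch {
    // Same expression as the demo above; it only recognises plain http:// links
    private static final Pattern LINK =
            Pattern.compile("http://[\\w+\\.?/?]+\\.[A-Za-z]+");

    /** Returns each distinct link in html, in order of first appearance. */
    static List<String> extract(String html) {
        Set<String> unique = new LinkedHashSet<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            unique.add(m.group());
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Made-up sample markup; the second image is a deliberate duplicate
        String html = "<a href=\"http://www.sina.com.cn/news/index.html\">News</a>"
                + "<img src=\"http://beacon.sina.com.cn/a.gif\">"
                + "<img src=\"http://beacon.sina.com.cn/a.gif\">";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Because the expression is anchored on `http://`, links that use `https://` or that are relative are not matched at all.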

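Web crawler:

Development tools: Eclipse, JDK 1.6

The demo I found online does not use a server, so I won't either. No database is involved; the crawled data is saved to a local directory.

First, create a Java project. The first class fetches the content of the page at a given URL.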

package webspilder;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
    
@SuppressWarnings("deprecation")
public class DownloadPage  
{  
    
     /**  
      * Fetches the content of the page at the given URL
      *   
      * @param url  
      * @return  
      */
     public static String getContentFormUrl(String url)  
     {  
         /* Create an HttpClient instance */
         @SuppressWarnings({"resource"})
         HttpClient client = new DefaultHttpClient();
         HttpGet getHttp = new HttpGet(url);  
    
         String content = null;  
    
         HttpResponse response;  
         try
         {  
             /* Execute the request and get the response entity */
             response = client.execute(getHttp);  
             HttpEntity entity = response.getEntity();  
    
             VisitedUrlQueue.addElem(url);  
    
             if (entity != null)  
             {  
                 /* Convert the entity to text */
                 content = EntityUtils.toString(entity);

                 /* Check whether this page's source should be saved locally */
                 if (FunctionUtils.isCreateFile(url)) 
                	 //&& FunctionUtils.isHasGoalContent(content) != -1
                 {  
                     FunctionUtils.createFile(FunctionUtils  
                             .getGoalContent(content), url);  
                 }  
             }  
    
         } catch (ClientProtocolException e)  
         {  
             e.printStackTrace();  
         } catch (IOException e)  
         {  
             e.printStackTrace();  
         } finally  
         {  
             client.getConnectionManager().shutdown();  
         }  
             
         return content;  
     }  
    
}
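One detail the class above leaves to the library is the character encoding: `EntityUtils.toString(entity)` decodes with whatever charset the response declares and otherwise uses a library default, which can garble pages in another encoding. The sketch below shows, using only the JDK, how a charset could be read from a `Content-Type` header value. The class name `CharsetSketch` and the fallback parameter are assumptions for illustration, not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSketch {
    // Finds "charset=..." in a header value such as "text/html; charset=UTF-8"
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Picks the charset named in a Content-Type header value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both malformed and unsupported charset names
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With HttpClient itself, passing an explicit default such as `EntityUtils.toString(entity, "UTF-8")` is a simpler option when the target sites are known to use one encoding.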

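The second class uses regular expressions to match URLs, and saves the downloaded page to a local file. If you have a database, you could store the page there instead.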

package webspilder;


import java.io.BufferedWriter;  
import java.io.File;  
import java.io.FileOutputStream;  
import java.io.IOException;  
import java.io.OutputStreamWriter;  
import java.util.regex.Matcher;  
import java.util.regex.Pattern;  
    
public class FunctionUtils  
{  
    
    /**  
     * Regular expression that matches a hyperlink
     */
    private static String pat = "http://([\\w*\\.]*[\\w*])";  
    private static Pattern pattern = Pattern.compile(pat);  
    
    private static BufferedWriter writer = null;  
    
    /**  
     * Crawl depth
     */
    public static int depth = 0;  
    
    /**  
     * Splits a URL on "/" into its elements
     *   
     * @param url  
     * @return  
     */
    public static String[] divUrl(String url)  
    {  
        return url.split("/");  
    }  
    
    /**  
     * Decides whether a file should be created for this URL
     *   
     * @param url  
     * @return  
     */
    public static boolean isCreateFile(String url)  
    {  
        Matcher matcher = pattern.matcher(url);  
    
        return matcher.matches();  
    }  
    
    /**  
     * Creates the file for a URL and writes the content to it
     *   
     * @param content  
     * @param urlPath  
     */
    public static void createFile(String content, String urlPath)  
    {  
        /* Split the URL */
        String[] elems = divUrl(urlPath);  
        StringBuffer path = new StringBuffer();  
    
        File file = null;  
        for (int i = 1; i < elems.length; i++)  
        {  
            if (i != elems.length - 1)  
            {  
    
                path.append(elems[i]);  
                path.append(File.separator);  
                file = new File("D:" + File.separator + path.toString());  
    
            }  
    
            if (i == elems.length - 1)  
            {  
                Pattern pattern = Pattern.compile("[\\w*\\.]*[\\w*]");  
                Matcher matcher = pattern.matcher(elems[i]);  
                if ((matcher.matches()))  
                {  
                    if (!file.exists())  
                    {  
                        file.mkdirs();  
                    }  
                    String fileName = elems[i];  
                    file = new File("D:" + File.separator + path.toString()  
                            + File.separator + fileName + ".html");  
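                    System.out.println("File saved to: " + file.getPath());
                    /* Use a local writer: the shared static field is not safe with several threads */
                    BufferedWriter out = null;
                    try
                    {
                        file.createNewFile();
                        out = new BufferedWriter(new OutputStreamWriter(
                                new FileOutputStream(file)));
                        out.write(content);
                        out.flush();
                        System.out.println("File created successfully");
                    } catch (IOException e)
                    {
                        e.printStackTrace();
                    } finally
                    {
                        if (out != null)
                        {
                            try
                            {
                                out.close();
                            } catch (IOException e)
                            {
                                e.printStackTrace();
                            }
                        }
                    }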
                    
                }  
            }  
    
        }  
    }  
    
    /**  
     * Converts a link found on a page into a complete (absolute) URL
     *   
     * @param href  
     * @return  
     */
    public static String getHrefOfInOut(String href)  
    {  
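        /* Internal and external links both end up as a complete URL */
        String resultHref = null;

        /* Is it an external (absolute) link? */
        if (href.startsWith("http://"))
        {
            resultHref = href;
        } else
        {
            /* Internal link: prepend the site address. Other forms, such as a href="#", are ignored */
            if (href.startsWith("/"))
            {
                // NOTE: the host is hard-coded, so relative links found on other sites get the wrong host
                resultHref = "http://www.oschina.net" + href;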
            }  
        }  
    
        return resultHref;  
    }  
    
    /**  
     * Extracts the target content (the html element) from the page source
     *   
     * @param content  
     * @return  
     */
    public static String getGoalContent(String content)
    {
        int start = content.indexOf("<html");
        int end = content.indexOf("</html>", Math.max(start, 0));

        /* Without a complete <html>...</html> pair, keep the whole source instead of throwing */
        if (start == -1 || end == -1)
        {
            return content;
        }

        return content.substring(start, end + 7);
    }
    
    /**  
     * Checks whether the page source contains the target content
     *   
     * @param content  
     * @return  
     */
    public static int isHasGoalContent(String content)  
    {  
        return content.indexOf("<");  
    }  
    
}
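`getHrefOfInOut` above completes site-relative links by prepending one fixed host. A more general approach, sketched here with the JDK's `java.net.URI`, is to resolve each link against the URL of the page it was found on. The class name `LinkResolveSketch` and the sample paths are assumptions for illustration.

```java
import java.net.URI;

final class LinkResolveSketch {
    /**
     * Resolves href against the URL of the page it was found on.
     * Returns null when href is not a valid URI reference.
     */
    static String resolve(String pageUrl, String href) {
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return null; // URI.create and resolve(String) reject malformed input
        }
    }

    public static void main(String[] args) {
        String page = "http://www.mi.com/a/index.html"; // hypothetical page URL
        System.out.println(resolve(page, "/service"));
        System.out.println(resolve(page, "detail.html"));
        System.out.println(resolve(page, "http://www.sina.com.cn/"));
    }
}
```

An absolute link passes through unchanged, so the same call handles both of the cases that `getHrefOfInOut` distinguishes by hand.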

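The next class extracts the links to other pages from the page fetched for a URL; these links drive a depth-first or breadth-first crawl.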

package webspilder;


public class HrefOfPage  
{  
    /**  
     * Extracts the hyperlinks from the page source
     */
    public static void getHrefOfContent(String content)  
    {  
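        System.out.println("Start");
        if (content == null)
        {
            return; // the download failed, so there is nothing to parse
        }
        String[] contents = content.split("<a href=\"");
        for (int i = 1; i < contents.length; i++)
        {
            int endHref = contents[i].indexOf("\"");
            if (endHref == -1)
            {
                continue; // unterminated href attribute
            }

            String aHref = FunctionUtils.getHrefOfInOut(contents[i].substring(
                    0, endHref));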
    
            if (aHref != null)  
            {  
                String href = FunctionUtils.getHrefOfInOut(aHref);  
    
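                // NOTE: only links containing "/code/explore" are queued, apparently a
                // filter left over from an oschina.net demo; change it for other sites
                if (!UrlQueue.isContains(href)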
                        && href.indexOf("/code/explore") != -1  
                        && !VisitedUrlQueue.isContains(href))  
                {  
                    UrlQueue.addElem(href);  
                }  
            }  
        }  
    
        System.out.println(UrlQueue.size() + " -- links queued");
        System.out.println(VisitedUrlQueue.size() + " -- pages processed");
    
    }  
    
}

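The next class is the worker that each thread runs: it takes URLs off the unvisited queue, downloads each page, and queues the links found on it.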

package webspilder;

public class UrlDataHanding implements Runnable  
{  
    /**  
     * Downloads the page and puts the URLs found on it into the unvisited queue.
     * @param url  
     */
    public void dataHanding(String url)  
    {  
            HrefOfPage.getHrefOfContent(DownloadPage.getContentFormUrl(url));  
    }  
            
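    public void run()
    {
        while (true)
        {
            String url;
            /* Check and remove under one lock, otherwise another thread can
               empty the queue between isEmpty() and outElem() */
            synchronized (UrlQueue.class)
            {
                if (UrlQueue.isEmpty())
                {
                    break;
                }
                url = UrlQueue.outElem();
            }
            dataHanding(url);
        }
    }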
}

This class stores the URLs that have already been visited, to avoid repeats during a breadth-first crawl.

package webspilder;

import java.util.HashSet;  

/**  
* Set of visited URLs
* @author HHZ  
*  
*/
public class VisitedUrlQueue  
{  
    public static HashSet<String> visitedUrlQueue = new HashSet<String>();  
    
    public synchronized static void addElem(String url)  
    {  
        visitedUrlQueue.add(url);  
    }  
    
    public synchronized static boolean isContains(String url)  
    {  
        return visitedUrlQueue.contains(url);  
    }  
    
    public synchronized static int size()  
    {  
        return visitedUrlQueue.size();  
    }  
}
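Note that callers such as `HrefOfPage` check `isContains` and then call `addElem` as two separate steps, so two threads can both pass the check for the same URL. A single atomic "add if absent" call avoids that. The sketch below assumes Java 8 or later (the post targets JDK 1.6), and the class name `SeenUrlsSketch` is made up for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class SeenUrlsSketch {
    // Concurrent set view backed by a ConcurrentHashMap (Java 8+)
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller that reports this URL. */
    boolean markSeen(String url) {
        return seen.add(url);
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) throws InterruptedException {
        final SeenUrlsSketch urls = new SeenUrlsSketch();
        final AtomicInteger winners = new AtomicInteger();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            // Every thread reports the same URL; only one add() can succeed
            workers[i] = new Thread(() -> {
                if (urls.markSeen("http://www.mi.com")) {
                    winners.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println(winners.get() + " thread(s) saw the URL first");
    }
}
```

A worker would then queue a link only when `markSeen` returns true, which replaces the separate contains-then-add checks.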

------------------ Class added 2017-12-11: start ------------------

UrlQueue stores the URLs that have not been visited yet.

package webspilder;

/**
 * Descriptions
 *
 * @version 2017-03-31
 * @author 
 * @since JDK1.6
 *
 */
import java.util.LinkedList;  

public class UrlQueue  
{  
    /** Queue of hyperlinks waiting to be visited */
    public static LinkedList<String> urlQueue = new LinkedList<String>();  
        
    /** Maximum number of hyperlinks in the queue */
    public static final int MAX_SIZE = 10000;  
        
    public synchronized static void addElem(String url)  
    {  
        urlQueue.add(url);  
    }  
        
    public synchronized static String outElem()  
    {  
        return urlQueue.removeFirst();  
    }  
        
    public synchronized static boolean isEmpty()  
    {  
        return urlQueue.isEmpty();  
    }  
        
    public synchronized static int size()
    {
        return urlQueue.size();
    }

    public synchronized static boolean isContains(String url)
    {
        return urlQueue.contains(url);
    }
    
}

------------------ Class added 2017-12-11: end ------------------

The main function: run it to start crawling.

package webspilder;

public class Test
{
  public static void main(String[] args)
  {  
      String url = "http://baidu.com";  
      String url1 = "http://www.sina.com.cn";  
      String url2 = "http://finance.qq.com";  
      String url3 = "http://www.mi.com";  
          
          
      UrlQueue.addElem(url);  
      UrlQueue.addElem(url1);  
      UrlQueue.addElem(url2);  
      UrlQueue.addElem(url3);  
          
      UrlDataHanding[] url_Handings = new UrlDataHanding[10];  
          
          for(int i = 0 ; i < 10 ; i++)  
          {  
              url_Handings[i] = new UrlDataHanding();  
              new Thread(url_Handings[i]).start();  
          }  
    
  }  
}
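The `depth` field in `FunctionUtils` is declared but never used in the code shown, and each of the ten threads above exits as soon as it finds the queue empty. One alternative, sketched below under stated assumptions, is to crawl level by level with a fixed thread pool and stop after a given depth. This is not the author's design: the class name `LevelCrawlSketch`, the `fetchLinks` function (a stand-in for download plus link extraction), and the in-memory `web` map are all hypothetical, and the code needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class LevelCrawlSketch {
    /** Returns every URL discovered within maxDepth levels of the seeds. */
    static Set<String> crawl(Function<String, List<String>> fetchLinks,
                             List<String> seeds, int maxDepth, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> known = new LinkedHashSet<>(seeds); // discovery order
        List<String> level = new ArrayList<>(known);    // pages to fetch next
        try {
            for (int depth = 0; depth < maxDepth && !level.isEmpty(); depth++) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : level) {
                    tasks.add(() -> fetchLinks.apply(url)); // one fetch per page
                }
                List<String> next = new ArrayList<>();
                // invokeAll waits for the whole level and keeps task order
                for (Future<List<String>> page : pool.invokeAll(tasks)) {
                    for (String link : page.get()) {
                        if (known.add(link)) {
                            next.add(link);
                        }
                    }
                }
                level = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return known;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("d"));
        web.put("d", Arrays.asList("e"));
        System.out.println(crawl(
                url -> web.getOrDefault(url, Collections.<String>emptyList()),
                Arrays.asList("a"), 2, 4));
    }
}
```

Merging the results on the calling thread means the `known` set needs no locking; only the page fetches run in parallel.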

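This post crawls the Baidu, Sina, QQ Finance, and Xiaomi pages. On success they are saved to the local D: drive:

Screenshot of the run:

I saved the pages as HTML files; you could save them as txt files instead. Then check the D: drive:

The files were saved successfully. The main problem I ran into was writing the regular expressions; they matter a lot, so take care with them.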
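As an illustration of why the regular expressions need care, the sketch below applies the pattern from `FunctionUtils` in two ways. `isCreateFile` uses `matches()`, which requires the whole string to match, so only bare host URLs such as the four seeds pass. Inside a character class `*` is a literal asterisk rather than a quantifier, and `/` is not in the class, so any URL with a path fails. The class name `MatchVsFindSketch` and the `/index.html` path are assumptions for the example.

```java
import java.util.regex.Pattern;

final class MatchVsFindSketch {
    // The pattern from FunctionUtils above
    private static final Pattern HOST_ONLY =
            Pattern.compile("http://([\\w*\\.]*[\\w*])");

    /** True when the entire URL matches the pattern (what isCreateFile checks). */
    static boolean wholeStringMatches(String url) {
        return HOST_ONLY.matcher(url).matches();
    }

    /** True when the pattern matches somewhere inside the URL. */
    static boolean containsMatch(String url) {
        return HOST_ONLY.matcher(url).find();
    }

    public static void main(String[] args) {
        String bare = "http://www.mi.com";
        String withPath = "http://www.mi.com/index.html"; // hypothetical path
        System.out.println(bare + " matches: " + wholeStringMatches(bare));
        System.out.println(withPath + " matches: " + wholeStringMatches(withPath));
        System.out.println(withPath + " find: " + containsMatch(withPath));
    }
}
```

So with the code as posted, pages reached through links with a path are downloaded but never written to disk.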

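So far we have only downloaded the pages of these sites. How do we get the data we actually want out of them?

For this I used jsoup, a Java library for parsing HTML.

Problem

You have an HTML document that you want to extract data from, and you know the structure of the document.

Solution

Once the HTML is parsed into a Document, you can work with it using DOM-like methods. Sample code: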

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

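We parse the baidu.com.html file captured above into a Document object with jsoup, then use the DOM methods to extract the data we want.

For example, if we want the content of the <input> tags on the site, we can get it with the DOM methods.

Finally, here is a simple example written with jsoup: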

package webspilder;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Descriptions
 *
 * @version 2017-04-01
 * @author 
 * @since JDK1.6
 *
 */
public class Jsouptemp {
    // Read the links from a local file
    public static void getHrefByLocal()
    {
        File input = new File("D:\\www.mi.com.html");
        Document doc;
        try {
            // The base URI lets jsoup resolve relative links into absolute ones
            doc = Jsoup.parse(input, "UTF-8", "http://www.mi.com/");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the file cannot be read
        }
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String linkHref = link.attr("abs:href"); // href resolved against the base URI
            String linkText = link.text();
            System.out.println(linkText + ":" + linkHref);
        }
    } 
    public static void main(String[] args) {
    	getHrefByLocal();
	}
}

This prints the links in the <a> tags of the Xiaomi home page:

That's about it. Bye! ^_^
