Web Crawler: Crawling Web Page Links with Multiple Threads

阿新 • Published: 2019-01-15

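Preface:

After the previous two articles, I think you already have a good idea of what a web crawler is. This article improves on what we built before and explains where the earlier approach fell short.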

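Approach:

1. Logical structure diagram

The diagram above shows the overall logic of our crawler (the call into Python to parse a URL is shown only in simplified form).

2. Explanation of the approach:

First, let's recap the earlier approach. We used two Queue objects to hold the lists of visited and to-be-visited links, and recursively visited the pending links in breadth-first order. All of this ran on a single thread. On the database side, we added an auxiliary column, cipher_address, to guarantee "uniqueness", because we were worried that MySQL would not cope well with very long URLs.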

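I'm not sure whether that paragraph gives you a rough picture of how we handled the spider before. If it is still unclear, you can read the two earlier articles: 《網路爬蟲初步：從訪問網頁到資料解析》 ("Web Crawler Basics: From Fetching a Page to Parsing the Data") and 《網路爬蟲初步：從一個入口連結開始不斷抓取頁面中的網址併入庫》 ("Web Crawler Basics: Starting from One Entry Link, Continuously Extracting URLs from Pages and Storing Them in the Database").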
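The single-threaded approach recapped above can be sketched roughly as follows. This is only an illustration, not the code from the earlier articles: the `BfsSketch` class, the `crawl` method, and the in-memory `links` map (which stands in for actually fetching and parsing pages) are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: breadth-first traversal with a "to visit" queue and a "visited" set.
class BfsSketch {

    // links maps a URL to the URLs found on that page (a stand-in for fetching and parsing).
    static List<String> crawl(String start, Map<String, List<String>> links) {
        Queue<String> toVisit = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> order = new ArrayList<>();

        toVisit.offer(start);
        visited.add(start);

        while (!toVisit.isEmpty()) {
            String current = toVisit.poll();
            order.add(current);
            for (String next : links.getOrDefault(current, Collections.<String>emptyList())) {
                // Set.add returns false when the URL has already been seen
                if (visited.add(next)) {
                    toVisit.offer(next);
                }
            }
        }
        return order;
    }
}
```

Both collections live in memory and only ever grow, which is why this shape of solution runs into trouble once the number of links becomes large.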

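Here are the problems with the earlier approach:

1. Single thread: using a single thread is simply not a sensible choice, especially for a problem with this much data. We need multiple threads, and here we will use a thread pool.

2. Data storage: if we keep the data in memory, the volume is so large that the program will inevitably run out of memory while it runs. And that is exactly what happened in practice.

3. URL de-duplication: hashing each URL with MD5 or SHA-1 (these are hash functions rather than encryption) carries a potential efficiency cost. In truth the problem is not that complicated, and the impact on efficiency is small. Still, conveniently, Java's String type already has a hash function we can call directly: hashCode(). Keep in mind that hashCode() yields only a 32-bit value, so two different URLs can share the same hash; it works as a lookup key, not as proof of uniqueness.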
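To make the de-duplication idea concrete, here is a minimal sketch that uses `String.hashCode()` as the key and then confirms a match with a full string comparison. The `UrlDedupSketch` class and its `addIfAbsent` method are hypothetical names for this illustration; in the article the hash is stored in a database column instead of an in-memory map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: de-duplicate URLs keyed by String.hashCode(), confirmed with equals().
class UrlDedupSketch {

    // hash value -> all distinct URLs seen so far with that hash
    private final Map<Integer, List<String>> seen = new HashMap<>();

    // Returns true if the URL is new, false if it was already recorded.
    boolean addIfAbsent(String url) {
        List<String> bucket = seen.computeIfAbsent(url.hashCode(), k -> new ArrayList<>());
        if (bucket.contains(url)) {
            return false;
        }
        bucket.add(url);
        return true;
    }
}
```

The second comparison matters because different strings can hash to the same value: `"Aa".hashCode()` and `"BB".hashCode()` are both 2112. A check based on the hash alone would wrongly treat the second string as a duplicate.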

Code and explanation:

LinkSpider.java

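import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LinkSpider {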
	
    private SpiderQueue queue = null;
    
	/**
	 * Traverse all network links reachable from a starting node.
	 * LinkSpider
	 * @param startAddress
	 * 			 the starting link node
	 */
	public void ErgodicNetworkLink(String startAddress) {
	    if (startAddress == null) {
            return;
        }
	    
	    SpiderBLL.insertEntry2DB(startAddress);
	    
	    List<WebInfoModel> modelList = new ArrayList<WebInfoModel>();
		queue = SpiderBLL.getAddressQueue(startAddress, 0);
		if (queue.isQueueEmpty()) {
            System.out.println("Your address cannot get more address.");
            return;
        }
		
		ThreadPoolExecutor threadPool = getThreadPool();
		int index = 0;
        boolean breakFlag = false;
        
		while (!breakFlag) {
		    
		    // Handle the case where the to-visit queue is empty
		    if (queue.isQueueEmpty()) {
		        System.out.println("queue is null...");
		        modelList = DBBLL.getUnvisitedInfoModels(queue.MAX_SIZE);
		        if (modelList == null || modelList.size() == 0) {
                    breakFlag = true;
                } else {
                    for (WebInfoModel webInfoModel : modelList) {
                        queue.offer(webInfoModel);
                        DBBLL.updateUnvisited(webInfoModel);
                    }
                }
		    }
		    
			WebInfoModel model = queue.poll();
			
			if (model == null) {
                continue;
            }
			
			// Check whether this site has already been visited
			if (DBBLL.isWebInfoModelExist(model)) {
			    // Already visited: move on to the next iteration
			    System.out.println("Site already exists (" + model.getName() + ")");
				continue;
			}
			
			poolQueueFull(threadPool);
			
			System.out.println("LEVEL: [" + model.getLevel() + "] NAME: " + model.getName());
			SpiderRunner runner = new SpiderRunner(model.getAddress(), model.getLevel(), index++);
			threadPool.execute(runner);
			
			SystemBLL.cleanSystem(index);
			
			// Store the visited address in the database
			DBBLL.insert(model);
		}
		
		threadPool.shutdown();
	}
	
	/**
	 * Create a thread pool object.
	 * LinkSpider
	 * @return
	 */
	private ThreadPoolExecutor getThreadPool() {
	    final int MAXIMUM_POOL_SIZE = 520;
        final int CORE_POOL_SIZE = 500;
        return new ThreadPoolExecutor(CORE_POOL_SIZE, MAXIMUM_POOL_SIZE, 3, TimeUnit.SECONDS, new ArrayBlockingQueue<Runnable>(MAXIMUM_POOL_SIZE), new ThreadPoolExecutor.DiscardOldestPolicy());
	}
	
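	/**
	 * Wait while the thread pool's task queue is full.
	 * LinkSpider
	 * @param threadPool
	 *         the thread pool object
	 */
	private void poolQueueFull(ThreadPoolExecutor threadPool) {
	    while (getQueueSize(threadPool.getQueue()) >= threadPool.getMaximumPoolSize()) {
            System.out.println("Thread pool queue is full; waiting 2 seconds before adding more tasks");
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting
                Thread.currentThread().interrupt();
                return;
            }
        }
	}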
	
	/**
	 * Get the number of tasks waiting in the thread pool's queue.
	 * LinkSpider
	 * @param queue
	 *         the queue that holds the pool's pending tasks
	 * @return
	 */
	private synchronized int getQueueSize(Queue<?> queue) {
        return queue.size();
    }
	
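	/**
	 * Takes a link address and calls Python to get the list of all links found under it,
	 * then stores that list in the database.
	 */
	class SpiderRunner implements Runnable {
	    private String address;
	    private SpiderQueue auxiliaryQueue; // URLs parsed out of the page being visited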
	    
	    private int index;
	    private int parentLevel;
	    
	    public SpiderRunner(String address, int parentLevel, int index) {
	        this.index = index;
	        this.address = address;
	        this.parentLevel = parentLevel;
        }
	    
        public void run() {
            auxiliaryQueue = SpiderBLL.getAddressQueue(address, parentLevel);
            System.out.println("[" + index + "]: " + address);
            DBBLL.insert2Unvisited(auxiliaryQueue, index);
            auxiliaryQueue = null;
        }
    }
}

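In the ErgodicNetworkLink method above, you can see that we now store the data in the database instead of in a Queue. The benefit is that we no longer have to worry about OOM. The code also uses a thread pool, so the calls into Python that fetch the link lists run on multiple threads.

For hashing the URL, see the following key code: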
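The listing pairs DiscardOldestPolicy with a polling loop (poolQueueFull) so that tasks are not submitted while the queue is full; without that guard, DiscardOldestPolicy would silently drop the oldest queued task. As a hedged alternative, not the author's method, the sketch below uses `ThreadPoolExecutor.CallerRunsPolicy`, which makes the submitting thread run the task itself when the pool is saturated. The `BoundedPoolSketch` class, the `runAll` method, and the small pool sizes are assumptions chosen for the demonstration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a bounded pool that pushes back on the submitter instead of dropping tasks.
class BoundedPoolSketch {

    // Submits taskCount trivial tasks and returns how many actually ran.
    static int runAll(int taskCount) {
        AtomicInteger completed = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4,                               // core and maximum threads (small demo values)
                3, TimeUnit.SECONDS,                // idle time before extra threads are reclaimed
                new ArrayBlockingQueue<Runnable>(8),
                // When both queue and pool are full, the submitting thread runs the task itself
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < taskCount; i++) {
            pool.execute(completed::incrementAndGet);
        }

        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```

Because the submitter is slowed down whenever the pool is saturated, no separate sleep-and-retry loop is needed and no task is lost. The trade-off is that the producing thread is occupied while it runs a task.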

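	/**
	 * Add a single model to the table of sites waiting to be visited.
	 * DBBLL
	 * @param model
	 */
	public static void insert2Unvisited(WebInfoModel model) {
	    if (model == null) {
            return;
        }
	    
	    // hash_address stores String.hashCode() of the URL as a compact lookup key.
	    // Note: the values are concatenated into the SQL unescaped, so a quote in the
	    // name or address will break the statement.
        String sql = "INSERT INTO unvisited_site(name, address, hash_address, date, visited, level) VALUES('" + model.getName() + "', '" + model.getAddress() + "', " + model.getAddress().hashCode() + ", " + System.currentTimeMillis() + ", 0, " + model.getLevel() + ");";
        DBServer db = null;
        try {
            db = new DBServer();
            db.insert(sql);
        } catch (Exception e) {
            System.out.println("your sql is: " + sql);
            e.printStackTrace();
        } finally {
            // Close exactly once, and only if the connection was actually created
            if (db != null) {
                db.close();
            }
        }
	}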

PythonUtils.java

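This class handles the interaction with Python. The code is as follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class PythonUtils {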

	// Path to the Python script
	private static final String PY_PATH = "/root/python/WebLinkSpider/html_parser.py";
		
	/**
	 * Build the arguments passed to the Python script.
	 * PythonUtils
	 * @param address
	 * 			the network link
	 * @return
	 */
	private static String[] getShellArgs(String address) {
		String[] shellParas = new String[3];
    	shellParas[0] = "python";
    	shellParas[1] = PY_PATH;
    	shellParas[2] = address.replace("\"", "\\\"");
    	
    	return shellParas;
	}
	
	private static WebInfoModel parserWebInfoModel(String info, int parentLevel) {
		if (BEEStringTools.isEmptyString(info)) {
			return null;
		}
		
		String[] infos = info.split("\\$#\\$");
		if (infos.length != 2) {
			return null;
		}
		
		if (BEEStringTools.isEmptyString(infos[0].trim())) {
            return null;
        }
		
		if (BEEStringTools.isEmptyString(infos[1].trim()) || infos[1].trim().equals("http://") || infos[1].trim().equals("https://")) {
            return null;
        }
		
		WebInfoModel model = new WebInfoModel();
		
		model.setName(infos[0].trim());
		model.setAddress(infos[1].trim());
		model.setLevel(parentLevel + 1);
		
		return model;
	}
	
	/**
	 * Call Python to get all valid links found under a given link.
	 * PythonUtils
	 * @param shellParas
	 * 			the arguments passed to the Python script
	 * @return
	 */
	private static SpiderQueue getAddressQueueByPython(String[] shellParas, int parentLevel) {
		if (shellParas == null) {
			return null;
		}
		
		SpiderQueue queue = null;
		Process p = null;
		
    	try {
			p = Runtime.getRuntime().exec(shellParas);
			queue = new SpiderQueue();
			
			// try-with-resources closes the reader even if parsing throws
			try (BufferedReader bfr = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
				String line;
				while ((line = bfr.readLine()) != null) {
				    if (BEEStringTools.isEmptyString(line.trim())) {
	                    continue;
	                }
				    
				    // Stop reading as soon as the script reports an HTTP error status
				    if (HttpBLL.isErrorStateCode(line)) {
	                    break;
	                }
				    
				    WebInfoModel model = parserWebInfoModel(line, parentLevel);
				    if (model == null) {
	                    continue;
	                }
				    
					queue.offer(model);
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// Release the child process and its pipes
			if (p != null) {
				p.destroy();
			}
		}
    	
    	return queue;
	}
	
	/**
	 * Call Python to get all valid links found under a given link.
	 * PythonUtils
	 * @param address
	 * 			the network link
	 * @return
	 */
	public static SpiderQueue getAddressQueueByPython(String address, int parentLevel) {
		return getAddressQueueByPython(getShellArgs(address), parentLevel);
	}
}
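The parsing step in parserWebInfoModel splits each line printed by the Python script on the separator `$#$`. The sketch below isolates that step. The `LineFormatSketch` class and its `parse` method are hypothetical, and the exact shape of a line ("name$#$address") is inferred from the split call in the listing, not from the Python script itself.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: split one "name$#$address" line as read from the Python process.
class LineFormatSketch {

    // Pattern.quote treats "$#$" literally; an unescaped '$' would mean "end of input" in a regex.
    private static final Pattern SEPARATOR = Pattern.compile(Pattern.quote("$#$"));

    // Returns {name, address}, or null when the line does not hold exactly two non-empty fields.
    static String[] parse(String line) {
        if (line == null) {
            return null;
        }
        String[] parts = SEPARATOR.split(line);
        if (parts.length != 2) {
            return null;
        }
        String name = parts[0].trim();
        String address = parts[1].trim();
        if (name.isEmpty() || address.isEmpty()) {
            return null;
        }
        return new String[] { name, address };
    }
}
```

`Pattern.quote` is equivalent to the hand-escaped `"\\$#\\$"` used in the listing, but it avoids having to remember which characters are special in a regular expression.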

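Problems encountered:

1. Please use Python 2.7

HTMLParser in Python 2.6 still has some defects, for example the one shown in the figure below. In Python 2.7 this is no longer an issue.

2. The database crashed

The crash was probably caused by the table of to-be-visited entries growing too large.

3. Synchronizing database operations

The crash above occurred while the database operations were synchronized. Without synchronization, we instead get an exception saying that the number of database connections exceeds the maximum. I hope to solve this in the next article.

I don't know whether you have any doubts about the approach above. I hope you have at least one: why synchronize the database operations at all? Once they are synchronized, the multithreading is wasted effort, because the work is effectively serialized. I did it because I initially thought database operations had to be synchronized: the database is a shared resource and needs mutually exclusive access (if you have studied operating systems, these concepts will be familiar). In practice the program was still single-threaded. The fix is to stop synchronizing the database operations. The resulting problem of too many database connections will be addressed in the next article.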
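One possible middle ground between "synchronize everything" and "no limit at all" is to cap the number of simultaneous database operations with a `java.util.concurrent.Semaphore`. This is only a sketch under assumptions and not necessarily the solution used in the follow-up article: the `DbLimiterSketch` class and the `peakConcurrency` method are hypothetical, and a 1 ms sleep stands in for a real query.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap concurrent "database" work with a Semaphore instead of one global lock.
class DbLimiterSketch {

    // Runs taskCount fake DB operations on `threads` workers; returns the peak concurrency observed.
    static int peakConcurrency(int threads, int permits, int taskCount) {
        Semaphore slots = new Semaphore(permits);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> {
                try {
                    slots.acquire();                  // wait for a free "connection"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max); // remember the highest value seen
                    Thread.sleep(1);                  // stands in for the real query
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                    slots.release();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

With, say, 16 worker threads and 4 permits, at most 4 operations touch the database at once, so the work still runs in parallel while the connection count stays bounded.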
