crawler4j Source Code Analysis (5): The Robots Protocol

阿新 • Published: 2019-01-08

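This part looks at how crawler4j supports the robots protocol. The main purpose of robots support is polite crawling: crawl according to the rules the server publishes, fetching only what is allowed and leaving alone what is not.

In crawler4j, robots support is spread across the following classes: RobotstxtConfig, RobotstxtServer, HostDirectives, RuleSet, and RobotstxtParser.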

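First, a brief description of what each class does:

RobotstxtConfig: holds the robots-related configuration, such as whether robots handling is enabled, the cache size, and the user-agent.

RobotstxtServer: the main robots class. It caches robots entries up to the configured cache size and exposes the robots functionality to the rest of the crawler; it is covered in detail below.

HostDirectives: the robots record of a single host; it implements the actual robots rules.

RuleSet: stores the individual robots entries, i.e. the allow and disallow paths.

RobotstxtParser: parses a fetched robots file into individual robots entries.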

The analysis below covers three aspects:

access to robots records, the cache update strategy, and how records are stored.

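Accessing robots records:

The robots module exposes a single entry point, RobotstxtServer.allows(WebURL webURL). It is called in two places: when a seed URL is added to the Frontier, and when newly discovered URLs are added to the Frontier. This function is where the robots cache is accessed. Inside RobotstxtServer, robots records are stored per host: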

protected final Map<String, HostDirectives> host2directivesCache = new HashMap<>();

Let's walk through the implementation of allows, starting with the source:

public boolean allows(WebURL webURL) {
		if (!config.isEnabled()) {
			return true;
		}
		try {
			URL url = new URL(webURL.getURL());
			String host = getHost(url);
			String path = url.getPath();

			HostDirectives directives = host2directivesCache.get(host);

			if (directives != null && directives.needsRefetch()) {
				synchronized (host2directivesCache) {
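					// Note: a double check inside the lock would be safer here. host2directivesCache is a plain
					// HashMap shared by all crawler threads, and the get() above runs outside the lock. Removing
					// an already-removed host is harmless (HashMap.remove returns null for a missing key), but
					// the unsynchronized read can race with concurrent put/remove calls.
					host2directivesCache.remove(host);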
					directives = null;
				}
			}

			if (directives == null) {
				directives = fetchDirectives(url);
			}
			return directives.allows(path); // robots rules are paths, so the argument is a path as well
		} catch (MalformedURLException e) {
			e.printStackTrace();
		}
		return true;
	}

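The code is compact. It extracts the host from the given URL and looks up the HostDirectives for that host. If an entry exists but has expired, it is removed from the cache so that it gets fetched again; if there is no entry (either it never existed or it was just removed), fetchDirectives is called to obtain the robots record. Finally, HostDirectives.allows decides whether the path may be accessed. Cache refresh and retrieval are analyzed later under cache management; here we focus on what HostDirectives.allows does.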

	public boolean allows(String path) {
		timeLastAccessed = System.currentTimeMillis();
        return !disallows.containsPrefixOf(path) || allows.containsPrefixOf(path);
    }

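HostDirectives keeps the allow entries and the disallow entries in one RuleSet each. A path is considered accessible if no disallow entry is a prefix of it, or if some allow entry is a prefix of it; otherwise access is forbidden.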
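
The decision rule above can be tried in isolation. The following is a minimal sketch, not crawler4j code: AllowDecisionDemo, anyPrefixOf and isAllowed are hypothetical names, and plain lists with startsWith stand in for the two RuleSets. Only the boolean expression is taken from HostDirectives.allows.

```java
import java.util.Arrays;
import java.util.List;

class AllowDecisionDemo {
    // Hypothetical helper: true if any rule is a prefix of the path.
    static boolean anyPrefixOf(List<String> rules, String path) {
        for (String rule : rules) {
            if (path.startsWith(rule)) {
                return true;
            }
        }
        return false;
    }

    // Same boolean expression as HostDirectives.allows:
    // allowed unless a disallow rule matches and no allow rule matches.
    static boolean isAllowed(List<String> disallowRules, List<String> allowRules, String path) {
        return !anyPrefixOf(disallowRules, path) || anyPrefixOf(allowRules, path);
    }

    public static void main(String[] args) {
        List<String> disallowRules = Arrays.asList("/private");
        List<String> allowRules = Arrays.asList("/private/docs");

        System.out.println(isAllowed(disallowRules, allowRules, "/index.html"));           // true
        System.out.println(isAllowed(disallowRules, allowRules, "/private/secret.html"));  // false
        System.out.println(isAllowed(disallowRules, allowRules, "/private/docs/a.html"));  // true
    }
}
```

Note that with this expression an allow rule always wins over a disallow rule that also matches, regardless of which of the two is longer.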

Lookups in disallows and allows are implemented with a TreeSet; RuleSet is in fact a thin wrapper around TreeSet:

public class RuleSet extends TreeSet<String>

	public boolean containsPrefixOf(String s) {
		SortedSet<String> sub = headSet(s);
		// because redundant prefixes have been eliminated,
		// only a test against last item in headSet is necessary
		if (!sub.isEmpty() && s.startsWith(sub.last())) {
			return true; // prefix substring exists
		} 
		// might still exist exactly (headSet does not contain boundary):
		// headSet excludes the argument itself, so an exact match is checked separately
		return contains(s); 
	}

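Because TreeSet stores strings in ascending lexicographic order by default, a prefix always sorts before the longer paths that start with it. The comments in the code above cover the rest, so no further explanation is needed here.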
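
For readers who want to see the headSet lookup at work, here is a small sketch under stated assumptions: PrefixLookupDemo is a hypothetical class, the lookup is written as a static helper over a plain TreeSet rather than as a RuleSet method, and the rule set is assumed to contain no entry that is a prefix of another (which is what RuleSet.add guarantees).

```java
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixLookupDemo {
    // Same logic as RuleSet.containsPrefixOf, applied to a plain TreeSet.
    static boolean containsPrefixOf(TreeSet<String> rules, String s) {
        // All rules strictly smaller than s; a prefix of s, if any, is the last of them.
        SortedSet<String> sub = rules.headSet(s);
        if (!sub.isEmpty() && s.startsWith(sub.last())) {
            return true;
        }
        // headSet excludes s itself, so test for an exact match.
        return rules.contains(s);
    }

    public static void main(String[] args) {
        TreeSet<String> rules = new TreeSet<>();
        rules.add("/admin");
        rules.add("/tmp");

        System.out.println(containsPrefixOf(rules, "/admin/users")); // true: "/admin" is a prefix
        System.out.println(containsPrefixOf(rules, "/tmp"));         // true: exact match
        System.out.println(containsPrefixOf(rules, "/public"));      // false
    }
}
```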

Cache update strategy

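Cache maintenance consists of expiration handling and a replacement policy. For expiration, each cache entry (the robots record of one host) records the time it was fetched and the time it was last accessed. On every access, the time elapsed since the fetch is compared against the configured expiration delay; if it is longer, the entry is removed and fetched again. See the code: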

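	// If we fetched the directives for this host more than
	// 24 hours, we have to re-fetch it.
	// Note: as quoted, the constant below is 24 * 60 * 1000 ms = 1,440,000 ms, i.e. 24 minutes;
	// 24 hours would be 24 * 60 * 60 * 1000L.
	private static final long EXPIRATION_DELAY = 24 * 60 * 1000L;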
    public boolean needsRefetch() {
        return (System.currentTimeMillis() - timeFetched > EXPIRATION_DELAY);
    }

    public boolean allows(String path) {
        timeLastAccessed = System.currentTimeMillis();
        return !disallows.containsPrefixOf(path) || allows.containsPrefixOf(path);
}
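
To make the units of the expiration delay concrete, the sketch below compares the constant as quoted with a 24-hour delay built from TimeUnit, and applies the same comparison as needsRefetch. It is not crawler4j code: ExpiryDemo, its constant names and the explicit nowMs parameter (used instead of System.currentTimeMillis() so the check is deterministic) are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;

class ExpiryDemo {
    // The expression as quoted in the article.
    static final long QUOTED_DELAY_MS = 24 * 60 * 1000L;
    // A delay of 24 hours, expressed without manual unit arithmetic.
    static final long ONE_DAY_MS = TimeUnit.HOURS.toMillis(24);

    // Same comparison as HostDirectives.needsRefetch, with the clock passed in.
    static boolean needsRefetch(long nowMs, long timeFetchedMs, long delayMs) {
        return nowMs - timeFetchedMs > delayMs;
    }

    public static void main(String[] args) {
        System.out.println(QUOTED_DELAY_MS);                                  // 1440000
        System.out.println(TimeUnit.MILLISECONDS.toMinutes(QUOTED_DELAY_MS)); // 24
        System.out.println(ONE_DAY_MS);                                       // 86400000

        long fetched = 0L;
        long oneHourLater = TimeUnit.HOURS.toMillis(1);
        System.out.println(needsRefetch(oneHourLater, fetched, QUOTED_DELAY_MS)); // true
        System.out.println(needsRefetch(oneHourLater, fetched, ONE_DAY_MS));      // false
    }
}
```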

RobotstxtServer.allows checks for expiration on every access:

if (directives != null && directives.needsRefetch()) {
				synchronized (host2directivesCache) {
					host2directivesCache.remove(host);
					directives = null;
				}
			}
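
In the snippet above, the lookup that produced directives happens outside the synchronized block. One possible way to avoid that unsynchronized read is to perform the lookup, the expiry check and the refill under the same lock. The following is only a sketch of that idea, not what crawler4j does: GuardedCache, Entry, fetch and fetchCount are hypothetical stand-ins, and the current time is passed in explicitly.

```java
import java.util.HashMap;
import java.util.Map;

class GuardedCache {
    // Hypothetical stand-in for HostDirectives: only remembers when it was fetched.
    static class Entry {
        final long timeFetched;
        Entry(long timeFetched) { this.timeFetched = timeFetched; }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long delayMs;
    int fetchCount = 0; // counts simulated fetches, for demonstration only

    GuardedCache(long delayMs) { this.delayMs = delayMs; }

    // Hypothetical stand-in for fetching and parsing robots.txt.
    private Entry fetch(String host, long nowMs) {
        fetchCount++;
        return new Entry(nowMs);
    }

    // Lookup, expiry check and refill all happen while holding the lock,
    // so no thread ever reads the HashMap while another one modifies it.
    Entry getFresh(String host, long nowMs) {
        synchronized (cache) {
            Entry entry = cache.get(host);
            if (entry != null && nowMs - entry.timeFetched > delayMs) {
                cache.remove(host);
                entry = null;
            }
            if (entry == null) {
                entry = fetch(host, nowMs);
                cache.put(host, entry);
            }
            return entry;
        }
    }

    public static void main(String[] args) {
        GuardedCache cache = new GuardedCache(1000L);
        cache.getFresh("example.com", 0L);    // miss: fetched
        cache.getFresh("example.com", 500L);  // hit: still fresh
        cache.getFresh("example.com", 2000L); // expired: fetched again
        System.out.println(cache.fetchCount); // 2
    }
}
```

The trade-off is that the lock is held while fetching, so a slow robots.txt download for one host would block lookups for all other hosts; a real implementation would have to weigh that against the simpler reasoning.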

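Now for the replacement policy:

When the cache reaches its configured size, the “oldest” entry in the cache is evicted to make room. Here “oldest” means least recently accessed: the code picks the entry with the smallest last-access time, not the one that was created earliest.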

		synchronized (host2directivesCache) {
			if (host2directivesCache.size() == config.getCacheSize()) {
				String minHost = null;
				long minAccessTime = Long.MAX_VALUE;
				for (Entry<String, HostDirectives> entry : host2directivesCache.entrySet()) {
					if (entry.getValue().getLastAccessTime() < minAccessTime) {
						minAccessTime = entry.getValue().getLastAccessTime();
						minHost = entry.getKey();
					}
				}
				host2directivesCache.remove(minHost);
			}
			host2directivesCache.put(host, directives);
		}
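
The loop above scans every entry to find the least recently accessed one. For comparison, the JDK's LinkedHashMap can maintain access order itself and evict the eldest entry automatically. The sketch below shows that alternative; it is not how crawler4j implements its cache, LruCacheDemo is a hypothetical class, and unlike the code above it is not synchronized.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCacheDemo<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheDemo(int maxEntries) {
        // accessOrder = true: iteration order goes from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheDemo<String, String> cache = new LruCacheDemo<>(2);
        cache.put("a.example", "rules for a");
        cache.put("b.example", "rules for b");
        cache.get("a.example");                // a.example becomes the most recently accessed
        cache.put("c.example", "rules for c"); // evicts b.example
        System.out.println(cache.keySet());    // [a.example, c.example]
    }
}
```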

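Storing the records

Each host's robots record is stored in a HostDirectives, which holds two RuleSets (subclasses of TreeSet): the allows set and the disallows set. Every robots entry is a path, and under the robots protocol a prefix path covers every path that extends it. By String's natural ordering, a prefix compares smaller than any longer path that starts with it, so it comes first in the TreeSet. As a result, among paths that share a common prefix only the shortest one needs to be stored, which makes TreeSet a natural fit for robots entries. Here we look at how a new entry is added to the set: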

	@Override
	public boolean add(String str) {
		// if a prefix of str is already present, there is nothing to add
		SortedSet<String> sub = headSet(str);
		if (!sub.isEmpty() && str.startsWith(sub.last())) {
			// no need to add; prefix is already present
			return false;
		}
		boolean retVal = super.add(str);
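		// without the appended "\0", tailSet would also return str itself; by the prefix rule,
		// every existing entry that has str as a prefix is redundant and must be removed
		sub = tailSet(str + "\0");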
		while (!sub.isEmpty() && sub.first().startsWith(str)) {
			// remove redundant entries
			sub.remove(sub.first());
		}
		return retVal;
	}

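This add method overrides the one inherited from TreeSet. The inherited method would keep both /a/b/c and /a/b/c/d, whereas under the robots rules storing /a/b/c alone is enough, hence the override.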
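
The effect of the override can be observed with a few insertions. The sketch below repeats the add logic shown above in a stand-alone class; PrefixFreeSet is a hypothetical name chosen so the example compiles without crawler4j on the classpath.

```java
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixFreeSet extends TreeSet<String> {
    @Override
    public boolean add(String str) {
        // A prefix of str, if present, is the largest element below str.
        SortedSet<String> sub = headSet(str);
        if (!sub.isEmpty() && str.startsWith(sub.last())) {
            return false; // already covered by a shorter rule
        }
        boolean added = super.add(str);
        // Entries extending str sort directly after it; drop them as redundant.
        sub = tailSet(str + "\0");
        while (!sub.isEmpty() && sub.first().startsWith(str)) {
            sub.remove(sub.first());
        }
        return added;
    }

    public static void main(String[] args) {
        PrefixFreeSet rules = new PrefixFreeSet();
        System.out.println(rules.add("/a/b/c/d")); // true
        System.out.println(rules.add("/a/b/x"));   // true
        System.out.println(rules.add("/a/b/c"));   // true, and removes "/a/b/c/d"
        System.out.println(rules.add("/a/b/c/e")); // false, covered by "/a/b/c"
        System.out.println(rules);                 // [/a/b/c, /a/b/x]
    }
}
```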
