From API to DSL: Using Kotlin Features to Further Wrap a Crawler Framework

Kotlin API · Published 2018-09-24 16:21:43

NetDiscovery is a crawler framework built on Vert.x, RxJava 2, and other frameworks.

1. How to Build a DSL

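A domain-specific language (DSL) is a computer language specialized for one particular application domain. A DSL simplifies programming and improves productivity, and it also makes it possible for domain experts who are not programmers to describe logic directly.

NetDiscovery already provides an API that covers many features. Its DSL module exists to give users one more option.

The DSL discussed in this article is an internal DSL.

Internal DSL: a particular way of using a general-purpose language. A script written in an internal DSL is a valid program in the host language, but it has a distinctive style, uses only a subset of the language's features, and addresses one small aspect of the overall system.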

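NetDiscovery's DSL is written mainly with Kotlin language features such as lambdas with receivers, operator overloading, and infix functions.

Operator overloading and infix notation exist in many languages, so this article focuses on lambdas with receivers.

Before introducing Kotlin's lambdas with receivers, let's look at function types with receivers.

A function type with receiver looks like A.(B) -> C, where A is the receiver type, B is the parameter type, and C is the return type.

For example: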

val sum: Int.(Int) -> Int = {
    this + it
}

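sum has a function type with receiver, and it is used much like an extension function. Inside the function body, this refers to the receiver object that the call was made on.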
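As a small sketch of what "used like an extension function" means, the same value can be invoked in extension style or with the receiver passed as the first argument. The `main` function here is only for illustration.

```kotlin
// A function type with receiver: Int is the receiver (`this`),
// and the single Int parameter is available as `it`.
val sum: Int.(Int) -> Int = { this + it }

fun main() {
    println(1.sum(2))          // extension-style call, receiver on the left
    println(sum(1, 2))         // receiver passed as the first argument
    println(sum.invoke(1, 2))  // explicit invoke, same result
}
```

All three calls run the same lambda. The extension-style form is the one that makes DSL blocks read naturally.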

Typical examples of lambdas with receivers are the Kotlin standard library functions with and apply.

Here is the source code of apply:

public inline fun <T> T.apply(block: T.() -> Unit): T {
    contract {
        callsInPlace(block, InvocationKind.EXACTLY_ONCE)
    }
    block()
    return this
}

In apply, the parameter block has a function type with receiver.

To show how apply is used, first define a User class:

class User {

    var name: String? = null
    var password: String? = null

    override fun toString(): String {
        return "name:$name,password=$password"
    }
}

Then use apply to assign values to the properties of User:

fun main(args: Array<String>) {

    val user = User().apply {

        name = "Tony"
        password = "123456"
    }

    println(user)
}
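For comparison, here is a minimal sketch of with next to apply. The `Account` class and the `describe` helper are hypothetical names used only for this illustration. The difference that matters for DSL design is the return value: apply returns the receiver, while with takes the receiver as an ordinary argument and returns the result of the lambda.

```kotlin
// Hypothetical class for illustration; it mirrors the User class above.
class Account {
    var name: String? = null
    var password: String? = null
}

// with: the receiver is passed as an argument, the lambda result is returned.
fun describe(account: Account): String =
    with(account) { "name:$name,password=$password" }

fun main() {
    // apply: configures the object and returns the object itself.
    val account = Account().apply {
        name = "Tony"
        password = "123456"
    }
    println(describe(account))
}
```

Because apply hands back the configured object, it is the natural building block for the builder functions in the following sections.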

2. DSL Wrapper for Request

Request wraps a crawler's network request: url, userAgent, httpMethod, header, proxy, and so on. It also covers work to be done before and after the request is sent, similar to AOP.

Here is a Request written with the DSL:

val request = request {

    url = "https://www.baidu.com/"

    httpMethod = HttpMethod.GET

    spiderName = "tony"

    header {

        "111" to "2222"
        "333" to "44444"
    }

    extras {

        "tt" to "qqq"
    }
}

Spider.create().name("tony").request(request).pipeline(DebugPipeline()).run()

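As you can see, once Request is wrapped in a DSL, it becomes simple and easy to read.

The code below is the actual implementation. It relies mainly on lambdas with receivers and infix functions.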

package com.cv4j.netdiscovery.dsl

import com.cv4j.netdiscovery.core.domain.Request
import io.vertx.core.http.HttpMethod

/**
 * Created by tony on 2018/9/18.
 */
class RequestWrapper {

    private val headerContext = HeaderContext()
    private val extrasContext = ExtrasContext()

    var url: String? = null

    var spiderName: String? = null

    var httpMethod: HttpMethod = HttpMethod.GET

    fun header(init: HeaderContext.() -> Unit) {

        headerContext.init()
    }

    fun extras(init: ExtrasContext.() -> Unit) {

        extrasContext.init()
    }

    internal fun getHeaderContext() = headerContext

    internal fun getExtrasContext() = extrasContext
}

class HeaderContext {

    private val map: MutableMap<String, String> = mutableMapOf()

    infix fun String.to(v: String) {
        map[this] = v
    }

    internal fun forEach(action: (k: String, v: String) -> Unit) = map.forEach(action)
}

class ExtrasContext {

    private val map: MutableMap<String, Any> = mutableMapOf()

    infix fun String.to(v: Any) {
        map[this] = v
    }

    internal fun forEach(action: (k: String, v: Any) -> Unit) = map.forEach(action)
}

fun request(init: RequestWrapper.() -> Unit): Request {

    val wrap = RequestWrapper()

    wrap.init()

    return configRequest(wrap)
}

private fun configRequest(wrap: RequestWrapper): Request {

    val request = Request(wrap.url).spiderName(wrap.spiderName).httpMethod(wrap.httpMethod)

    wrap.getHeaderContext().forEach { k, v ->

        request.header(k, v)
    }

    wrap.getExtrasContext().forEach { k, v ->

        request.putExtra(k, v)
    }

    return request
}
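The infix to inside HeaderContext deserves a closer look, because it has the same name as the standard library's to that builds a Pair. The following self-contained sketch isolates that trick. `Headers` and `headers` are hypothetical names, not part of NetDiscovery.

```kotlin
// Hypothetical stand-in for HeaderContext, reduced to the essential trick.
class Headers {
    private val map = mutableMapOf<String, String>()

    // Member extension: when a Headers instance is the implicit receiver,
    // this `to` is chosen over the standard library's Pair-building `to`.
    infix fun String.to(value: String) {
        map[this] = value
    }

    fun toMap(): Map<String, String> = map.toMap()
}

// Builder function: runs the block against a fresh Headers instance.
fun headers(init: Headers.() -> Unit): Map<String, String> =
    Headers().apply(init).toMap()

fun main() {
    val result = headers {
        "User-Agent" to "demo"   // stored in the map, no Pair is created
        "Accept" to "text/html"
    }
    println(result)
}
```

One caveat of this design: if the value does not match the member's parameter type (for example an Int inside this String-only block), the call falls back to the standard library's to, which builds a Pair that is silently discarded. Giving the function a different name, such as a hypothetical `setTo`, would avoid that at the cost of a less natural syntax.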

3. DSL Wrapper for SpiderEngine

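SpiderEngine manages the spiders inside the engine, including their life cycle.

The example below creates a SpiderEngine and adds two spiders (Spider) to it. One of the spiders requests a web page on a fixed schedule.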

val spiderEngine = spiderEngine {

    port = 7070

    addSpider {

        name = "tony1"
    }

    addSpider {

        name = "tony2"
        urls = listOf("https://www.baidu.com")
    }
}

val spider = spiderEngine.getSpider("tony1")

spider.repeatRequest(10000, "https://github.com/fengzhizi715")
    .initialDelay(10000)

spiderEngine.runWithRepeat()
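The article does not show how the spiderEngine block is implemented. The following is only a sketch of how a nested builder of this shape can be written, using the same lambda-with-receiver pattern as the Request wrapper. Every type and function here (`SpiderConfig`, `EngineConfig`, `engine`, the default port) is a hypothetical stand-in and not the NetDiscovery implementation.

```kotlin
// Hypothetical stand-ins; these are NOT the NetDiscovery classes.
class SpiderConfig {
    var name: String = ""
    var urls: List<String> = emptyList()
}

class EngineConfig {
    var port: Int = 8080  // assumed default, for illustration only
    private val spiders = mutableListOf<SpiderConfig>()

    // Nested builder: each addSpider block configures its own SpiderConfig.
    fun addSpider(init: SpiderConfig.() -> Unit) {
        spiders += SpiderConfig().apply(init)
    }

    fun getSpider(name: String): SpiderConfig? =
        spiders.firstOrNull { it.name == name }

    val spiderCount: Int
        get() = spiders.size
}

// Entry point of the sketch DSL.
fun engine(init: EngineConfig.() -> Unit): EngineConfig =
    EngineConfig().apply(init)

fun main() {
    val e = engine {
        port = 7070
        addSpider { name = "tony1" }
        addSpider {
            name = "tony2"
            urls = listOf("https://www.baidu.com")
        }
    }
    println("port=${e.port}, spiders=${e.spiderCount}")
}
```

Inside an addSpider block the innermost receiver is the SpiderConfig, so name resolves to the spider's property, while port in the outer block resolves to the engine's.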

4. DSL Wrapper for the Selenium Module

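In my earlier article "Building a Selenium Module and a DSL Module for a Crawler Framework (Kotlin Implementation)", I used NetDiscovery's Selenium module for the following example: search JD.com for my new book 《RxJava 2.x 實戰》, sort the results by sales volume, and then collect the information for the top ten products.

This time, the same feature is implemented with the DSL: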

spider {

    name = "jd"

    urls = listOf("https://search.jd.com/")

    downloader = seleniumDownloader {

        path = "example/chromedriver"
        browser = Browser.CHROME

        addAction {
            action = BrowserAction()
        }

        addAction {
            action = SearchAction()
        }

        addAction {
            action = SortAction()
        }
    }

    parser = PriceParser()

    pipelines = listOf(PricePipeline())

}.run()

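The main work here is wrapping SeleniumDownloader. The Selenium module can work with several browsers, and Downloader is the crawler framework's download component, which performs the actual network requests. The DSL therefore needs to wrap the browser to use, the path to the browser driver, and the individual simulated browser actions (Action).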

package com.cv4j.netdiscovery.dsl

import com.cv4j.netdiscovery.selenium.Browser
import com.cv4j.netdiscovery.selenium.action.SeleniumAction
import com.cv4j.netdiscovery.selenium.downloader.SeleniumDownloader
import com.cv4j.netdiscovery.selenium.pool.WebDriverPool
import com.cv4j.netdiscovery.selenium.pool.WebDriverPoolConfig

/**
 * Created by tony on 2018/9/14.
 */
class SeleniumWrapper {

    var path: String? = null

    var browser: Browser? = null

    private val actions = mutableListOf<SeleniumAction>()

    fun addAction(block: ActionWrapper.() -> Unit) {

        val actionWrapper = ActionWrapper()
        actionWrapper.block()

        // actionWrapper is never null, so only action needs the safe call
        actionWrapper.action?.let {
            actions.add(it)
        }
    }

    internal fun getActions() = actions
}

class ActionWrapper {

    var action: SeleniumAction? = null
}

fun seleniumDownloader(init: SeleniumWrapper.() -> Unit): SeleniumDownloader {

    val wrap = SeleniumWrapper()

    wrap.init()

    return configSeleniumDownloader(wrap)
}

private fun configSeleniumDownloader(wrap: SeleniumWrapper): SeleniumDownloader {

    val config = WebDriverPoolConfig(wrap.path, wrap.browser)
    WebDriverPool.init(config)

    return SeleniumDownloader(wrap.getActions())
}

In addition, a few commonly used extension functions have been added to WebDriver. For example:

fun WebDriver.elementByXpath(xpath: String, init: WebElement.() -> Unit) = findElement(By.xpath(xpath)).init()

The benefit is that operations on a WebElement become simpler. Take the BrowserAction below, which opens the browser and types in a keyword:

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

/**
 * Created by tony on 2018/6/12.
 */
public class BrowserAction extends SeleniumAction {

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String searchText = "RxJava 2.x 實戰";
            String searchInput = "//*[@id=\"keyword\"]";
            WebElement userInput = Utils.getWebElementByXpath(driver, searchInput);
            userInput.sendKeys(searchText);
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}

With the WebDriver extension function, the code above is equivalent to the following:

package com.cv4j.netdiscovery.example.jd

import com.cv4j.netdiscovery.dsl.elementByXpath
import com.cv4j.netdiscovery.selenium.action.SeleniumAction
import org.openqa.selenium.WebDriver

/**
 * Created by tony on 2018/9/23.
 */
class BrowserAction2 : SeleniumAction() {

    override fun perform(driver: WebDriver): SeleniumAction? {

        try {
            val searchText = "RxJava 2.x 實戰"
            val searchInput = "//*[@id=\"keyword\"]"
            driver.elementByXpath(searchInput) {

                this.sendKeys(searchText)
            }

            Thread.sleep(3000)
        } catch (e: InterruptedException) {
            e.printStackTrace()
        }

        return null
    }
}
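To try the shape of elementByXpath without a browser, here is a sketch that uses stand-in types instead of Selenium. `Field`, `Page`, `fieldByXpath`, and `configuredField` are all hypothetical names. The first extension mirrors elementByXpath and returns Unit. The second is a possible variant that returns the element so the caller can keep using it.

```kotlin
// Hypothetical stand-ins for WebElement and WebDriver; no Selenium involved.
class Field {
    val typed = StringBuilder()
    fun sendKeys(text: String) {
        typed.append(text)
    }
}

class Page(private val fields: Map<String, Field>) {
    fun find(xpath: String): Field =
        fields[xpath] ?: throw NoSuchElementException("no element for $xpath")
}

// Same shape as elementByXpath: look the element up, run the block on it.
// The result type is Unit because the block returns Unit.
fun Page.fieldByXpath(xpath: String, init: Field.() -> Unit) = find(xpath).init()

// Variant: apply runs the block and then returns the element itself.
fun Page.configuredField(xpath: String, init: Field.() -> Unit): Field =
    find(xpath).apply(init)

fun main() {
    val xpath = "//*[@id=\"keyword\"]"
    val page = Page(mapOf(xpath to Field()))

    page.fieldByXpath(xpath) { sendKeys("RxJava") }
    val field = page.configuredField(xpath) { sendKeys(" 2.x") }

    println(field.typed)
}
```

Which variant fits better depends on whether callers need the element afterwards. For fire-and-forget actions such as typing a keyword, the Unit-returning form is enough.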

5. Summary

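The crawler framework on GitHub: https://github.com/fengzhizi715/NetDiscovery

In many cases, the DSL used here is a further layer of wrapping around chained calls. Some people will prefer chained calls and others will prefer a DSL. Having moved from API to DSL, though, I clearly prefer the DSL style.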
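To make the comparison concrete, here is a small sketch in which the same object can be configured with chained calls or with a DSL block. The `Job` class and the `job` function are hypothetical and exist only for this illustration.

```kotlin
// Hypothetical class that supports both styles.
class Job {
    var name: String = ""
    val steps = mutableListOf<String>()

    // Chained style: each call returns the same Job so calls can be strung together.
    fun name(value: String): Job = apply { name = value }
    fun step(value: String): Job = apply { steps += value }
}

// DSL style: a thin wrapper that runs a lambda with receiver on a fresh Job.
fun job(init: Job.() -> Unit): Job = Job().apply(init)

fun main() {
    val chained = Job().name("jd").step("search").step("sort")

    val viaDsl = job {
        name = "jd"
        step("search")
        step("sort")
    }

    println(chained.steps == viaDsl.steps)
}
```

Both forms build an equivalent object. The DSL adds one small builder function on top of the chained API, which is why the two styles can coexist in the same framework.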