
JavaWEB Study Notes -- Scraping Web Page Data with HtmlUnit


Tags (space-separated): java

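I have been using free SS (Shadowsocks) accounts, but they expire after a while and I have to change the password by hand each time. As a programmer, I decided to automate the whole thing.

HtmlUnit is an open-source Java tool for page analysis: once it has loaded a page, you can use it to inspect the page's content. It simulates a running browser and is often described as an open-source Java implementation of a browser. Its biggest advantage is that it executes JavaScript, so you can read the results produced by Ajax calls.

1. Preparing to scrape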


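Analysis: clicking Surge opens a modal dialog, and the modal shows the link for the configuration. No request is sent during this step, so the link and password are generated directly by JavaScript. The backend therefore has to simulate a click on Surge, wait for the JavaScript to run, and then read the content of the corresponding DOM element.

(After the click, a script first sets the modal content to a "fetching" message and only then writes the generated result into the modal. The code must therefore wait for the JavaScript after clicking, otherwise it will not read the correct result.)

Corresponding DOM element: <div class="modal-body" id="watext">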
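The "click, then wait until the modal holds the real result" step can be sketched as a small polling helper. This is only an illustration under assumptions: the class name `Poller` and the method `waitUntil` are hypothetical and not part of HtmlUnit, and the helper uses nothing but the JDK.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper (not part of HtmlUnit): polls a value until it is ready. */
final class Poller {

    private Poller() {
    }

    /**
     * Calls read repeatedly until ready accepts the value or timeoutMillis elapses.
     * Returns the accepted value, or null if the deadline passed or the thread was interrupted.
     */
    static <T> T waitUntil(Supplier<T> read, Predicate<T> ready,
                           long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            T value = read.get();
            if (value != null && ready.test(value)) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null; // gave up: the value never became ready in time
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
    }
}
```

With HtmlUnit, `read` could return the text content of the `watext` element and `ready` could reject the "fetching" placeholder. The exact placeholder text depends on the target page, so it would have to be checked there.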

Maven dependencies:

        <dependency>
            <groupId>net.sourceforge.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>2.23</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.14</version>
        </dependency>

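2. Configuring WebClient

WebClient is HtmlUnit's built-in browser; think of it as a browser without a graphical display. It needs a few options set. The waitForBackgroundJavaScript() call is especially important: without it, the code may read the page before the JavaScript has finished running, and so fail to get the correct result.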

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

/**
 * @author Niu Li
 * @date 2016/10/8
 */
public enum  WebClientUtil {

    INSTANCE;

    public WebClient webClient;

    WebClientUtil() {
        webClient = new WebClient(BrowserVersion.CHROME);
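        webClient.getOptions().setUseInsecureSSL(true); // accept any HTTPS certificate without validation
        webClient.getOptions().setJavaScriptEnabled(true); // enable the JS engine (true by default)
        webClient.getOptions().setCssEnabled(false); // disable CSS support
        webClient.getOptions().setThrowExceptionOnScriptError(false); // do not throw when a script fails
        webClient.getOptions().setTimeout(10000); // connection timeout: 10 s; 0 means wait indefinitely
        webClient.getOptions().setDoNotTrackEnabled(false);
        webClient.setJavaScriptTimeout(8000); // maximum time a script may run, in ms
        // Blocks for up to 500 ms while background JS jobs finish. This is a one-off wait, not a setting:
        // no page is loaded yet, so it returns at once here and must be called again after a page load or click.
        webClient.waitForBackgroundJavaScript(500);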
    }
}

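3. Scraping

The approach: load the whole page, collect all the a tags (Surge is really just an a tag), iterate over them to find the one whose text is Surge, simulate a click on it, read the resulting page, parse the result, build the SS configuration file gui-config.json, and write it to the target path.

Entity classes corresponding to gui-config.json: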

import java.util.List;

public class SSModel {

    /**
     * configs : [{""}]
     * index : 8
     * random : false
     * global : false
     * enabled : true
     * shareOverLan : false
     * isDefault : false
     * localPort : 1080
     * pacUrl : null
     * useOnlinePac : false
     * reconnectTimes : 0
     * randomAlgorithm : 0
     * TTL : 0
     * proxyEnable : false
     * proxyType : 0
     * proxyHost : null
     * proxyPort : 0
     * proxyAuthUser : null
     * proxyAuthPass : null
     * authUser : null
     * authPass : null
     * autoban : false
     */

    private int index = 0;
    private boolean random = false;
    private boolean global = false;
    private boolean enabled = true;
    private boolean shareOverLan = false;
    private boolean isDefault = false;
    private int localPort = 1080;
    private String pacUrl;
    private boolean useOnlinePac = false;
    private int reconnectTimes = 0;
    private int randomAlgorithm = 0;
    private int TTL = 0;
    private boolean proxyEnable = false;
    private int proxyType = 0;
    private String proxyHost;
    private int proxyPort = 0;
    private String proxyAuthUser = "";
    private String proxyAuthPass = "";
    private String authUser = "";
    private String authPass = "";
    private boolean autoban = false;

    private List<ConfigsBean> configs;
    // getters and setters omitted
}


public class ConfigsBean {
        private String remarks;
        private String server;
        private int server_port;
        private String password;
        private String method;
        private String obfs;
        private String obfsparam = "";
        private String remarks_base64 = "";
        private boolean tcp_over_udp = false;
        private boolean udp_over_tcp = false;
        private String protocol = "origin";
        private boolean obfs_udp = false;
        private boolean enable = true;
        private String id;
        // getters and setters omitted
}

The scraping method:

package cn.mrdear.core;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

import cn.mrdear.model.ConfigsBean;
import cn.mrdear.util.ModelUtil;

/**
 * @author Niu Li
 * @date 2016/10/8
 */
public class MianVpn {

    private static final String HOME_PAGE = "https://www.mianvpn.com";

    public List<ConfigsBean> fetch(WebClient webClient) throws IOException {
        // Load the whole page
        final HtmlPage page = webClient.getPage(HOME_PAGE);
        // Collect every <a> element
        DomNodeList<DomElement> domNodeList = page.getElementsByTagName("a");

        List<ConfigsBean> results = domNodeList.stream()
                // Keep only the <a> elements whose text is "Surge"
                .filter(domElement -> "Surge".equals(domElement.getTextContent()))
                // Simulate a click and extract the result
                .map(domElement -> {
                    try {
                        HtmlPage tempPage = domElement.click();
                        // Wait after the click so the JS has time to fill in the modal
                        webClient.waitForBackgroundJavaScript(500);
                        // If the element is still missing, sleep the thread briefly and read again
                        DomElement surgeUrl = tempPage.getElementById("surge_url");
                        if (surgeUrl != null) {
                            String href = surgeUrl.getAttribute("href");
                            System.out.println(href);
                            // Convert the link into a ConfigsBean
                            return parseUrl(href);
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    return null;
                })
                // Drop the null results
                .filter(configsBean -> configsBean != null)
                // Collect into a list
                .collect(Collectors.toList());
        return results;
    }

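    /**
     * Parses the scraped link, for example:
     * https://user.mianvpn.com/api/ss/surge/?host=47.88.188.62&port=10001&method=rc4-md5&pw=9575
     * Assumes the parameters always appear in the order host, port, method, pw.
     */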
    private ConfigsBean parseUrl(String url) {
        String paramStr = url.substring(url.indexOf('?')+1);
        String[] paramArr = paramStr.split("&");

        String host = paramArr[0].substring(paramArr[0].indexOf('=')+1);
        Integer port = Integer.parseInt(paramArr[1].substring(paramArr[1].indexOf('=')+1));
        String method = paramArr[2].substring(paramArr[2].indexOf('=')+1);
        String pwd = paramArr[3].substring(paramArr[3].indexOf('=')+1);

        ConfigsBean configsBean = new ConfigsBean();
        configsBean.setRemarks(host);
        configsBean.setServer(host);
        configsBean.setServer_port(port);
        configsBean.setMethod(method);
        configsBean.setPassword(pwd);
        configsBean.setObfs("http_simple");
        configsBean.setId(ModelUtil.generateId());
        return configsBean;
    }
}
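Reading the parameters by position breaks as soon as the site reorders them or adds a new one. A more tolerant variant, sketched here with only the JDK, parses the query string into a map keyed by parameter name. The class name `QueryParams` is hypothetical and not part of the original project.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: turns the query string of a URL into a name-to-value map. */
final class QueryParams {

    private QueryParams() {
    }

    /** Returns the decoded query parameters in order of appearance; empty if there is no query. */
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

parseUrl could then read `params.get("host")`, `params.get("port")`, `params.get("method")` and `params.get("pw")`, and treat a missing key as a failed scrape instead of hitting an index error.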

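The method above returns a list, so a separate main method calls it. That way you can write several scraping methods and combine their results at the end.

Main method:
Both reading and writing the file use fastjson.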

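import com.alibaba.fastjson.JSON;
import com.gargoylesoftware.htmlunit.WebClient;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

import cn.mrdear.model.ConfigsBean;

public class Main {

    private static final String SS_PATH = "D:\\tools\\翻牆\\gui-config.json";

    public static void main(String[] args) {
        try (final WebClient webClient = WebClientUtil.INSTANCE.webClient) {
            MianVpn mianVpn = new MianVpn();
            List<ConfigsBean> mianVpns = mianVpn.fetch(webClient);
            for (ConfigsBean vpn : mianVpns) {
                System.out.println(vpn);
            }
            // Read the existing config first. Opening a FileOutputStream on the same
            // path truncates the file, so it must not be opened until reading is done.
            SSModel model;
            try (InputStream inputStream = new FileInputStream(new File(SS_PATH))) {
                model = JSON.parseObject(inputStream, null, SSModel.class);
            }
            if (model == null) {
                model = new SSModel();
            }
            // Replace the configs section with the freshly scraped servers
            model.setConfigs(mianVpns);
            try (OutputStream outputStream = new FileOutputStream(new File(SS_PATH))) {
                JSON.writeJSONString(outputStream, model);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}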
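One pitfall when a program reads a file and then writes the same file: an output stream opened on the path empties it immediately, and a crash halfway through writing leaves a broken config behind. A possible safeguard, sketched with `java.nio.file` only, is to write to a temporary file in the same directory and then move it over the original. The class name `SafeRewrite` and its method are hypothetical. The transform is a plain string function here, whereas in this project it would be the fastjson parse and serialize step.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

/** Hypothetical helper: rewrites a text file without truncating it before it has been read. */
final class SafeRewrite {

    private SafeRewrite() {
    }

    /** Reads the file (empty string if it does not exist), transforms it, then replaces it. */
    static void rewrite(Path file, UnaryOperator<String> transform) throws IOException {
        String oldText = Files.exists(file)
                ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                : "";
        String newText = transform.apply(oldText);
        // Write to a temp file next to the target so the final step is a single move.
        Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "gui-config", ".tmp");
        Files.write(tmp, newText.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The original file stays intact until the new content has been fully written, so a failed scrape or serialization error cannot wipe the existing configuration.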

Scrape result: [screenshot not preserved]

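You can also scrape accounts and passwords from other sites and call those scrapers from the same main method.

4. Using a bat script

The project is packaged as a jar, which has to be run every time the password expires. A script can take over this chore: a bat script that runs java -jar XX.jar is enough.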

@echo off
color 1f
cls
echo.
echo 1 Fetch accounts
echo.
echo 2 Exit
echo.
SET t=
SET /P t=Choose 1/2:
IF /I '%t:~0,1%'=='1' GOTO start
IF /I '%t:~0,1%'=='2' GOTO stop
exit

:start
echo Fetching, please wait
java -jar E://jar/mrdear-1.0.jar
start D:\tools\翻牆\ShadowsocksR-dotnet4.0.exe
goto end

:stop
echo Exiting, please wait
goto end

:end
exit

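5. Other problems encountered

At first, the dependency jars were not bundled into the package that Maven produced, and the jar could not find its main entry point when run. After some searching, it turned out that an extra plugin is needed to make it runnable.

The plugin bundles the dependencies into the jar, and its ManifestResourceTransformer writes the main class into MANIFEST.MF.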

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <!-- Configure the class that holds the main method here. -->
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>cn.mrdear.core.Main</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
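To see what "writing the main class into MANIFEST.MF" amounts to, the following sketch builds a manifest with a Main-Class entry and reads it back using the JDK's `java.util.jar` classes. It only illustrates the manifest format and is not how the shade plugin is implemented; the class name `ManifestDemo` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical demo: creates and reads the Main-Class entry of a jar manifest. */
final class ManifestDemo {

    private ManifestDemo() {
    }

    /** Serializes a manifest whose Main-Class attribute is the given class name. */
    static byte[] buildManifest(String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the main attributes are not written.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        return out.toByteArray();
    }

    /** Returns the Main-Class value of a serialized manifest, or null if it is absent. */
    static String readMainClass(byte[] manifestBytes) throws IOException {
        Manifest manifest = new Manifest(new ByteArrayInputStream(manifestBytes));
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }
}
```

If java -jar still reports that no main class can be found after packaging, the manifest of the built jar can be checked the same way by opening it with `java.util.jar.JarFile` and calling `getManifest()`.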

6. Source code location