Web Scraping with Java

阿新 • Published: 2018-08-06


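As with many modern technologies, there are several frameworks to choose from for extracting information from websites. The most popular are JSoup, HTMLUnit, and Selenium WebDriver. This article covers JSoup. JSoup is an open-source project that provides a powerful API for data extraction. It can parse HTML from a given URL, file, or string, and it can also manipulate HTML elements and attributes.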

        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>

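import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is illustrative; the original snippet showed only main()
public class JSoupStringExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Website title</title></head><body><p>Sample paragraph number 1 </p><p>Sample paragraph number 2</p></body></html>";
        // Parse the HTML string into a Document
        Document doc = Jsoup.parse(html);
        // Print the page title
        System.out.println(doc.title());
        // Collect every <p> element and print its text
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.text());
        }
    }
}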

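Calling the parse() method parses the input HTML into a Document object. Calling methods on that object lets you manipulate and extract data.

In the example above, we first print the page title. We then fetch all elements with the tag "p" and print the text of each paragraph in turn.

Running this code produces the following output: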

Website title

Sample paragraph number 1

Sample paragraph number 2
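
The example above only reads data, but the Document can also be modified. The following is a minimal sketch of that, not part of the original article: the class name ManipulationSketch, the method markFirstParagraph, and the attribute and text values are all made up for illustration. It uses select(), first(), attr(), and text() from the JSoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class and method names, used only for this sketch
class ManipulationSketch {

    // Parses the HTML, edits the first <p> element, and returns it.
    static Element markFirstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("p").first(); // null if there is no <p>
        if (first == null) {
            return null;
        }
        first.attr("class", "intro");   // set (or overwrite) an attribute
        first.text("Edited paragraph"); // replace the element's text
        return first;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Sample paragraph number 1</p>"
                + "<p>Sample paragraph number 2</p></body></html>";
        Element edited = markFirstParagraph(html);
        // Print the modified element as HTML
        System.out.println(edited.outerHtml());
    }
}
```

The changes affect only the in-memory Document; the original string is left untouched.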

Parsing a URL with JSoup

Parsing a URL works a little differently from parsing a string, but the basic principle is the same:

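import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {
    public static void main(String[] args) throws IOException {
        // Fetch the page and parse it into a Document
        Document doc = Jsoup.connect("https://www.wikipedia.org").get();
        // Select every element with the class "other-project"
        Elements titles = doc.getElementsByClass("other-project");
        for (Element title : titles) {
            System.out.println(title.text());
        }
    }
}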

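To scrape data from a URL, call the connect() method with the URL as its argument, then use get() to retrieve the HTML from the connection. The output of this example is: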

Commons Freely usable photos & more

Wikivoyage Free travel guide

Wiktionary Free dictionary

Wikibooks Free textbooks

Wikinews Free news source

Wikidata Free knowledge base

Wikiversity Free course materials

Wikiquote Free quote compendium

MediaWiki Free & open wiki application

Wikisource Free library

Wikispecies Free species directory

Meta-Wiki Community coordination & documentation

As you can see, the program scraped every element whose class is other-project.
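
The object returned by connect() can be configured before get() sends the request. The sketch below is an assumption-laden illustration rather than something from the original article: the class name ConfiguredFetchSketch, the method prepare, the User-Agent string, and the 10-second timeout are all made-up values. It relies on the userAgent() and timeout() methods of JSoup's Connection interface.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class and method names, used only for this sketch
class ConfiguredFetchSketch {

    // Builds a connection without sending anything; the request is only
    // made when get() is called on the returned object.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("example-scraper/0.1") // hypothetical User-Agent value
                .timeout(10_000);                 // milliseconds, illustrative
    }

    public static void main(String[] args) {
        try {
            Document doc = prepare("https://www.wikipedia.org").get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // Covers network failures, timeouts, and HTTP error statuses
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Catching IOException here, instead of declaring it on main(), lets the program report a failed fetch without a stack trace.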

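The select() method accepts a CSS selector. The following method uses it to print every link on the page:

public void allLinksInUrl() throws IOException {
    Document doc = Jsoup.connect("https://www.wikipedia.org").get();
    // "a[href]" matches every <a> element that has an href attribute
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
}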

The result is a very long list.

Parsing a File with JSoup
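
The href values printed by the link-listing method are returned as written in the page, so some of them may be relative. As a minimal sketch (the class name AbsoluteLinksSketch, the method absoluteLinks, the inline HTML, and the base URI https://example.org/ are all made up for illustration), absUrl() resolves each link against the document's base URI:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class and method names, used only for this sketch
class AbsoluteLinksSketch {

    // Returns the absolute URL of every <a href> in the HTML,
    // resolving relative links against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/wiki/Java\">Java</a>"
                + "<a href=\"https://www.wikipedia.org/\">Wikipedia</a>"
                + "<a name=\"top\">no href, skipped</a>";
        // Prints https://example.org/wiki/Java, then https://www.wikipedia.org/
        for (String url : absoluteLinks(html, "https://example.org/")) {
            System.out.println(url);
        }
    }
}
```

When a page is fetched with connect().get(), the base URI is taken from the page's URL, so absUrl() works there without passing a base explicitly.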

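// Requires java.io.File, java.io.IOException, java.net.URISyntaxException, java.net.URL
public void parseFile() throws URISyntaxException, IOException {
    // Locate page.html on the classpath (getSystemResource returns null if it is missing)
    URL path = ClassLoader.getSystemResource("page.html");
    File inputFile = new File(path.toURI());
    // Parse the file, reading it as UTF-8
    Document document = Jsoup.parse(inputFile, "UTF-8");
    System.out.println(document.title());
    //parse document in any way
}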

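Parsing a file requires no request to a website, so you need not worry about putting too much load on a server. Although this approach has many limitations and the data is static, which makes it unsuitable for many tasks, it offers a more legitimate and less intrusive way to analyze data.

The resulting document can be parsed in any of the ways described earlier.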
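
The file example above depends on a page.html resource being present on the classpath. For trying file parsing without that setup, here is a self-contained sketch under stated assumptions: the class name LocalFileSketch, the method titleFromTempFile, and the sample HTML are hypothetical, and the file is a temporary one created just for the demonstration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class and method names, used only for this sketch
class LocalFileSketch {

    // Writes the HTML to a temporary file and returns the title JSoup reads back.
    static String titleFromTempFile(String html) throws IOException {
        Path tmp = Files.createTempFile("page", ".html");
        try {
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            // Same call as in the article: parse a File with an explicit charset
            Document doc = Jsoup.parse(tmp.toFile(), "UTF-8");
            return doc.title();
        } finally {
            Files.deleteIfExists(tmp); // clean up the temporary file
        }
    }

    public static void main(String[] args) throws IOException {
        // Prints: Saved page
        System.out.println(titleFromTempFile(
                "<html><head><title>Saved page</title></head><body></body></html>"));
    }
}
```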
