Scraping JS-rendered pages in Java with HtmlUnit

Requirement:

Some websites render their pages with JavaScript, and we need to scrape those pages.

Implementation:

Based on HtmlUnit:

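    import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public static void getAjaxPage() throws Exception {
        WebClient webClient = new WebClient();
        try {
            webClient.setJavaScriptEnabled(true);   // run the page's scripts
            webClient.setCssEnabled(false);         // CSS is not needed for scraping
            // Make AJAX calls finish before the page is handed back
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.setTimeout(Integer.MAX_VALUE);
            webClient.setThrowExceptionOnScriptError(false); // tolerate script errors
            HtmlPage rootPage = webClient.getPage("http://tt.mop.com/read_14304066_1_0.html");
            System.out.println(rootPage.asXml());   // print the rendered DOM
        } finally {
            webClient.closeAllWindows();            // release browser windows and threads
        }
    }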
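
The method above only prints the rendered markup. As a rough sketch of a next step, the snippet below pulls link targets out of such a string with the JDK's built-in XML and XPath classes. It assumes the markup is well-formed XML, which may not hold for every page. The class name `RenderedPageLinks`, the method `extractLinks`, and the sample markup are hypothetical, chosen only for illustration. HtmlUnit's own `getByXPath` on the page object is an alternative that skips the string round trip.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RenderedPageLinks {

    // Returns the href value of every <a> element in a well-formed XML string.
    // Assumption: the input parses as XML; malformed input is rejected.
    public static List<String> extractLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                links.add(nodes.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by rootPage.asXml().
        String sample = "<html><body><div id=\"list\">"
                + "<a href=\"/read_1.html\">first</a>"
                + "<a href=\"/read_2.html\">second</a>"
                + "</div></body></html>";
        System.out.println(extractLinks(sample)); // prints [/read_1.html, /read_2.html]
    }
}
```

In `getAjaxPage()`, the call would look like `RenderedPageLinks.extractLinks(rootPage.asXml())` in place of the `println`.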

Maven dependencies:

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit-core-js</artifactId>
        <version>2.9</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.9</version>
        <scope>compile</scope>
    </dependency>

Notes:

Nutch plugin: nutch-htmlunit can be used to replace Nutch's built-in HTTP fetch component.
