[Building a Vertical Search Engine, Part 10] Filter Practice in HtmlParser

Types of Filter:

Predicate filters:

TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter

Logical-operation filters:

AndFilter
NotFilter
OrFilter
XorFilter
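
To make the idea behind the logical filters concrete, here is a minimal sketch in plain Java that does not use HtmlParser at all. The names LogicFilterSketch, Tag and Filter are hypothetical stand-ins invented for illustration, and the helpers take exactly two operands, which is a simplification. It only models the behaviour: each filter answers "does this node match?", and the logical filters combine those answers.

```java
public class LogicFilterSketch {

    /** Hypothetical stand-in for a parsed tag; only the tag name is modelled. */
    public static final class Tag {
        final String name;

        public Tag(String name) {
            this.name = name;
        }
    }

    /** A filter answers one question: does this node match? */
    public interface Filter {
        boolean accept(Tag tag);
    }

    /** Matches tags by name, ignoring case. */
    public static Filter tagName(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    /** Matches only when both operands match. */
    public static Filter and(Filter a, Filter b) {
        return tag -> a.accept(tag) && b.accept(tag);
    }

    /** Matches when at least one operand matches. */
    public static Filter or(Filter a, Filter b) {
        return tag -> a.accept(tag) || b.accept(tag);
    }

    /** Inverts the operand. */
    public static Filter not(Filter a) {
        return tag -> !a.accept(tag);
    }

    /** Matches when exactly one of the two operands matches. */
    public static Filter xor(Filter a, Filter b) {
        return tag -> a.accept(tag) ^ b.accept(tag);
    }

    public static void main(String[] args) {
        Tag link = new Tag("a");
        Filter isA = tagName("a");
        Filter isP = tagName("p");

        System.out.println("or:  " + or(isA, isP).accept(link));
        System.out.println("and: " + and(isA, isP).accept(link));
        System.out.println("not: " + not(isA).accept(link));
        System.out.println("xor: " + xor(isA, isP).accept(link));
    }
}
```

Because every combinator returns another Filter, combinations can be nested freely, which is the same way AndFilter is used in the scenarios of this article.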

Other filters:

NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter

This post covers TagNameFilter, HasChildFilter and HasAttributeFilter, and how to combine them. First, a TagNameFilter on its own:

package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {

    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at the given URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Keep only <p> tags.
        NodeFilter filter = new TagNameFilter("p");
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first match, or an empty string if nothing matched.
        Node source = nodes.elementAt(0);
        String sou = "";
        if (source != null) {
            sou = source.toString();
        }
        System.out.println(sou);
    }
}

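Scenario 1:
Suppose you want to grab the links on a page that contain an image. How do you do it? It is simple: use a TagNameFilter for the link, a HasChildFilter for the image it contains, and join the two with an AndFilter. The code is as follows: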

package org.algorithm;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import org.htmlparser.Node;


public class FilterImg {


    public static void main(String[] args) throws ParserException {
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        NodeFilter filter = new AndFilter(new TagNameFilter ("a"),new HasChildFilter (new TagNameFilter ("img")));
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        Node source = nodes.elementAt(0);
        String sou = "";
        if(source!=null){
            sou = source.toString();
        }
        System.out.println(sou);
    }
}

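Scenario 2:
For page markup of the form <div class="f"><li class="m">, how do you grab the content inside? This is not hard either. Again three filters do the job: TagNameFilter, HasAttributeFilter and AndFilter. The code below applies the same pattern to <p> tags that carry a title attribute: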

package org.algorithm;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import org.htmlparser.Node;


public class FilterImg {


    public static void main(String[] args) throws ParserException {
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        NodeFilter filter = new AndFilter(new TagNameFilter("p"),new HasAttributeFilter("title"));
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        Node source = nodes.elementAt(0);
        String sou = "";
        if(source!=null){
            sou = source.toString();
        }
        System.out.println(sou);
    }
}
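
The listing above matches <p> tags with a title attribute rather than the <div class="f"><li class="m"> markup from the question. As a rough sketch of how the same three filters could target that markup directly, the example below parses an inline HTML string instead of a live page. The class name FilterByClass, the method extract and the sample HTML are assumptions made for illustration. It relies on the two-argument HasAttributeFilter(attribute, value) constructor and on HasParentFilter from the list at the top. As far as I know, HasAttributeFilter compares the whole attribute value, so class="f g" would not match "f".

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasParentFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterByClass {

    /** Returns the text of every <li class="m"> whose direct parent is <div class="f">. */
    static List<String> extract(String html) throws ParserException {
        // Parse an in-memory string rather than fetching a URL.
        Parser parser = Parser.createParser(html, "UTF-8");

        // <div class="f">
        NodeFilter divF = new AndFilter(
                new TagNameFilter("div"), new HasAttributeFilter("class", "f"));
        // <li class="m">
        NodeFilter liM = new AndFilter(
                new TagNameFilter("li"), new HasAttributeFilter("class", "m"));
        // <li class="m"> directly inside <div class="f">
        NodeFilter filter = new AndFilter(liM, new HasParentFilter(divF));

        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.size(); i++) {
            texts.add(nodes.elementAt(i).toPlainTextString().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws ParserException {
        // Hypothetical sample markup, not taken from the page used above.
        String html = "<div class=\"f\"><li class=\"m\">hello</li></div>"
                + "<div class=\"g\"><li class=\"m\">skipped</li></div>";
        System.out.println(extract(html));
    }
}
```

Looping over nodes.size() instead of reading only elementAt(0) collects every match, which is usually what a crawler needs.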