Fixing TagName always being uppercase after NekoHTML parses HTML into XML

阿新 • Published: 2019-01-20

Problem:

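When parsing HTML with NekoHTML in Java, I found that NekoHTML always converts tag names to uppercase, which broke the XPath expressions I had written earlier. I could run a script to convert all the existing XPaths, but if a new operations colleague does not know about this, it could still cause unnecessary trouble.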
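
To see why uppercased names break existing expressions, here is a minimal sketch using only the JDK's XML and XPath APIs. It does not use NekoHTML: the uppercase XML string is a hand-written stand-in for the tree the parser produced, and `XPathCaseDemo` and `countMatches` are names made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathCaseDemo {

    // Counts the nodes that the XPath expression selects in the given XML text.
    static int countMatches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for a tree whose element names were uppercased.
        String upper = "<HTML><BODY><DIV>a</DIV><DIV>b</DIV></BODY></HTML>";
        System.out.println(countMatches(upper, "//div")); // prints 0
        System.out.println(countMatches(upper, "//DIV")); // prints 2
    }
}
```

XPath name tests are case-sensitive on an XML DOM, so an expression written for lowercase tags silently selects nothing once the names are uppercased.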

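Analysis:

After some searching online, I found that my own blog had covered this too; I just had not paid attention before. NekoHTML provides a number of configuration options that give precise control over its behavior. The one relevant to this problem is: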

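// import org.cyberneko.html.parsers.DOMParser;
// import org.w3c.dom.Document;

DOMParser parser = new DOMParser();
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
// Parse the HTML document
parser.parse("http://www.baidu.com");
// Get the parsed DOM tree
Document document = parser.getDocument();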

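After setting it, it turned out to have no effect. To make things worse, the NekoHTML website was unreachable, whether because it was blocked or for some other reason. Fortunately, I later found a mirror on GitHub and located the documentation, which says: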

Why are the DOM element names always uppercase?

The HTML DOM specification explicitly states that element and attribute names follow the semantics, including case-sensitivity, specified in the HTML 4 specification. In addition, section 1.2.1 of the HTML 4.01 specification states:

Element names are written in uppercase letters (e.g., BODY). Attribute names are written in lowercase letters (e.g., lang, onsubmit).

The Xerces HTML DOM implementation (used by default in the NekoHTML DOMParser class) follows this convention. Therefore, even if the "http://cyberneko.org/html/properties/names/elems" property is set to "lower", the DOM will still uppercase the element names.

To get around this problem, instantiate a Xerces2 DOMParser object using the NekoHTML parser configuration. By default, the Xerces DOM parser class creates a standard XML DOM tree, not an HTML DOM tree. Therefore, the element and attribute names will follow the settings for the "http://cyberneko.org/html/properties/names/elems" and "http://cyberneko.org/html/properties/names/attrs" properties. However, realize that the application will not be able to cast the document nodes to the HTML DOM interfaces for accessing the document's information.

The following sample code shows how to instantiate a DOM parser using the NekoHTML parser configuration:

// import org.apache.xerces.parsers.DOMParser;
// import org.cyberneko.html.HTMLConfiguration;

DOMParser parser = new DOMParser(new HTMLConfiguration());

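In short, to follow the HTML 4.01 convention, the HTML DOM built by NekoHTML's DOMParser uppercases tag names whether or not the option above is set. The fix is to use org.apache.xerces.parsers.DOMParser in place of the original DOMParser. See the solution below for the actual code.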

Solution:

Here is the code:

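        // import org.cyberneko.html.HTMLConfiguration;
        // import org.w3c.dom.Document;
        // import org.xml.sax.InputSource;

        HTMLConfiguration htmlConfiguration = new HTMLConfiguration();
        // "match" leaves element names in the case used by the source HTML
        htmlConfiguration.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
        org.apache.xerces.parsers.DOMParser parser = new org.apache.xerces.parsers.DOMParser(htmlConfiguration);
        InputSource inputSource = new InputSource("http://www.baidu.com");
        parser.parse(inputSource);
        // Print the effective setting to confirm it was applied
        System.out.println(parser.getXMLParserConfiguration().getProperty("http://cyberneko.org/html/properties/names/elems"));
        // Get the parsed DOM tree
        Document document = parser.getDocument();
        // XMLDocument is not part of NekoHTML; it is presumably com.jcabi.xml.XMLDocument (jcabi-xml)
        String xml = new XMLDocument(document).toString();
        System.out.println(xml);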
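
If `XMLDocument` is not on the classpath (it appears to come from a separate library rather than from NekoHTML), the tree can also be serialized with the JDK's own `Transformer`. The following is only a sketch under that assumption: `DomToXml`, `toXml`, and `sampleDocument` are hypothetical names, and the small hand-built document stands in for the one returned by `parser.getDocument()`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToXml {

    // Serializes a DOM document to an XML string with the JDK identity transformer.
    static String toXml(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Request XML output explicitly; a root element named "html"
            // may otherwise make the serializer switch to HTML output.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    // Builds a tiny document by hand; a stand-in for parser.getDocument().
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.setTextContent("hi");
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

The serializer writes element names exactly as they are stored in the DOM, so the output is also a quick way to check whether the names came out lowercase.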

The relevant dependency in pom.xml:

        <dependency>
            <groupId>net.sourceforge.nekohtml</groupId>
            <artifactId>nekohtml</artifactId>
            <version>1.9.22</version>
        </dependency>
