Analyzing the HTML of Baidu Search Results

阿新 • Published: 2018-03-16


Purpose:

To extract every web page listed in the search results so that they can be processed later.

Analysis of the Baidu request URL

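Name	Value	Description
wd	Any text	Search keyword
rn	Optional. Defaults to 10; minimum 1, maximum 50; any value in that range can be used	Number of results per page
pn	Baidu shows 760 results by default, so with 10 results per page the last page is pn=750	Index of the first result on the page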
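To make the paging rule concrete, here is a minimal sketch that lists the pn value for each page. The function names are hypothetical, and the limits (rn between 1 and 50, 760 results in total) are simply the values observed above, not documented guarantees.

```cpp
#include <algorithm>
#include <vector>

// Clamp rn to the range reported in the table above (1..50).
int clampResultsPerPage(int rn)
{
    return std::min(50, std::max(1, rn));
}

// Return the pn value of the first result on each page.
// totalResults = 760 mirrors the observation above; it is an assumption,
// not a limit documented by Baidu.
std::vector<int> pageOffsets(int rn, int totalResults = 760)
{
    std::vector<int> offsets;
    const int step = clampResultsPerPage(rn);
    for (int pn = 0; pn < totalResults; pn += step)
        offsets.push_back(pn);
    return offsets;
}
```

With the default of 10 results per page, the last offset produced is 750, which matches the table.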

Example:

https://www.baidu.com/s?wd=老虎&pn=10&rn=3

Keyword: 老虎 (tiger); first result at index 10; 3 results per page. The URL therefore opens the fourth page of results for the keyword 老虎.
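The URL above can be assembled programmatically. The sketch below uses only the C++ standard library and hypothetical helper names; in a Qt program, QUrl and QUrlQuery would be the more natural choice. It assumes the keyword is UTF-8 encoded and that pn counts from 0, which is consistent with the example landing on the fourth page.

```cpp
#include <cstdio>
#include <string>

// Percent-encode every byte that is not an RFC 3986 unreserved character.
std::string percentEncode(const std::string &text)
{
    std::string out;
    for (unsigned char c : text) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            char buf[4];  // '%' + two hex digits + terminating NUL
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Build a search URL from the three parameters described in the table.
std::string buildSearchUrl(const std::string &keyword, int pn, int rn)
{
    return "https://www.baidu.com/s?wd=" + percentEncode(keyword) +
           "&pn=" + std::to_string(pn) + "&rn=" + std::to_string(rn);
}

// 1-based page number that contains result index pn (pn assumed 0-based).
int pageNumber(int pn, int rn)
{
    return pn / rn + 1;
}
```

For the example, pageNumber(10, 3) evaluates to 4, and the keyword 老虎 is sent as its percent-encoded UTF-8 bytes.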

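Analysis of the HTML source

The freshly downloaded HTML source is very messy. An online HTML formatter can be used to reformat it so that it is easier to read.

For my purposes, the <script> and <style> elements in the HTML file can be skipped entirely. It is enough to locate the part of the document that holds the search results.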


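Extracting the search results (Qt implementation)

In Qt, parsing the HTML file with either QDomDocument or QXmlStreamReader failed. On analysis, the reason is that both classes are designed for parsing XML, and HTML differs from XML: real-world HTML pages are usually not well-formed XML.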
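One difference is easy to demonstrate: HTML allows elements such as <br> with no end tag, while XML requires every element to be closed. The following is a deliberately simplified sketch of a tag-balance check (a hypothetical helper, not part of Qt or Tidy). It ignores comments, doctype declarations, CDATA sections, and '>' characters inside attribute values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified well-formedness check: every start tag must be closed by a
// matching end tag, in the correct order. Self-closing tags such as <br/>
// are accepted. This is an illustration, not a real XML parser.
bool tagsAreBalanced(const std::string &markup)
{
    std::vector<std::string> open;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return false;                       // unterminated tag
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty())
            return false;                       // "<>" is not a tag
        if (tag.back() == '/')
            continue;                           // self-closing, e.g. <br/>
        const bool closing = tag.front() == '/';
        // The element name ends at the first whitespace character.
        std::string name = tag.substr(closing ? 1 : 0);
        name = name.substr(0, name.find_first_of(" \t\r\n"));
        if (!closing) {
            open.push_back(name);
        } else if (open.empty() || open.back() != name) {
            return false;                       // mismatched end tag
        } else {
            open.pop_back();
        }
    }
    return open.empty();
}
```

Under this check, the common HTML fragment `<p>a<br>b</p>` is rejected, while its XHTML form `<p>a<br/>b</p>` is accepted. This is the kind of input that makes a strict XML parser give up on an ordinary HTML page.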

After some research, the TidyLib library turned out to solve exactly this problem.

Tidy is a console application for Mac OS X, Linux, Windows, UNIX, and more. It corrects and cleans up HTML and XML documents by fixing markup errors and upgrading legacy code to modern standards.

libtidy is a C static and dynamic library that developers can integrate into their applications in order to bring all of Tidy’s power to your favorite tools. libtidy is used today in desktop applications, web servers, and more.

The code that uses TidyLib is as follows:

#include <tidy.h>
#include <tidybuffio.h>   // named buffio.h in older Tidy releases
#include <QByteArray>
#include <QDomDocument>

// Repairs raw HTML with Tidy, converts it to XHTML, and loads the result
// into the QDomDocument member `doc`. Returns true on success.
bool HtmlParse::setDatas(const QByteArray &datas)
{
    bool result = false;
    TidyBuffer output = {0};
    TidyBuffer errbuf = {0};
    int rc = -1;
    Bool ok;

    TidyDoc tdoc = tidyCreate();                        // Initialize "document"

    ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );     // Convert to XHTML
    if ( ok )
         rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
    if ( rc >= 0 )
         rc = tidyParseString( tdoc, datas.data() );    // Parse the input
    if ( rc >= 0 )
         rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
    if ( rc >= 0 )
         rc = tidyRunDiagnostics( tdoc );               // Kvetch
    if ( rc > 1 )                                       // If error, force output.
         rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
    if ( rc >= 0 )
         rc = tidySaveBuffer(tdoc, &output);            // Pretty Print

    if ( rc >= 0 )
    {
        // Pass the length explicitly instead of relying on a terminating NUL.
        const QByteArray xhtml(reinterpret_cast<const char *>(output.bp),
                               static_cast<int>(output.size));
        if (doc.setContent(xhtml))                      // QDomDocument doc;
        {
            result = true;
        }
    }

    // Release Tidy resources on every path.
    tidyBufFree( &output );
    tidyBufFree( &errbuf );
    tidyRelease( tdoc );

    return result;
}
