
Web Crawler: Baidu Baike

Crawling entries from Baidu Baike.

import urllib.request
import re
from bs4 import BeautifulSoup

def main():

    url="http://baike.baidu.com/view/284853.htm"
    req=urllib.request.Request(url)
    response=urllib.request.urlopen(req)
    html=response.read().decode("utf-8")
    soup=BeautifulSoup(html,"html.parser")  # use Python's built-in html.parser
    for each in soup.find_all(href=re.compile("view")):
        # print(each.text, "-->", "http://baike.baidu.com" + each["href"])
        print (each.text,"-->","".join(["http://baike.baidu.com",each["href"]]))
        # join() is used here rather than "+" concatenation because join()
        # has been shown to run considerably faster

if __name__=="__main__":
    main()
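A note on that comment: "".join() mainly pays off when many fragments are combined; for just two strings the difference from "+" is negligible. A minimal timeit sketch (illustrative only; numbers vary by machine) to measure both yourself:

import timeit

setup = 'prefix = "http://baike.baidu.com"; href = "/view/284853.htm"'

# two-string concatenation with "+"
print(timeit.timeit('prefix + href', setup=setup, number=1000000))
# the same result built with "".join()
print(timeit.timeit('"".join([prefix, href])', setup=setup, number=1000000))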

Going deeper: let the user enter any entry

import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup

def main():
    word=input("Enter a keyword to search for: ")
    keyword=urllib.parse.urlencode({"word":word})
    url="https://baike.baidu.com/search/word?%s" %keyword

    req=urllib.request.Request(url)
    response=urllib.request.urlopen(req)
    html=response.read().decode("utf-8")
    soup=BeautifulSoup(html,"html.parser")

    for each in soup.find_all(href=re.compile("view")):
        print (each.text,"-->","".join(["http://baike.baidu.com",each["href"]]))

if __name__ == "__main__":
    main()
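urllib.parse.urlencode percent-encodes the keyword, which is what makes non-ASCII input safe to embed in the URL. A quick look at what it actually produces (the keyword here is just an example):

from urllib.parse import urlencode

print(urlencode({"word": "網路爬蟲"}))
# prints: word=%E7%B6%B2%E8%B7%AF%E7%88%AC%E8%9F%B2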

Going deeper: adding subtitles

The user enters a search keyword; the crawler then visits each entry page, checks whether it has a subtitle, and, if so, prints the subtitle along with the title.

import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup

def main():
    word=input("Enter a keyword to search for: ")
    keyword=urllib.parse.urlencode({"word":word})
    url="https://baike.baidu.com/search/word?%s" %keyword

    req=urllib.request.Request(url)
    response=urllib.request.urlopen(req)
    html=response.read().decode("utf-8")
    soup=BeautifulSoup(html,"html.parser")

    for each in soup.find_all(href=re.compile("view")):
        # print (each.text,"-->","".join(["http://baike.baidu.com",each["href"]]))
        content="".join([each.text])
        url2="".join(["http://baike.baidu.com",each["href"]])
        req2=urllib.request.Request(url2)
        response2=urllib.request.urlopen(req2)
        html2=response2.read().decode("utf-8")
        soup2=BeautifulSoup(html2,"html.parser")
        if soup2.h2:
            content="".join([content,soup2.h2.text])
        content="".join([content,"-->",url2])
        print (content)

if __name__ == "__main__":
    main()
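One caveat with this version: it opens every linked page, so a single dead link or network hiccup raises an unhandled exception and kills the loop. A minimal defensive sketch of the inner fetch (fetch_subtitle is a hypothetical helper, not part of the original code):

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

def fetch_subtitle(url2):
    # hypothetical helper: return the page's <h2> subtitle text,
    # or "" if the request or the decoding fails
    try:
        response2 = urllib.request.urlopen(url2, timeout=10)
        html2 = response2.read().decode("utf-8")
    except (urllib.error.URLError, UnicodeDecodeError):
        return ""
    soup2 = BeautifulSoup(html2, "html.parser")
    return soup2.h2.text if soup2.h2 else ""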

Going deeper: print 10 links at a time

Going a step further, we print the first 10 links and then ask the user whether to keep reading.

import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup


def test_url(soup):
    result=soup.find(text=re.compile("百度百科尚未收錄詞條"))
    # the pattern means "Baidu Baike does not yet have this entry"; it must
    # stay in Chinese because it is matched against the page content
    if result:
        # Baidu appends a stray "“" character to the end of the message,
        # so slice off the last character before printing
        print(result[0:-1])
        return False
    else:
        return True

def summary(soup):

    # word=soup.h1.text
    # # if a subtitle exists, print it along with the title
    # if soup.h2:
    #     word+=soup.h2.text
    # # print the title
    # print (word)
    # # print the summary
    # if soup.find(class_="lemma-summary"):
    #     print(soup.find(class_="lemma-summary").text)

    title_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1")
    title = title_node.get_text()
    if soup.h2:
        title+=soup.h2.text
    # print the title
    print (title)

    # extract the summary based on the structure of the page
    summary_node = soup.find('div', class_="lemma-summary")
    if summary_node is None:
        print ("No summary found")
    else:
        print (summary_node.get_text())


def get_urls(soup):
    for each in soup.find_all(href=re.compile("view")):
        content="".join([each.text])
        url2="".join(["http://baike.baidu.com",each["href"]])
        req2=urllib.request.Request(url2)
        response2=urllib.request.urlopen(req2)
        html2=response2.read().decode("utf-8")
        soup2=BeautifulSoup(html2,"html.parser")
        if soup2.h2:
            content="".join([content,soup2.h2.text])
        content="".join([content,"-->",url2])
        yield content

def main():
    word=input("Enter a keyword to search for: ")
    keyword=urllib.parse.urlencode({"word":word})
    url="https://baike.baidu.com/search/word?%s" %keyword

    req=urllib.request.Request(url)
    response=urllib.request.urlopen(req)
    html=response.read().decode("utf-8")
    soup=BeautifulSoup(html,"html.parser")

    if test_url(soup):
        summary(soup)

        print ("下邊列印相關連結:")
        each=get_urls(soup)

        while True:
            try:
                for i in range(10):
                    print (next(each))
            except StopIteration:
                break

            command=input("Enter any character to keep printing, or q to quit: ")
            if command=="q":
                break
            else:
                continue

if __name__ == "__main__":
    main()

The result: [screenshot of the program's output]
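As a closing aside, the range(10)/next() pagination in main() can also be written with itertools.islice, which pulls a fixed-size page out of a generator. A minimal equivalent sketch (paginate is a hypothetical helper, not from the original post):

from itertools import islice

def paginate(gen, page_size=10):
    # print a generator's items page by page, prompting between pages
    while True:
        page = list(islice(gen, page_size))
        if not page:
            break
        for item in page:
            print(item)
        if input("Enter any character to keep printing, or q to quit: ") == "q":
            break

Calling paginate(get_urls(soup)) would then replace the whole while True block in main().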