A slick scraping package for Python - Beautiful Soup examples

阿新 • Published: 2017-10-10


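First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

If you have time, it is worth reading the package's documentation.

Beautiful Soup has one very important advantage over other HTML parsing approaches: the HTML is broken down into objects, and the whole document can then be handled much like dictionaries and lists.

Compared with scrapers that parse with regular expressions, it spares you the high cost of learning regex.

Compared with XPath-based scraping it likewise saves learning time, although XPath is already fairly simple. (The Scrapy framework uses XPath.)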

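Installation

On Linux (Debian/Ubuntu) you can run

apt-get install python-bs4

(python-bs4 is the Python 2 package; for Python 3 the package is named python3-bs4.)

You can also install it with Python's package tools:

easy_install beautifulsoup4

pip install beautifulsoup4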

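Quick introduction

Here is how to use BeautifulSoup.

Parsing HTML is about extracting data. There are really only a few main points:

1: Get the content of a given tag.

<p>hello, watsy</p><br><p>hello, beautiful soup.</p>

2: Get an attribute of a given tag.

<a href="http://blog.csdn.net/watsy">watsy's blog</a>

3: To get at either of these, you need the search methods.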

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Pretty-printed output.

from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Getting the content of a given tag

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

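The examples above cover four things.

1: Get a tag

soup.title

2: Get the tag's name

soup.title.name

3: Get the content of the title tag

soup.title.string

4: Get the name of the title tag's parent node

soup.title.parent.name

As you can see, the usage is very object-oriented.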

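Extracting tag attributes

Next, how to extract attributes such as href.

soup.p['class']
# ['title']

class is a multi-valued attribute, so its value comes back as a list. In general, an attribute is read with

soup.tag['attribute name']

<a href="http://blog.csdn.net/watsy">watsy's blog</a>

The most common case is probably extracting a link like the one above.

The code is

soup.a['href']

Pretty easy, right?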
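As a small illustration of attribute access, the sketch below compares square-bracket lookup with .get() and .attrs. The markup is hypothetical sample data made up for this example, and html.parser is simply the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = """
<p class="story intro" id="first">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="link2">Lacie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
first_link, second_link = soup.find_all("a")

# Square brackets raise KeyError when the attribute is missing,
# while .get() returns None (or a default you supply).
print(first_link["href"])            # http://example.com/elsie
print(second_link.get("href"))       # None
print(second_link.get("href", "#"))  # #

# .attrs exposes all attributes of a tag as a dictionary.
print(first_link.attrs)  # {'href': 'http://example.com/elsie', 'id': 'link1'}

# class is multi-valued, so it comes back as a list of class names.
print(soup.p["class"])   # ['story', 'intro']
```

Using .get() is usually the safer choice when scraping real pages, where some links may lack an href.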

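Searching and filtering

Now for the important part: searching the whole document and extracting what you find.

soup provides find and find_all for searching. Internally, find is implemented by calling find_all (with a limit of 1), so only find_all is covered here.

def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):

Look at the parameters.

The first is the tag name and the second is the attributes. The third, recursive, controls whether the search descends into all descendants or stops at direct children. text filters on text content, and limit caps the number of results. **kwargs passes attribute filters as keyword arguments. In Beautiful Soup 4.4.0 and later the text argument is called string; text is the older name.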
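The effect of recursive and limit, and the difference between find and find_all, can be sketched as follows. The nested markup is hypothetical and exists only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = """
<div id="outer">
  <p>one</p>
  <div id="inner"><p>two</p></div>
  <p>three</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# recursive=True (the default) searches every descendant.
all_p = [p.get_text() for p in outer.find_all("p")]
print(all_p)      # ['one', 'two', 'three']

# recursive=False only looks at direct children of the tag.
direct_p = [p.get_text() for p in outer.find_all("p", recursive=False)]
print(direct_p)   # ['one', 'three']

# limit stops the search after that many matches.
first_two = [p.get_text() for p in outer.find_all("p", limit=2)]
print(first_two)  # ['one', 'two']

# find() returns the first match itself (not a list), or None if nothing matches.
first_p = outer.find("p")
missing = outer.find("span")
print(first_p.get_text(), missing)  # one None
```

Because find returns None on a miss, check the result before chaining further attribute access on it.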

Examples.

By tag name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

By regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

By list
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

By function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

By tag name and attribute (a string in second position is matched against the CSS class)
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

Filter by tag
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Filter by tag attribute
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Filter text by regular expression (the string argument, called text before 4.4.0)
import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'
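Keyword filters have a couple of wrinkles that the examples above do not show: class is a reserved word in Python, and some attribute names are not valid keyword names. A minimal sketch, using hypothetical markup made up for this example:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="https://example.org/about" class="nav" data-role="menu">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class is reserved in Python, so the keyword is spelled class_.
sisters = [a.get_text() for a in soup.find_all("a", class_="sister")]
print(sisters)  # ['Elsie', 'Lacie']

# A keyword value can also be a compiled regular expression.
secure = [a["href"] for a in soup.find_all(href=re.compile(r"^https://"))]
print(secure)   # ['https://example.org/about']

# Names that are not valid Python keywords (such as data-role)
# go into the attrs dictionary instead.
menu = [a.get_text() for a in soup.find_all(attrs={"data-role": "menu"})]
print(menu)     # ['About']
```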

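Getting content and strings

Getting a tag's string

title_tag = soup.title
title_tag.string
# "The Dormouse's story"

Note that in real use you should convert the result with str(title_tag.string) (unicode(title_tag.string) on Python 2) to get a plain string object.

The strings attribute returns a generator that iterates over all the text content under a tag, here the whole soup. The exact whitespace-only strings depend on the input and the parser.

for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'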
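When the whitespace-only strings are just noise, stripped_strings and get_text are often more convenient than strings. The sketch below uses a hypothetical paragraph made up for this example; it also shows that .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = """<p class="story">
Once upon a time there were
<a href="http://example.com/elsie">Elsie</a> and
<a href="http://example.com/lacie">Lacie</a>.
</p>"""
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings trims each text node and skips whitespace-only ones.
pieces = list(soup.stripped_strings)
print(pieces)
# ['Once upon a time there were', 'Elsie', 'and', 'Lacie', '.']

# get_text() joins all text; a separator and strip=True tidy the result.
text = soup.get_text(" ", strip=True)
print(text)
# Once upon a time there were Elsie and Lacie .

# .string only works when a tag has a single child; otherwise it is None.
print(soup.p.string)  # None
print(soup.a.string)  # Elsie
```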

Getting content: .contents returns the nodes directly under a tag as a list.

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
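The nodes in .contents are a mix of tags and text, so code that walks them usually needs to tell the two apart. A minimal sketch, assuming a hypothetical one-line snippet made up for this example:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, made up for this illustration.
html = "<p>Hello, <b>watsy</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

# .children iterates over the same direct children that .contents lists.
kinds = []
for child in soup.p.children:
    if isinstance(child, Tag):
        kinds.append(("tag", child.name))
    elif isinstance(child, NavigableString):
        kinds.append(("text", str(child)))
print(kinds)
# [('text', 'Hello, '), ('tag', 'b'), ('text', '!')]

# .descendants also walks into nested tags, so the text inside <b> is included.
descendant_count = len(list(soup.p.descendants))
print(descendant_count)  # 4
```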

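That should be about everything; the rest can be learned from the documentation. In summary, day-to-day use mostly comes down to:

soup = BeautifulSoup(data, 'html.parser')
soup.title
soup.p['class']
divs = soup.find_all('div', content='tpc_content')
divs[0].contents[0].string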
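Putting the pieces together, a small end-to-end helper might look like the sketch below. It assumes that tpc_content is a CSS class on the target page, which the snippet above does not confirm; extract_posts and the sample markup are hypothetical names and data made up for this example.

```python
from bs4 import BeautifulSoup


def extract_posts(html):
    """Return the trimmed text of every <div class="tpc_content"> in html.

    Assumption: tpc_content is a CSS class; adjust the filter if the real
    page marks its content blocks differently.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="tpc_content")]


# Hypothetical page fragment, made up for this illustration.
sample = (
    '<div class="tpc_content"> First post </div>'
    '<div class="other">ignored</div>'
    '<div class="tpc_content">Second</div>'
)
posts = extract_posts(sample)
print(posts)  # ['First post', 'Second']
```

Returning a list, which is empty when nothing matches, avoids the IndexError that divs[0] would raise on a page without matching blocks.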
