A First Look: Web Scraping in Python with requests and BeautifulSoup

阿新 • Published: 2018-11-09

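When it comes to web scraping in Python, the lowest-level option is probably urllib, but working with it is somewhat tedious, and pulling content out of the raw HTML usually means writing regular expressions. The requests and BeautifulSoup libraries make the job much easier and are a real blessing for web developers. For anyone with some front-end knowledge, scraping with them is a genuinely pleasant experience.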

requests is a library that wraps HTTP functionality and covers the basic HTTP operations.

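BeautifulSoup provides a convenient way to parse HTML and XML pages. It treats each tag in the HTML as a tree node, so an HTML page can be viewed as a tree structure. In other words, content is extracted by way of the DOM (Document Object Model).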
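
To make the tree idea above concrete, here is a minimal sketch that parses a small hand-written page and walks a few nodes. The HTML string and the variable names are made up for illustration; only `BeautifulSoup`, tag-attribute navigation, `.parent` and `.children` come from the library.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="title">Hello</h1>
    <p class="link"><a href="/a">first</a></p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Each tag is a node; a child tag can be reached as an attribute of its parent.
title_tag = soup.head.title
h1_tag = soup.body.h1

print(title_tag.name, title_tag.text)  # tag name and the text inside it
print(h1_tag.parent.name)              # the node one level up

# Direct children of <body>; whitespace between tags is a text node
# whose .name is None, so it is filtered out here.
child_names = [child.name for child in soup.body.children if child.name]
print(child_names)
```

Navigating by attribute only returns the first matching tag, which is why `select` is usually the more practical tool for real pages.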

Getting the page source:

import requests
res = requests.get('https://www.sina.com.cn/')
res.encoding = 'utf-8'
print(res.text)
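
The `res.encoding = 'utf-8'` line above matters because `res.text` is the response body decoded with `res.encoding`. When a server does not declare a charset, requests may fall back to ISO-8859-1 for text responses, which garbles Chinese pages. The sketch below reproduces that effect with plain byte strings, without any network access; the sample text is an assumption chosen for illustration.

```python
# Bytes as a server might send them for a UTF-8 encoded Chinese page.
raw = "新浪首頁".encode("utf-8")

# Decoding with the wrong charset does not raise an error;
# it silently produces unreadable characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

If the correct charset is not known in advance, `res.apparent_encoding` offers a guess based on the body, though it is a heuristic and may be wrong.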

Extracting the content you need:

# Extract the content you need
from bs4 import BeautifulSoup
html_sample = res.text
soup = BeautifulSoup(html_sample, 'html.parser')
# print(soup.text)  # prints all the text in the page, not only the title
# Use select to find the h1 elements
header = soup.select('h1')
# select returns a list of matching tags; use header[0] to get a single tag,
# or a for loop if there are many
print(header)
print(header[0].text)  # extract the text inside the tag

Using select:

To find all elements whose id is title with select: alink = soup.select('#title')
To find all elements whose class is link with select:
soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print(link)

Example:

Use select to find the href link of every a tag:
alinks = soup.select('a[href]')  # only a tags that actually carry an href
for link in alinks:
    print(link['href'])
Example:
a = '<a href="#" qoo=123 abc=456>I am a link</a>'
soup2 = BeautifulSoup(a, 'html.parser')
alinks = soup2.select('a')[0]
print(alinks['href'])
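
The selector forms shown above (`#id`, `.class`, tag name) can be tried together on a small hand-written snippet. This is only a sketch: the HTML and variable names are invented for illustration. It also shows one way to cope with `a` tags that have no `href`, since indexing a missing attribute raises `KeyError`.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the last <a> deliberately has no href.
html = """
<h1 id="title">Headline</h1>
<a class="link" href="/news/1">First</a>
<a class="link" href="/news/2">Second</a>
<a name="anchor-only">No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
title_text = soup.select_one("#title").text

# Class selector: every element with class="link".
link_texts = [tag.text for tag in soup.select(".link")]

# Attribute selector: only <a> tags that have an href attribute.
hrefs = [tag["href"] for tag in soup.select("a[href]")]

# .get() returns None for a missing attribute instead of raising KeyError.
missing = soup.select("a")[2].get("href")

print(title_text, link_texts, hrefs, missing)
```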

A simple example that scrapes content from Qiushibaike (糗事百科):

import requests
from bs4 import BeautifulSoup
content = requests.get('https://www.qiushibaike.com/').content
soup = BeautifulSoup(content,'html.parser')
story = soup.select('.content span')
for p in story:
    print(p.text)

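The examples above all scrape text content that is present in the document (Doc) response itself. On some pages the content is produced by JavaScript or loaded through XHR requests, and those cases have to be analyzed one by one.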
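
For the XHR case, the data often arrives as JSON rather than HTML, so there is nothing for BeautifulSoup to parse. The sketch below assumes a hypothetical response body (the `items` and `content` field names are invented) and reads it with the standard `json` module. With requests, calling `res.json()` on the response of such an endpoint should give the same kind of result.

```python
import json

# Hypothetical body of an XHR response; the structure and field names
# are assumptions made for this example, not taken from any real site.
payload = (
    '{"items": ['
    '{"id": 1, "content": "first story"}, '
    '{"id": 2, "content": "second story"}'
    ']}'
)

# Parse the JSON text into Python dicts and lists.
data = json.loads(payload)

# Collect the text field of each item.
stories = [item["content"] for item in data["items"]]
for story in stories:
    print(story)
```

The actual endpoint URL and field names can usually be found in the Network tab of the browser's developer tools.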

This post is only a first look at requests and BeautifulSoup; more posts will follow.
