Steps and Tools for a Python Web Crawler

阿新 • Published: 2018-09-03


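#Four steps

1. Inspect the source format of the content you want to crawl. The target can be URLs (links), text, images, or video.

2. Request the page source. You may need to configure a proxy, a rate limit, or cookies.

3. Match the content you want, using regular expressions.

4. Save the data, using file operations.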
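Steps 2 to 4 can be sketched with the standard library alone. This is only a minimal sketch under stated assumptions: the helper names (`build_opener`, `Throttle`, `fetch`, `match`, `save`), the proxy address in the comment, and the 10-second timeout are all illustrative and do not come from the original post.

```python
import http.cookiejar
import re
import time
import urllib.request
from pathlib import Path


def build_opener(proxy=None):
    """Step 2 setup: an opener that keeps cookies and optionally uses a proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())]
    if proxy:  # e.g. "http://127.0.0.1:8080" (hypothetical address)
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)


class Throttle:
    """Step 2 setup: keep consecutive requests at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self.last = None  # time of the previous request, if any

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.delay - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()


def fetch(opener, throttle, url):
    """Step 2: request the page source, honouring the rate limit."""
    throttle.wait()
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def match(html, pattern):
    """Step 3: return every capture of `pattern` found in the source."""
    return re.findall(pattern, html, re.S)


def save(items, path):
    """Step 4: write one matched item per line."""
    Path(path).write_text("\n".join(items), encoding="utf-8")
```

A typical call chain would be `save(match(fetch(opener, throttle, url), pattern), "links.txt")`, where `url` and `pattern` depend on what step 1 revealed about the page.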

#Two basic tools (libraries)

1.urllib

2.requests
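For comparison with requests, here is roughly what fetching a page looks like with urllib from the standard library. It is a sketch, not a recommended implementation: the function name `fetch_text`, the `User-Agent` value, and the UTF-8 fallback are assumptions made for illustration.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch `url` with urllib and return the body decoded as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Prefer the charset declared in the response headers; fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_text("https://www.woyaogexing.com/tupian/keai")` would return the page source as a string. With requests, the same result comes from `requests.get(url).text`, which is the main reason it is usually the more convenient of the two.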

#An example using the requests library: scraping cute pictures

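import re
import requests  # third-party HTTP library

url = 'https://www.woyaogexing.com/tupian/keai'  # page to scrape
response = requests.get(url)  # get() fetches the page
response.encoding = 'utf-8'  # so that Chinese text in the source displays correctly
html = response.text  # page source as text
strs = '<div class="txList_1 .">.*?src="(.*?)".*?>'  # regular expression
pattern = re.compile(strs, re.S)  # compile into an object so it can be reused
items = pattern.findall(html)  # match: a list of the captured src values
for index, src in enumerate(items):
    # create a new file in binary write mode 'wb'
    with open('%d.jpg' % index, 'wb') as file:
        img_url = 'https:' + src  # the captured src has no scheme, so add one
        file.write(requests.get(img_url).content)  # write the data; images are binary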
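To see what the regular expression captures without touching the network, it can be run against a small sample. The HTML below is a hypothetical snippet shaped like the markup the pattern expects, not the site's real source. The `img.example.com` host and the `image_urls` helper are also assumptions. The sketch uses `urljoin` as an alternative to prepending `'https:'` by hand.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup: each block has a two-part class and a scheme-relative src.
SAMPLE_HTML = '''
<div class="txList_1 a">
  <a href="/tupian/keai/1.html"><img class="lazy" src="//img.example.com/a.jpg" width="180"></a>
</div>
<div class="txList_1 b">
  <a href="/tupian/keai/2.html"><img class="lazy" src="//img.example.com/b.jpg" width="180"></a>
</div>
'''

PAGE_URL = "https://www.woyaogexing.com/tupian/keai"

# Same pattern as the example; re.S lets "." match newlines between tags.
pattern = re.compile(r'<div class="txList_1 .">.*?src="(.*?)".*?>', re.S)


def image_urls(html, page_url):
    """Return absolute image URLs for every src captured by the pattern."""
    return [urljoin(page_url, src) for src in pattern.findall(html)]


urls = image_urls(SAMPLE_HTML, PAGE_URL)
for index, url in enumerate(urls):
    # enumerate gives each image its own number, even if two URLs are identical
    print(f"{index}.jpg <- {url}")
```

Numbering the files with `enumerate` matters because `list.index` returns the position of the first equal element, so two identical URLs would otherwise be written to the same file name.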
