Python Image Crawler: Scraping Girl Pictures from the 愛女神 Site (www.znzhi.net), Advanced Part

阿新 • Published: 2019-02-07

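In the basics article, I explained the basic steps of an image crawler and implemented the crawler code.

In this article, I will walk you through improving the code from the basics article by adding multithreading to raise crawling efficiency.

First, let us be clear about where the improvement goes, namely in the function downloadAlbum(url):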

    # Loop over the album and download each picture
    for num in range(1, pic_count+1):
        pic_url = url+"/"+str(num)+'.html'
        i=0
        while downloadOnePic(path, pic_url) == False:
            print 'redownload'
            i=i+1
            if i > 5:
                print "timeout's time too much"
                break

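Move the call to downloadOnePic() in the for loop body into a child thread, then create the child threads in the loop: for n pictures, the loop creates n child threads, so all n pictures are downloading at the same time.

1. Create the child thread class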

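class threadDownload(threading.Thread):
    def __init__(self,path,url):
        threading.Thread.__init__(self)
        # Path argument
        self.path = path
        # URL argument
        self.url = url
    def run(self):
        i=0
        # Retry the download until it succeeds or fails more than 5 times
        while downloadOnePic(self.path, self.url) == False:
            print 'redownload'
            i=i+1
            if i > 5:
                print "timeout's time too much"
                break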

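2. Create the child threads in a loop

    # Loop over the album and start one thread per picture download
    for num in range(1, pic_count+1):
        pic_url = url+"/"+str(num)+'.html'
        # Create a new child thread
        threadD = threadDownload(path,pic_url)
        # Start the child thread
        threadD.start()
    # In the main thread, keep polling the number of currently active threads
    while threading.active_count() != 0:
        # When the count is 1, only the main thread is left: every child thread
        # has finished, so all pictures have been downloaded
        if threading.active_count() == 1:
            print '  all pic has downloaded of this page:' + url
            return True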

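Where the original for loop body called downloadOnePic(), the code now creates a new child thread and starts it.

A while loop then polls the number of currently active threads. When that number equals 1, meaning only the main thread is left, all child threads have finished and all pictures have been downloaded, so the function returns.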
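
The polling loop above works, but it keeps the CPU busy while waiting, and threading.active_count() counts every thread in the process, not only the download threads. A common alternative is to keep the thread objects and call join() on each of them. The following is a minimal sketch of that idea, not the author's code: DemoDownload and download_all are hypothetical names, and the real download is replaced by a short sleep so the example runs without network access.

```python
import threading
import time


class DemoDownload(threading.Thread):
    """Hypothetical stand-in for threadDownload: sleeps instead of downloading."""

    def __init__(self, pic_url, finished):
        threading.Thread.__init__(self)
        self.pic_url = pic_url
        self.finished = finished

    def run(self):
        # Placeholder for downloadOnePic(path, url)
        time.sleep(0.01)
        # list.append is atomic in CPython, so no extra lock is used here
        self.finished.append(self.pic_url)


def download_all(pic_urls):
    finished = []
    threads = [DemoDownload(u, finished) for u in pic_urls]
    for t in threads:
        t.start()
    # join() blocks until the given thread has ended, so no polling is needed
    for t in threads:
        t.join()
    return finished


done = download_all(["album/%d.html" % n for n in range(1, 6)])
print(len(done))
```

With this approach the main thread sleeps until each child thread ends, and only the threads created for the current album are waited on.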

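The crawler improvement is now complete: multithreading raises the crawling efficiency. If you see anything else that could be improved, feel free to share and discuss.

The complete code is as follows: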
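
One possible further improvement concerns the number of simultaneous downloads: starting one thread per picture means a large album opens that many connections at once. The sketch below caps the number of downloads running at the same time with a semaphore. It is only an illustration under stated assumptions: MAX_CONCURRENT, fake_download and download_limited are hypothetical names, the limit of 4 is arbitrary, and the download itself is simulated with a sleep.

```python
import threading
import time

MAX_CONCURRENT = 4  # assumed limit, chosen only for illustration
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
state = {"running": 0, "peak": 0}


def fake_download(pic_url):
    # At most MAX_CONCURRENT threads can be inside this block at once
    with slots:
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)  # placeholder for the real download
        with lock:
            state["running"] -= 1


def download_limited(pic_urls):
    threads = [threading.Thread(target=fake_download, args=(u,))
               for u in pic_urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Highest number of downloads that were in progress simultaneously
    return state["peak"]


peak = download_limited(["album/%d.html" % n for n in range(1, 21)])
print(peak <= MAX_CONCURRENT)
```

All threads are still created up front; the semaphore only limits how many of them are downloading at any moment.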

# coding:utf8
# Python environment: 2.7
# Site crawled: http://www.znzhi.net/
# author: CodeZ

import os
import re
import threading
import urllib2

import time

from bs4 import BeautifulSoup

BASE_PATH = 'picture'
HOST_HOT = 'http://www.znzhi.net/hot'
HOST_ALBUM = 'http://www.znzhi.net/p'
MAX_PAGE_NUM = 387

class threadDownload(threading.Thread):
    def __init__(self,path,url):
        threading.Thread.__init__(self)
        # Path argument
        self.path = path
        # URL argument
        self.url = url
    def run(self):
        i=0
        # Retry the download until it succeeds or fails more than 5 times
        while downloadOnePic(self.path, self.url) == False:
            print 'redownload'
            i=i+1
            if i > 5:
                print "timeout's time too much"
                break

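def downloadUrl(url):
    # Catch exceptions (e.g. timeouts)
    try:
        # Open the page
        response = urllib2.urlopen(url, timeout=10)
        # Intended to set the encoding; this attribute does not change
        # the bytes returned by read()
        response.encoding = 'utf-8'
        # Check the status of the HTTP request
        if response.getcode() == 200:
            # Status OK (200): return the page data
            return response.read()
        else:
            # Failure: print a message and return empty data
            print "error:url visit failed"
            return ''
    except Exception as e:
        # Print the exception (str(e) also works when e.message is empty)
        print "exception:" + str(e)
        print "reopen"
        # Download again (note: this retries without limit while the error persists)
        return downloadUrl(url)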

# Get the picture name
def getPicName(picUrl):
    # Take the text after the last '/' in the address, i.e. the picture name
    picName = os.path.basename(picUrl)
    if '.jpg' in picName:
        return picName
    return 'error.jpg'

# Download a single picture
def downloadOnePic(path,url):
    try:
        soup = BeautifulSoup(downloadUrl(url),
                             'html.parser',
                             from_encoding='utf-8')
        # Get the img node; if the page is empty this raises and is
        # reported as a failed download instead of killing the thread
        img_node = soup.find('div', class_='main-image').find('img')
        # Get the src value of the img, i.e. the picture address
        pic_url = img_node.get('src')
        # Call getPicName() to get the picture name
        pic_name = getPicName(pic_url).encode('utf-8')
        # Request the picture address and read the data
        content = urllib2.urlopen(pic_url, timeout=10).read()
        # Save the picture locally
        with open(path + '/' + pic_name, 'wb') as code:
            code.write(content)
        print '  -> ' + pic_name + " download success"
    # Catch exceptions
    except Exception as e:
        print "exception:" + str(e)
        return False
    return True

def downloadAlbum(url):
    print "album:"+url
    # Fetch the data of the current page
    content = downloadUrl(url)
    # Pass the page data into BeautifulSoup to create the soup object
    soup = BeautifulSoup(content,
                         'html.parser',
                         from_encoding='utf-8')
    # Get the h2 tag that holds the album title
    title = soup.find('div', class_='content').find('h2')
    # Check that there is content; empty albums did show up during real crawls
    if title is None:
        print "error:web content has lost"
        return
    # Use a regex to pick the total picture count out of the title
    title_num = re.findall(r'\d+', title.get_text())
    pic_count = int(title_num[-1])
    # Cut off the "(1/num)" part and append the total picture count as [num]
    title_split = title.get_text().split(' (', 1)
    album_title = title_split[0]+'['+str(pic_count)+']'
    # Remove '/' from the title so that creating a folder named after it does not fail
    album_title = album_title.replace('/', ' ')
    # Build the local folder path and check whether it exists, to avoid downloading twice
    path = BASE_PATH + "/" + album_title
    if os.path.exists(path):
        print '  -> ' + album_title + ' has exists'
        return True
    # Create the folder that holds the pictures of this album
    checkDocuments(path)
    print path
    # Create an html file with information about this album
    with open(path+'/source.html','w') as fout:
        fout.write("<html>")
        fout.write("<body>")
        fout.write("<p>"+album_title.encode('utf-8')+"-["+str(pic_count)+"p]"+"</p>")
        fout.write("<a href=\""+url.encode('utf-8')+"\">Source URL:"+url.encode('utf-8')+"</a>")
        fout.write("</body>")
        fout.write("</html>")
    # Loop over the album and start one thread per picture download
    for num in range(1, pic_count+1):
        pic_url = url+"/"+str(num)+'.html'
        # Create a new child thread
        threadD = threadDownload(path,pic_url)
        # Start the child thread
        threadD.start()
    # In the main thread, keep polling the number of currently active threads
    while threading.active_count() != 0:
        # When the count is 1, only the main thread is left: every child thread
        # has finished, so all pictures have been downloaded
        if threading.active_count() == 1:
            print '  all pic has downloaded of this page:' + url
            return True

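def downloadPage(url):
    print "page:"+url
    # Fetch the data of the current page
    content = downloadUrl(url)
    # Pass the page data into BeautifulSoup to create the soup object
    soup = BeautifulSoup(content,
                         'html.parser',
                         from_encoding='utf-8')
    # Get the parent node of the 18 albums on a single page
    album_block = soup.find('ul', id='images')
    # Get the set of a nodes under the parent node that hold album addresses
    album_nodes = album_block.findAll('a', href=re.compile(r'http://www.znzhi.net/p/'))
    # Each album has two a tags, so [::2] takes every other node (even indices);
    # loop over them and download each album
    for album_node in album_nodes[::2]:
        # Call downloadAlbum with album_node.get('href'),
        # the href value of the a node, i.e. the album address
        downloadAlbum(album_node.get('href'))
        # To stop the crawler while it is running, create a stop.txt file
        # in the same directory
        if os.path.exists('stop.txt'):
            exit(0)
        # Sleep between album downloads so the site does not block us
        # for visiting too frequently
        time.sleep(4)

# Check whether the local path exists and create it if it does not
def checkDocuments(path):
    if not os.path.exists(path):
        os.mkdir(path)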
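
# main function
if __name__ == "__main__":
    # Check whether the local download path exists
    checkDocuments(BASE_PATH)
    # Loop over the pages
    for i in range(1, MAX_PAGE_NUM+1):
        # Build the page address; the format is http://www.znzhi.net/hot/<page number>.html
        page_url = HOST_HOT+'/'+str(i)+'.html'
        # Save the current page number so download progress can be checked
        with open('cur_page.txt', 'w') as fpage:
            fpage.write(str(i))
        # Download one page at a time
        downloadPage(page_url)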
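
In the listing above, downloadUrl() calls itself again on every exception, so a page that never loads is retried without end. A bounded retry is one way to avoid that. The sketch below is an assumption-laden illustration rather than part of the original crawler: fetch_with_retry, flaky_fetch, max_retries and delay are hypothetical names, and the network call is passed in as a plain function so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, delay=0.0):
    """Call fetch(url); on an exception retry up to max_retries times.

    Returns '' if every attempt fails, mirroring the empty result
    that downloadUrl() returns on a bad HTTP status.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as e:
            print("exception: " + str(e))
            time.sleep(delay)  # optional pause before the next attempt
    return ''


calls = []


def flaky_fetch(url):
    # Hypothetical fetcher: fails twice, then succeeds
    calls.append(url)
    if len(calls) < 3:
        raise IOError("timed out")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, "http://example.invalid/1.html")
print(page)
```

In the crawler itself, the fetch argument could be a small function wrapping urllib2.urlopen(url, timeout=10).read().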
