A simple Python 3 multithreaded image scraper using urllib3

阿新 • Published: 2019-02-13

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @author: liukelin [email protected]
# Multithreaded image scraper
#
import urllib3
import re
import os
import time
import threading

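dir_ = os.getcwd()


def set_logs(msg, log_dir=''):
    """Append a timestamped message to logs.log in log_dir (default: dir_)."""
    logStr = "[%s]%s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg)
    log_dir = log_dir if log_dir != '' else dir_
    # The with block closes the file even if the write fails
    with open('%s/logs.log' % log_dir, 'a', encoding='utf-8') as f:
        f.write(logStr)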

def get_img(url, begin_page, end_page, save_dir, threadNum, threadNo=None):
    """Download the .jpg images found on pages begin_page..end_page of url."""
    set_logs('Start fetching: %s.' % threadNo)

    # A PoolManager keeps one connection pool per host.
    # Single-host alternative:
    # http_pool = urllib3.HTTPConnectionPool('wanimal1983.tumblr.com')
    http = urllib3.PoolManager()

    # Compile both patterns once instead of on every iteration
    page_image = re.compile(r'<img src="(.+?)"')  # src attribute of img tags
    pattern = re.compile(r'^https?://.*\.jpg$')  # keep .jpg URLs (http or https)

    for i in range(begin_page, end_page + 1):
        # Each thread handles only the pages where i % threadNum == threadNo.
        # Compare against None so that thread 0 is filtered as well.
        if threadNo is not None and i % threadNum != threadNo:
            continue

        findNum = 0  # images matched on this page
        dowNum = 0  # images saved from this page

        # m = urllib.request.urlopen(url+str(i)).read()
        # m = http_pool.urlopen('GET',url+str(i) ,redirect=False)
        try:
            r = http.request('GET', url + str(i))
        except Exception:
            # Rebuild the pool and retry once
            http = urllib3.PoolManager()
            r = http.request('GET', url + str(i))

        m = r.data
        # print(m)

        # Directory that receives the images
        dirpath = save_dir
        # Optional: one subdirectory per page
        # dirname = str(i)
        # new_path = os.path.join(dirpath, dirname)
        # if not os.path.isdir(new_path):
        #     os.makedirs(new_path)

        page_data = m.decode('UTF-8')
        for image in page_image.findall(page_data):
            if pattern.match(image):
                findNum += 1
                try:
                    image_name = image.split("/")[-1]  # file name from the URL
                    image_path = dirpath + '/' + image_name

                    ret = 'fail'
                    if not os.path.exists(image_path):
                        # image_data = urllib.request.urlopen(image).read()
                        m2 = http.request('GET', image)
                        image_data = m2.data

                        # The with block closes the file on exit
                        with open(image_path, 'wb') as image_file:
                            image_file.write(image_data)

                        ret = 'ok'
                        dowNum += 1
                    # print("%s:%s %s" %(time.strftime("%Y-%m-%d %H:%M:%S"),image_name,ret))
                except Exception:
                    print('Download failed')

        # One summary line per page; set_logs adds its own timestamp
        msg = "%s, found: %s, saved: %s, thread: %s" % (url + str(i), findNum, dowNum, threadNo)
        set_logs(msg, dir_)
        print("[%s]%s" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg))

if __name__ == "__main__":
    # URL prefix to scrape; the page number is appended to it
    url = "http://wanimal1983.tumblr.com/page/"
    # Where the images are saved
    dir_ = '/Users/liukelin/Desktop/WANIMAL2'
    os.makedirs(dir_, exist_ok=True)
    # start page
    begin_page = 1
    # end page
    end_page = 122
    # total number of threads
    threadNum = 5

    threads = []
    for i in range(0, threadNum):
        t = threading.Thread(target=get_img, name='get_img',
                             args=(url, begin_page, end_page, dir_, threadNum, i))
        threads.append(t)

    for t in threads:
        t.daemon = True  # setDaemon() is deprecated
        t.start()

    for t in threads:
        t.join()

    print('all over:%s' % time.strftime("%Y-%m-%d %H:%M:%S"))
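
The script divides the work by page number: a thread is meant to handle only the pages whose number, taken modulo the thread count, equals its own index. The sketch below isolates that partitioning so it can be checked without any network access. The helper name `pages_for_thread` is hypothetical and does not appear in the script; it only illustrates the intended split, assuming thread indices run from 0 to thread_num - 1.

```python
def pages_for_thread(begin_page, end_page, thread_num, thread_no):
    """Return the pages in [begin_page, end_page] assigned to one thread.

    Hypothetical helper: a page belongs to the thread whose index equals
    page % thread_num, so every page is assigned to exactly one thread.
    """
    return [page for page in range(begin_page, end_page + 1)
            if page % thread_num == thread_no]


if __name__ == "__main__":
    # Same range and thread count as the script's settings
    for thread_no in range(5):
        pages = pages_for_thread(1, 122, 5, thread_no)
        print("thread %s: %s pages, first %s" % (thread_no, len(pages), pages[:3]))
```

Note that index 0 is a valid thread number here, so any check on the thread index has to compare against None and not rely on truthiness; otherwise thread 0 would skip the filter and fetch every page.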
