
A Small Python Tool for Batch Downloading Website Images (Part 2)


Introduction

The earlier article, 批量下載網站圖片的Python實用小工具 (a practical Python tool for batch downloading website images), walked through building a small Python tool that fetches a website's images concurrently and in bulk. That tool, however, only handled the specific rules of one specific website. Building on its code, this article develops a more general-purpose image downloader.

The General Version

Approach

We can build a general framework for downloading image resources:

  1. Define a set of rules, PageRules, for generating the page URLs;
  2. Fetch the set of page contents, PageContents, according to PageRules;
  3. Define the rule set (or rule path), ResourceRules, for extracting the real resource addresses from PageContents;
  4. Use ResourceRules to obtain the real resource addresses, ResourceTrulyAddresses, in bulk;
  5. Download the resources in bulk from ResourceTrulyAddresses.

Picture it as a pipeline:

initial URLs --> substitution rules --> more URLs --> fetch page contents --> extract the specified link elements A --> intermediate URLs --> fetch page contents --> extract the specified link elements B --> final set of image source addresses C --> download the images

We call [A, B, C] the rule path that leads to the image source addresses. Here A and B are typically <a href="xxx" class="yyy"> link elements, while C is typically an <img src="xxx.jpg" /> element.

The URLs here do not necessarily end in .html, but they usually point to HTML documents, so the suffix does not affect fetching the page content.
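For example, the rule path used for the zcool example in the usage section below can be written as a list of rules, each mapping a match type to a list of values; this mirrors the format accepted by the script's -r option:

rulePath = [
    {"class": ["image-link"]},   # A/B: follow <a> links whose class is "image-link"
    {"img": ["jpg"]},            # C: collect <img> tags whose src contains "jpg"
]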

To make image downloading more general, the general version does the following:

  • Extract the thread pool and process pool into reusable base components that can be shared across the different stages and across scripts;
  • Split the work into finer-grained steps, separating "fetch page content" and "extract link elements from page content by rule" into single-purpose operations;
  • Provide batch operations built on top of the single operations, which can run concurrently or in parallel, and use map to keep the expressions concise;
  • Provide command-line argument parsing and execution.

One point worth highlighting: making the operations as fine-grained and reusable as possible is extremely helpful for offering flexible, varied options. For example, turning the image-download step into a single function makes it possible to offer an option that downloads a set of image addresses read from a file; turning "obtain the initial URLs" into a single function makes it possible to offer an option that reads the initial URLs from a file; turning "extract link elements from page content by rule" into a single function lets the rules be pulled out and passed in as command-line parameters. A single function can also absorb the handling of different cases: for instance getAbsLink, which resolves absolute resource links, isolates the handling of relative links, absolute links, and links that do not meet the requirements. A small sketch of this kind of reuse follows.
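As a minimal sketch of the idea (downloadPic and the dwPicPool thread pool are defined in the code below; readAddressesFromFile is a hypothetical helper added only for illustration), a "download the addresses listed in a file" option could reuse the single-purpose download function directly:

def readAddressesFromFile(path):
    # hypothetical helper: one image address per line
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# reuse the same single-purpose operation and thread pool:
# dwPicPool.execTasksAsync(downloadPic, readAddressesFromFile('addresses.txt'))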

Code

#!/usr/bin/python
#_*_encoding:utf-8_*_

import os
import re
import sys
import json
from multiprocessing import (cpu_count, Pool)
from multiprocessing.dummy import Pool as ThreadPool

import argparse
import requests
from bs4 import BeautifulSoup

ncpus = cpu_count()
saveDir = os.environ['HOME'] + '/joy/pic/test'

def parseArgs():
    description = '''This program is used to batch download pictures from specified urls.
                     eg python dwloadpics_general.py -u 'http://xxx_placeholder.html' -g 1 10 -r '[{"img":["jpg"]}, {"class":["picLink"]}, {"id": ["HidenDataArea"]}]'
                     will search and download pictures from the urls http://xxx_[1-10].html following the specified rule path
                  '''
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('-u','--url', nargs='+', help='At least one html url is required', required=True)
    parser.add_argument('-g','--generate', nargs=2, help='Given a range of two numbers (start end), each number replaces the placeholder token in the base urls to generate more urls', required=False)
    parser.add_argument('-r','--rulepath', nargs=1, help='rule path used to search for pictures; if not given, search for pictures in the given urls', required=False)
    args = parser.parse_args()
    init_urls = args.url
    gene = args.generate
    rulepath = args.rulepath
    return (init_urls, gene, rulepath)

def createDir(dirName):
    if not os.path.exists(dirName):
        os.makedirs(dirName)

def catchExc(func):
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print "error catch exception for %s (%s, %s): %s" % (func.__name__, str(*args), str(**kwargs), e)
            return None
    return _deco

class IoTaskThreadPool(object):
    '''
       thread pool for io operations
    '''
    def __init__(self, poolsize):
        self.ioPool = ThreadPool(poolsize)

    def execTasks(self, ioFunc, ioParams):
        if not ioParams or len(ioParams) == 0:
            return []
        return self.ioPool.map(ioFunc, ioParams)

    def execTasksAsync(self, ioFunc, ioParams):
        if not ioParams or len(ioParams) == 0:
            return []
        self.ioPool.map_async(ioFunc, ioParams)

    def close(self):
        self.ioPool.close()

    def join(self):
        self.ioPool.join()

class TaskProcessPool():
    '''
       process pool for cpu operations or task assignment
    '''
    def __init__(self):
        self.taskPool = Pool(processes=ncpus)

    def addDownloadTask(self, entryUrls):
        self.taskPool.map_async(downloadAllForAPage, entryUrls)

    def close(self):
        self.taskPool.close()

    def join(self):
        self.taskPool.join()

def getHTMLContentFromUrl(url):
    '''
       get html content from html url
    '''
    r = requests.get(url)
    status = r.status_code
    if status != 200:
        return ''
    return r.text

def batchGrapHtmlContents(urls):
    '''
       batch get the html contents of urls
    '''
    global grapHtmlPool
    return grapHtmlPool.execTasks(getHTMLContentFromUrl, urls)

def getAbsLink(link):
    global serverDomain

    try:
        href = link.attrs['href']
        if href.startswith('/'):
            return serverDomain + href
        else:
            return href
    except:
        return ''

def getTrueImgLink(imglink):
    '''
    get the true address of image link:
        (1) the image link is http://img.zcool.cn/community/01a07057d1c2a40000018c1b5b0ae6.jpg@900w_1l_2o_100sh.jpg
            but the better link is http://img.zcool.cn/community/01a07057d1c2a40000018c1b5b0ae6.jpg (removing what after @) 
        (2) the image link is relative path /path/to/xxx.jpg
            then the true link is serverDomain/path/to/xxx.jpg serverDomain is http://somedomain
    '''

    global serverDomain
    try:
        href = imglink.attrs['src']
        if href.startswith('/'):
            href = serverDomain + href
        pos = href.find('jpg@')
        if pos == -1:
            return href
        return href[0: pos+3] 
    except:
        return ''

def batchGetImgTrueLink(imgLinks):
    hrefs = map(getTrueImgLink, imgLinks)
    return filter(lambda x: x!='', hrefs)

def findWantedLinks(htmlcontent, rule):
    '''
       find html links or pic links from html by rule.
       sub rules such as:
          (1) a link with id=[value1,value2,...]
          (2) a link with class=[value1,value2,...]
          (3) img with src=xxx.jpg|png|...
       a rule is map containing sub rule such as:
          { 'id': [id1, id2, ..., idn] } or
          { 'class': [c1, c2, ..., cn] } or
          { 'img': ['jpg', 'png', ... ]}

    '''

    soup = BeautifulSoup(htmlcontent, "lxml")
    alinks = []
    imglinks = []

    for (key, values) in rule.iteritems():
        if key == 'id':
            for id in values:
                links = soup.find_all('a', id=id)
                links = map(getAbsLink, links)
                links = filter(lambda x: x !='', links)
                alinks.extend(links)
        elif key == 'class':
            for cls in values:
                if cls == '*':
                    links = soup.find_all('a')
                else:    
                    links = soup.find_all('a', class_=cls)
                links = map(getAbsLink, links)
                links = filter(lambda x: x !='', links)
                alinks.extend(links)        
        elif key == 'img':
            for picSuffix in values:
                imglinks.extend(soup.find_all('img', src=re.compile(picSuffix)))

    allLinks = []
    allLinks.extend(alinks)
    allLinks.extend(batchGetImgTrueLink(imglinks))
    return allLinks

def batchGetLinksByRule(htmlcontentList, rule):
    '''
       find all html links or pic links from html content list by rule
    '''

    links = []
    for htmlcontent in htmlcontentList:
        links.extend(findWantedLinks(htmlcontent, rule))
    return links

def defineResRulePath():
    '''
        return the rule path from init htmls to the origin addresses of pics
        if we find the origin addresses of pics by
        init htmls --> grap htmlcontents --> rules1 --> intermediate htmls
           --> grap htmlcontents --> rules2 --> intermediate htmls
           --> grap htmlcontents --> rules3 --> origin addresses of pics
        we say the rulepath is [rules1, rules2, rules3]
    '''
    return []

def findOriginAddressesByRulePath(initUrls, rulePath):
    '''
       find Origin Addresses of pics by rulePath started from initUrls
    '''
    result = initUrls[:]
    for rule in rulePath:
        htmlContents = batchGrapHtmlContents(result)
        links = batchGetLinksByRule(htmlContents, rule)
        result = []
        result.extend(links)
        result = filter(lambda link: link.startswith('http://'),result)    
    return result

def downloadFromUrls(initUrls, rulePath):
    global dwPicPool
    picOriginAddresses = findOriginAddressesByRulePath(initUrls, rulePath)
    dwPicPool.execTasksAsync(downloadPic, picOriginAddresses)

@catchExc
def downloadPic(picsrc):
    '''
       download pic from pic href such as
            http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''

    picname = picsrc.rsplit('/',1)[1]
    saveFile = saveDir + '/' + picname

    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()

def divideNParts(total, N):
    '''
       divide [0, total) into N parts:
        return [(0, total/N), (total/N, 2*total/N), ..., ((N-1)*total/N, total)]
    '''

    each = total / N
    parts = []
    for index in range(N):
        begin = index*each
        if index == N-1:
            end = total
        else:
            end = begin + each
        parts.append((begin, end))
    return parts

def testBatchGetLinks():
    urls = ['http://dp.pconline.com.cn/list/all_t145.html', 'http://dp.pconline.com.cn/list/all_t292.html']
    htmlcontentList = map(getHTMLContentFromUrl, urls)
    rules = {'class':['picLink'], 'id': ['HidenDataArea'], 'img':['jpg']}
    allLinks = batchGetLinksByRule(htmlcontentList, rules)
    for link in allLinks:
        print link

def generateMoreInitUrls(init_urls, gene):
    '''
      Generate more initial urls using init_urls and a range specified by gene
      to generate urls, we give a base url containing a placeholder, then replace placeholder with number.
       eg. 
       base url:  http://xxx.yyy?k1=v1&k2=v2&page=placeholder -> http://xxx.yyy?k1=v1&k2=v2&page=[start-end]
       base url is specified by -u option if -g is given.
    '''

    if not gene:
        return init_urls

    start = int(gene[0])
    end = int(gene[1])
    truerange = range(start, end + 1)
    resultUrls = []
    for ind in truerange:
        for url in init_urls:
            resultUrls.append(url.replace('placeholder', str(ind)))
    return resultUrls

def parseRulePathParam(rulepathjson):
    rulepath = [{'img': ['jpg', 'png']}]
    if rulepathjson:
        try:
            rulepath = json.loads(rulepathjson[0])   
        except ValueError as e:
            print 'Param Error: invalid rulepath %s %s' % (rulepathjson, e)
            sys.exit(1) 
    return rulepath

def parseServerDomain(url):
    parts = url.split('/',3)
    return parts[0] + '//' + parts[2]


if __name__ == '__main__':

    #testBatchGetLinks()

    (init_urls, gene, rulepathjson) = parseArgs()
    moreInitUrls = generateMoreInitUrls(init_urls, gene)
    print moreInitUrls
    rulepath = parseRulePathParam(rulepathjson)
    serverDomain = parseServerDomain(init_urls[0])

    createDir(saveDir)

    grapHtmlPool = IoTaskThreadPool(20)
    dwPicPool = IoTaskThreadPool(20)

    downloadFromUrls(moreInitUrls, rulepath)
    dwPicPool.close()
    dwPicPool.join()

Usage

You need a shell console or terminal emulator with Python 2.7 installed, along with easy_install (pip) and the argparse, requests, and bs4 (BeautifulSoup) packages.
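For example, the third-party dependencies can typically be installed with pip (lxml is included because the scripts create BeautifulSoup with the "lxml" parser; argparse already ships with Python 2.7):

pip install requests beautifulsoup4 lxml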

a. The current page already contains the real addresses of the images, and you want to download all of them directly, for example http://bbs.voc.com.cn/topic-7477222-1-1.html. You can run

python dwloadpics_general.py -u http://bbs.voc.com.cn/topic-7477222-1-1.html

to easily download all of the images on that page;

b. The current page shows thumbnails of a series of images, which link to pages containing the real image addresses. For example, opening http://www.zcool.com.cn/works/33!35!!0!0!200!1!1!!!/ presents a series of thumbnails along with links to the high-resolution images. Inspecting a link in the browser console shows className = "image-link", and the final high-resolution image is a .jpg <img> element. So the rule path to the real image addresses is [{"class":["image-link"]}, {"img":["jpg"]}], and the command line is:

python dwloadpics_general.py -u 'http://www.zcool.com.cn/works/33!35!!0!0!200!1!1!!!/' -r '[{"class":["image-link"]}, {"img":["jpg"]}]'

Single quotes are used here so that the shell treats special characters such as ! and spaces as ordinary characters instead of interpreting them.

c. Fetching multiple pages

Suppose we are interested in this landscape series and want to download the images from every page in one batch. How? Start by analysing the URLs: the first page is http://www.zcool.com.cn/works/33!35!!0!0!200!1!1!!!, the fifth page is http://www.zcool.com.cn/works/33!35!!0!0!200!1!5!!!, and so on, so page i is http://www.zcool.com.cn/works/33!35!!0!0!200!1!i!!!. We only need to generate the initial urls http://www.zcool.com.cn/works/33!35!!0!0!200!1![1-N]!!!. This is where the -g option comes in handy!

python dwloadpics_general.py -u 'http://www.zcool.com.cn/works/33!35!!0!0!200!1!placeholder!!!' -r '[{"class":["image-link"]}, {"img":["jpg"]}]' -g 1 2

-u passes in the base url http://www.zcool.com.cn/works/33!35!!0!0!200!1!placeholder!!!, and -g generates each number i in the given range and substitutes it for placeholder, assembling the target urls.
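As a quick illustration of what generateMoreInitUrls (defined in the code above) returns for these options:

# generateMoreInitUrls(['http://www.zcool.com.cn/works/33!35!!0!0!200!1!placeholder!!!'], ['1', '2'])
# -> ['http://www.zcool.com.cn/works/33!35!!0!0!200!1!1!!!',
#     'http://www.zcool.com.cn/works/33!35!!0!0!200!1!2!!!']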

The Ultimate Killer Version

Approach

A technologist's pursuit of refinement never ends, even if the product and business folks don't always see it the same way. ^_^

Even the general version still has some inconveniences:

(1) You have to be familiar with how rule paths are defined, and even know a bit of CSS; for ordinary users this is genuinely hard to grasp;

(2) The ways websites store images vary endlessly, and rules extracted from a few sites do not generalize effectively.

So another idea took shape: build an ultimate killer version.

Everything is a link: take it, or discard it. The form of the truth is always this simple and elegant, yet deciding what to take and what to discard is where real wisdom shows, and the content of the truth is just as intricate.

If the Internet is one giant interconnected spider web, then starting from any single point, a breadth-first traversal can eventually reach every corner. The general version essentially prescribes one direct path to the target. Looked at another way, all we really need is a single initial URL: recursively fetch links, analyse the fetched content to obtain more links, and finally extract the img elements.

To this end, define another parameter: loop, the number of rounds, or the depth. Suppose that starting from the initial URL init_url we go init_url -> mid_1 url -> mid_2 url -> origin address of pic (OAOP); then loop = 3. That is, we obtain the mid_1 urls from init_url, obtain the mid_2 urls from the contents of the mid_1 documents, and obtain the real image addresses OAOP from the contents of the mid_2 documents: three rounds, much like making transfers on a journey. The user no longer needs to know anything about the browser console, CSS classes, or rule paths.
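A minimal sketch of this round-based control flow (the full implementation, startDownload in the code below, additionally filters links against the whitelist and downloads the images in the final round; crawlRounds is only an illustrative name):

def crawlRounds(init_url, loops):
    urls = [init_url]
    for round_no in range(loops):
        pages = map(getHTMLContentFromUrl, urls)        # fetch this round's pages
        if round_no == loops - 1:
            return flat(map(findAllImgLinks, pages))    # last round: real image addresses
        urls = flat(map(findAllALinks, pages))          # earlier rounds: follow <a> links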

The crux now is what to take and what to discard. Taking is relatively simple: just collect the link elements; discarding is the real art. A link can lead from one page to any page on the web, and without restrictions the crawl spirals out of control, so a whitelist of websites is defined and only links within whitelisted sites are followed. As for image elements, a look at a few mainstream sites shows that the final images are basically in jpg format; and since the goal is high-resolution pictures, there are also size requirements, so a size parameter is provided for the user to choose. Deciding from the image content itself whether a picture is wanted would be smarter still, but that is beyond my ability for now.

Now, to crawl http://dp.pconline.com.cn/list/all_t145_p1.html, you only need to run

python dwloadpics_killer.py -u 'http://dp.pconline.com.cn/list/all_t145_p1.html' -l 3

to download plenty of pictures! The larger the loop value, the wider the range of pages crawled and the more traffic consumed, so use it with care. So crazy!
 

Code

#!/usr/bin/python
#_*_encoding:utf-8_*_

import os
import re
import sys
import json
from multiprocessing import (cpu_count, Pool)
from multiprocessing.dummy import Pool as ThreadPool

import argparse
import requests
from bs4 import BeautifulSoup
from PIL import Image

ncpus = cpu_count()
saveDir = os.environ['HOME'] + '/joy/pic/test'
whitelist = ['pconline', 'zcool', 'huaban', 'taobao', 'voc']

DEFAULT_LOOPS = 1
DEFAULT_WIDTH = 800
DEFAULT_HEIGHT = 600

def isInWhiteList(url):
    for d in whitelist:
        if d in url:
            return True
    return False    


def parseArgs():
    description = '''This program is used to batch download pictures from specified initial url.
                     eg python dwloadpics_killer.py -u init_url
                  '''   
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('-u','--url', help='One initial url is required', required=True)
    parser.add_argument('-l','--loop', help='download url depth')
    parser.add_argument('-s','--size', nargs=2, help='specify the minimum expected size as two numbers (width height)')
    args = parser.parse_args()
    init_url = args.url
    size = args.size
    loops = int(args.loop) if args.loop else DEFAULT_LOOPS
    if size is None:
        size = [DEFAULT_WIDTH, DEFAULT_HEIGHT]
    else:
        size = [int(s) for s in size]
    return (init_url, loops, size)

def createDir(dirName):
    if not os.path.exists(dirName):
        os.makedirs(dirName)

def catchExc(func):
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print "error catch exception for %s (%s, %s): %s" % (func.__name__, str(*args), str(**kwargs), e)
            return None
    return _deco

class IoTaskThreadPool(object):
    '''
       thread pool for io operations
    '''
    def __init__(self, poolsize):
        self.ioPool = ThreadPool(poolsize)

    def execTasks(self, ioFunc, ioParams):
        if not ioParams or len(ioParams) == 0:
            return []
        return self.ioPool.map(ioFunc, ioParams)

    def execTasksAsync(self, ioFunc, ioParams):
        if not ioParams or len(ioParams) == 0:
            return []
        self.ioPool.map_async(ioFunc, ioParams)

    def close(self):
        self.ioPool.close()

    def join(self):
        self.ioPool.join()

class TaskProcessPool():
    '''
       process pool for cpu operations or task assignment
    '''
    def __init__(self):
        self.taskPool = Pool(processes=ncpus)

    def addDownloadTask(self, entryUrls):
        self.taskPool.map_async(downloadAllForAPage, entryUrls)

    def close(self):
        self.taskPool.close()

    def join(self):
        self.taskPool.join()

def getHTMLContentFromUrl(url):
    '''
       get html content from html url
    '''
    r = requests.get(url)
    status = r.status_code
    if status != 200:
        return ''
    return r.text

def batchGrapHtmlContents(urls):
    '''
       batch get the html contents of urls
    '''
    global grapHtmlPool
    return grapHtmlPool.execTasks(getHTMLContentFromUrl, urls)

def getAbsLink(link):
    global serverDomain

    try:
        href = link.attrs['href']
        if href.startswith('//'):
            return 'http:' + href
        if href.startswith('/'):
            return serverDomain + href
        if href.startswith('http://'):
            return href
        return ''
    except:
        return ''

def filterLink(link):
    '''
       only search for pictures in websites specified in the whitelist 
    '''
    if link == '':
        return False
    if not link.startswith('http://'):
        return False
    serverDomain = parseServerDomain(link)
    if not isInWhiteList(serverDomain):
        return False
    return True

def filterImgLink(imgLink):
    '''
       The true image addresses always end with .jpg
    '''
    commonFilterPassed = filterLink(imgLink)
    if commonFilterPassed:
        return imgLink.endswith('.jpg')
    return False

def getTrueImgLink(imglink):
    '''
    get the true address of image link:
        (1) the image link is http://img.zcool.cn/community/01a07057d1c2a40000018c1b5b0ae6.jpg@900w_1l_2o_100sh.jpg
            but the better link is http://img.zcool.cn/community/01a07057d1c2a40000018c1b5b0ae6.jpg (removing what after @) 
        (2) the image link is relative path /path/to/xxx.jpg
            then the true link is serverDomain/path/to/xxx.jpg serverDomain is http://somedomain
    '''

    global serverDomain
    try:
        href = imglink.attrs['src']
        if href.startswith('/'):
            href = serverDomain + href
        pos = href.find('jpg@')
        if pos == -1:
            return href
        return href[0: pos+3] 
    except:
        return ''

def findAllLinks(htmlcontent, linktag):
    '''
       find html links or pic links from html by rule.
    '''
    soup = BeautifulSoup(htmlcontent, "lxml")
    if linktag == 'a':
        applylink = getAbsLink
    else:
        applylink = getTrueImgLink
    alinks = soup.find_all(linktag)
    allLinks = map(applylink, alinks)
    return filter(lambda x: x!='', allLinks)

def findAllALinks(htmlcontent):
    return findAllLinks(htmlcontent, 'a')

def findAllImgLinks(htmlcontent):
    return findAllLinks(htmlcontent, 'img')

def flat(listOfList):
    return [val for sublist in listOfList for val in sublist]

@catchExc
def downloadPic(picsrc):
    '''
       download pic from pic href such as
            http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''

    picname = picsrc.rsplit('/',1)[1]
    saveFile = saveDir + '/' + picname

    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
    return saveFile

@catchExc
def removeFileNotExpected(filename):
    global size

    expectedWidth = size[0]
    expectedHeight = size[1]
    img = Image.open(filename)
    imgsize = img.size
    if imgsize[0] < expectedWidth or imgsize[1] < expectedHeight: 
       os.remove(filename) 

def downloadAndCheckPic(picsrc):
    saveFile = downloadPic(picsrc)
    removeFileNotExpected(saveFile)

def batchDownloadPics(imgAddresses):
    global dwPicPool
    dwPicPool.execTasksAsync(downloadAndCheckPic, imgAddresses)

def downloadFromUrls(urls, loops):
    htmlcontents = batchGrapHtmlContents(urls)
    allALinks = flat(map(findAllALinks, htmlcontents))
    allALinks = filter(filterLink, allALinks)
    if loops == 1:
        allImgLinks = flat(map(findAllImgLinks, htmlcontents))
        validImgAddresses = filter(filterImgLink, allImgLinks)
        batchDownloadPics(validImgAddresses)
    return allALinks

def startDownload(init_url, loops=3):
    '''
       if init_url -> mid_1 url -> mid_2 url -> true image address
       then loops = 3 ; default loops = 3
    '''
    urls = [init_url]
    while True:
        urls = downloadFromUrls(urls, loops) 
        loops -= 1
        if loops == 0:
            break

def divideNParts(total, N):
    '''
       divide [0, total) into N parts:
        return [(0, total/N), (total/N, 2*total/N), ..., ((N-1)*total/N, total)]
    '''

    each = total / N
    parts = []
    for index in range(N):
        begin = index*each
        if index == N-1:
            end = total
        else:
            end = begin + each
        parts.append((begin, end))
    return parts

def parseServerDomain(url):
    parts = url.split('/',3)
    return parts[0] + '//' + parts[2]

if __name__ == '__main__':

    (init_url,loops, size) = parseArgs()
    serverDomain = parseServerDomain(init_url)

    createDir(saveDir)

    grapHtmlPool = IoTaskThreadPool(10)
    dwPicPool = IoTaskThreadPool(10)

    startDownload(init_url, loops)
    dwPicPool.close()
    dwPicPool.join()

   

Summary

Implementing a tool for batch downloading images from a specific target website, and then evolving it from a serial version into a concurrent and more general one, taught me the following lessons:

  • Turning the thread pool, process pool, task assignment and other basic components into general-purpose, reusable pieces is what makes later programs cheaper to write, without repeating the same code over and over;

  • A more general and extensible program needs finer-grained, more reusable single-purpose micro-operations;

  • You need to separate what varies from what stays fixed, and stay alert to likely points of variation and ways to accommodate them;

  • By spotting patterns, distilling them into rules, and making those rules configurable via data structures, the tool becomes more general;

  • By probing the essence of the problem, you can reach a simpler and more effective design and implementation.

In practice, the rules used by image websites vary endlessly; rules distilled from one or a few sites will not necessarily work for others. To build a truly powerful general-purpose image downloader, one would need to survey how mainstream sites store and link their images, distill the many resulting rule sets, and build a rule-matching engine on top of them, or an even smarter downloading tool. Still, for everyday personal use, being able to download images from the sites you like, while gradually growing the rule set for resolving real image addresses, works just fine.

This article is original; please credit the source when reposting. Thanks! :)
