[Learning Python] A simple crawler for downloading images from a picture site's gallery

阿新 • Published: 2017-07-28


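My teacher recently had me study Python and Wikipedia-related topics. In my spare time I wrote a simple Python script that crawls the images in the Youxun (遊訊網) gallery, because clicking "next image" over and over is tedious and wastes time. This post mainly covers how to crawl HTML and how to download images with Python. I hope it helps. The site's images are quite beautiful, so I recommend viewing and downloading them on the original site; please support Youxun and do not abuse it.
Browsing Youxun shows that its gallery list pages are numbered from 0_0_1 to 0_0_75: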
http://pic.yxdown.com/list/0_0_1.html
http://pic.yxdown.com/list/0_0_75.html

The site has 75 list pages (1-75). Each page contains many topics, and each topic has several images.
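The list-page numbering described above can be sketched as follows. This is a minimal illustration in Python 3 (the article's own code is Python 2); the names `BASE` and `list_page_urls` are hypothetical helpers, not part of the original script.

```python
# Hypothetical helper: build the list-page URLs 0_0_1 .. 0_0_75.
BASE = "http://pic.yxdown.com/list/0_0_{}.html"


def list_page_urls(first=1, last=75):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [BASE.format(n) for n in range(first, last + 1)]


urls = list_page_urls()
print(urls[0])    # first list page
print(urls[-1])   # last list page
print(len(urls))  # number of list pages
```

Generating the URLs up front keeps the crawl loop simple, and passing a narrower range (for example `list_page_urls(3, 3)`) mirrors the article's advice to crawl one page per run.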

The source code is as follows:
(You need to create the E:\Picture3 directory locally, and a yxdown directory in the directory where Python runs.)

# coding=utf-8
# Declare the source encoding (the default is ASCII); see https://www.python.org/dev/peps/pep-0263/
import urllib
import time
import re
import os

'''
Download images from Youxun (yxdown.com) with Python. BY: Eastmount
'''

'''
**************************************************
# Step 1: walk the list pages and collect each topic's URL
#http://pic.yxdown.com/list/0_0_1.html
#http://pic.yxdown.com/list/0_0_75.html
**************************************************
'''
fileurl=open('yxdown_url.txt','w')
fileurl.write('****************Youxun image URLs*************\n\n') 
# Crawl one list page per run: use num=3 with while num<=3, then num=4 with while num<=4, rather than looping over 1-75
num=3
while num<=3:
    temp = 'http://pic.yxdown.com/list/0_0_'+str(num)+'.html'
    content = urllib.urlopen(temp).read()
    open('yxdown_'+str(num)+'.html','w+').write(content)
    print temp
    fileurl.write('****************Page '+str(num)+'*************\n\n')

    # Extract the URL of each topic
    # Inside <div class="cbmiddle"></div>: <a target="_blank" href="/html/5533.html" >
    count=1 # index of the topic on the current list page (used to name the saved HTML files)
    res_div = r'<div class="cbmiddle">(.*?)</div>'
    m_div = re.findall(res_div,content,re.S|re.M)
    for line in m_div:
        #fileurl.write(line+'\n')
        # Collect and record the URL of every topic on the page
        if "_blank" in line: # skip category lists such as list/1_0_1.html, list/2_0_1.html
            # Get the topic title
            fileurl.write('\n\n********************************************\n')
            title_pat = r'<b class="imgname">(.*?)</b>'
            title_ex = re.compile(title_pat,re.M|re.S)
            title_obj = re.search(title_ex, line)
            title = title_obj.group()
            print unicode(title,'utf-8')
            fileurl.write(title+'\n')
            # Get the URL
            res_href = r'<a target="_blank" href="(.*?)"'
            m_linklist = re.findall(res_href,line)
            #print unicode(str(m_linklist),'utf-8')
            for link in m_linklist:
                fileurl.write(str(link)+'\n') # e.g. "/html/5533.html"

                '''
                **************************************************
                # Step 2: open the topic page and save its HTML
                #http://pic.yxdown.com/html/5533.html#p=1
                # Note: create the local yxdown directory first, otherwise: No such file or directory
                **************************************************
                '''
                # The saved HTML page does not embed the original images, and appending '#p=1' to the URL fails with:
                #HTTP Error 400. The request URL is invalid.
                html_url = 'http://pic.yxdown.com'+str(link)
                print html_url
                html_content = urllib.urlopen(html_url).read() # content of the topic page
                # Comment out the next line to skip saving the static HTML
                open('yxdown/yxdown_html'+str(count)+'.html','w+').write(html_content)


                '''
                # Step 3: find the image URLs on the topic page
                # Image pages look like http://pic.yxdown.com/html/5530.html#p=1 #p=2
                # The "View original" (查看原圖) link is rendered as
                #<a href="javascript:;" style=""onclick="return false;">查看原圖</a>
                # It is driven by JavaScript, and the page keeps all image links between <script></script>
                # Extract "original":"http://i-2.yxdown.com/2015/3/18/6381ccc..3158d6ad23e.jpg"
                '''
                html_script = r'<script>(.*?)</script>'
                m_script = re.findall(html_script,html_content,re.S|re.M)
                for script in m_script:
                    res_original = r'"original":"(.*?)"' # original image
                    m_original = re.findall(res_original,script)
                    for pic_url in m_original:
                        print pic_url
                        fileurl.write(str(pic_url)+'\n')

                        '''
                        # Step 4: download the image
                        # If the site checks the client's identity, as Wikipedia does, add:
                            class AppURLopener(urllib.FancyURLopener):
                                version = "Mozilla/5.0"
                            urllib._urlopener = AppURLopener()
                        # See http://bbs.csdn.net/topics/380203601
                        #http://www.lylinux.org/python使用多線程下載圖片.html
                        '''
                        filename = os.path.basename(pic_url) # strip the directory path, keep the file name
                        # "No such file or directory": create the Picture3 directory first
                        urllib.urlretrieve(pic_url, 'E:\\Picture3\\'+filename)
                        #http://pic.yxdown.com/html/5519.html
                        #IOError: [Errno socket error] [Errno 10060] 
                
                # Handle only the first link; otherwise the same URL is processed twice
                break 
            
            # Move on to the next topic on the current page
            count=count+1
            time.sleep(0.1)  
        else:
            print 'no url about content'
        
    time.sleep(1)  
    num=num+1
else:
    print 'Download Over!!!'

Images from http://pic.yxdown.com/list/0_0_1.html were downloaded into the E:\Picture directory, and images from http://pic.yxdown.com/list/0_0_3.html into the E:\Picture3 directory.

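The code comments describe each step in detail, so what follows is only a brief walkthrough.

1. Walk the site's list pages and collect the URL of each topic. Every page holds many topics, each marked up like this: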

<!-- Step 1: the HTML to be crawled looks like this -->
<div class="conbox">
  <div class="cbtop">
  </div>
  <div class="cbmiddle">
  <a target="_blank" href="/html/5533.html" class="proimg">
    <img src="http://i-2.yxdown.com/2015/3/19/KDE5Mngp/a78649d0-9902-4086-a274-49f9f3015d96.jpg" alt="Miss大小姐駕到！細數《英雄聯盟》圈的電競女神" />
    <strong></strong>
    <p>
      <span><b></b>1836人看過</span>
      <em><b></b>10張</em>
    </p>
    <b class="imgname">Miss大小姐駕到！細數《英雄聯盟》圈的電競女神</b>
  </a>
  <a target="_blank" href="/html/5533.html" class="plLink"><em>1</em>人評論</a>
  </div>
  <div class="cbbottom">
  </div>
  <a target="_blank" class="plBtn" href="/html/5533.html"></a>
</div>

A list page consists of many <div class="conbox"></div> blocks. We only need the href from <a target="_blank" href="/html/5533.html" class="proimg">, which is then joined with the site's base URL to reach the topic page.
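The extraction just described can be sketched in Python 3 as below. It reuses the article's three regular expressions; the `SAMPLE` markup is an abridged stand-in for a real list page, and `topic_links` is a hypothetical helper name.

```python
import re

# Abridged, made-up stand-in for one "conbox" block of a list page.
SAMPLE = '''
<div class="conbox">
  <div class="cbmiddle">
  <a target="_blank" href="/html/5533.html" class="proimg">
    <b class="imgname">Sample title</b>
  </a>
  <a target="_blank" href="/html/5533.html" class="plLink"><em>1</em> comments</a>
  </div>
  <a target="_blank" class="plBtn" href="/html/5533.html"></a>
</div>
'''

BLOCK_RE = re.compile(r'<div class="cbmiddle">(.*?)</div>', re.S)
HREF_RE = re.compile(r'<a target="_blank" href="(.*?)"')
TITLE_RE = re.compile(r'<b class="imgname">(.*?)</b>', re.S)


def topic_links(html, base="http://pic.yxdown.com"):
    """Return (title, absolute URL) pairs, one per cbmiddle block."""
    topics = []
    for block in BLOCK_RE.findall(html):
        hrefs = HREF_RE.findall(block)
        if not hrefs:
            continue  # block without a topic link
        title = TITLE_RE.search(block)
        # Both links in a block point at the same page, so keep the first.
        topics.append((title.group(1).strip() if title else "", base + hrefs[0]))
    return topics


print(topic_links(SAMPLE))
```

Taking only the first href per block plays the same role as the `break` in the original loop: each topic is visited once even though its link appears twice.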

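2. Open the topic page and download its HTML, for example:
http://pic.yxdown.com/html/5533.html#p=1
The line of code that saves the HTML locally can be commented out. On this page you have to click "View original" (查看原圖) to get the full-size image; right-clicking only lets you save the page's HTML.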
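The `#p=1` part of such a URL is a fragment, which browsers handle on the client side and do not send to the server. That is presumably why the script requests the URL without it. A small Python 3 sketch of building the request URL with the standard `urllib.parse` helpers follows; `LIST_PAGE` and `topic_request_url` are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin, urldefrag

# Assumed page on which the relative link was found.
LIST_PAGE = "http://pic.yxdown.com/list/0_0_3.html"


def topic_request_url(href, page_url=LIST_PAGE):
    """Resolve a relative href against the list page and drop any #fragment."""
    absolute = urljoin(page_url, href)
    url, _fragment = urldefrag(absolute)
    return url


print(topic_request_url("/html/5533.html"))
print(topic_request_url("/html/5533.html#p=1"))
```

Both calls yield the same request URL, so the page index after `#` only matters inside the browser's image viewer.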

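3. My original plan was to parse the URL behind "View original" (查看原圖) to download each image, the same way one follows the "next page" link on other sites.

But it turned out that browsing is implemented in JavaScript:
<a href="javascript:;" onclick="return false;" id="photoOriginal">查看原圖</a>
The page also stores every image URL inside a <script></script> block: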

<script>var images = [
{ "big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
  "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
  "original":"http://i-2.yxdown.com/2015/3/18/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
  "title":"","descript":"","id":75109},
{ "big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/fec26de9-8727-424a-b272-f2827669a320.jpg",
  "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/fec26de9-8727-424a-b272-f2827669a320.jpg",
  "original":"http://i-2.yxdown.com/2015/3/18/fec26de9-8727-424a-b272-f2827669a320.jpg",
  "title":"","descript":"","id":75110},
...
</script>

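Only the original image (original) is needed; thumb is the thumbnail and big is the large version. A regular expression extracts the URLs:
res_original = r'"original":"(.*?)"' # original image
m_original = re.findall(res_original,script)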
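As a self-contained illustration of this step, the Python 3 sketch below runs the same two regular expressions over an abridged copy of the script block shown earlier. The closing `];` in `PAGE` is an assumption about how the array ends, and `original_urls` is a hypothetical helper name.

```python
import re

# Abridged copy of the page's script block; the closing "];" is assumed.
PAGE = '''<script>var images = [
{ "big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
  "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
  "original":"http://i-2.yxdown.com/2015/3/18/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
  "title":"","descript":"","id":75109},
{ "big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/fec26de9-8727-424a-b272-f2827669a320.jpg",
  "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/fec26de9-8727-424a-b272-f2827669a320.jpg",
  "original":"http://i-2.yxdown.com/2015/3/18/fec26de9-8727-424a-b272-f2827669a320.jpg",
  "title":"","descript":"","id":75110}
];</script>'''

SCRIPT_RE = re.compile(r'<script>(.*?)</script>', re.S)
ORIGINAL_RE = re.compile(r'"original":"(.*?)"')


def original_urls(html):
    """Return the "original" image URLs found in script blocks, in page order."""
    urls = []
    for script in SCRIPT_RE.findall(html):
        for url in ORIGINAL_RE.findall(script):
            if url not in urls:  # skip duplicates, keep first occurrence
                urls.append(url)
    return urls


for url in original_urls(PAGE):
    print(url)
```

The non-greedy `(.*?)` stops at the first closing quote, so each match is exactly one URL rather than a span covering several fields.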
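4. The last step is downloading the images. I am not comfortable with threads yet, so I simply added a time.sleep(0.1) call between topics. Downloads may run into access restrictions like those on Wikipedia, in which case the client has to identify itself as a browser. The core code is: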

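import os
import urllib
# Identify as a browser; some sites reject the default Python user agent
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"
urllib._urlopener = AppURLopener()
url = "http://i-2.yxdown.com/2015/2/25/c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg"
filename = os.path.basename(url) # file name taken from the end of the URL
urllib.urlretrieve(url , filename) # save into the current directory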
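For readers on Python 3, where the `urllib` module is organized differently, a rough equivalent of the snippet above might look like this. It is a sketch under assumptions, not the author's code: `build_request` and `download` are hypothetical helpers, and the 30-second timeout is an arbitrary choice.

```python
import os
import urllib.request

USER_AGENT = "Mozilla/5.0"  # same browser-like value the article uses


def build_request(url):
    """Create a request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download(url, dest_dir="."):
    """Save the resource at url into dest_dir and return the local path."""
    path = os.path.join(dest_dir, os.path.basename(url))
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        with open(path, "wb") as out:
            out.write(resp.read())
    return path


URL = "http://i-2.yxdown.com/2015/2/25/c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg"
print(os.path.basename(URL))
# To actually fetch the file (requires network access): download(URL)
```

Setting the header per request avoids replacing a module-level opener, so other code using `urllib` is unaffected.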

I also created the local directory Picture3 and recorded the extracted URLs in a txt file.
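Creating the directory by hand can be avoided. The Python 3 sketch below creates the image directory if it is missing and appends URLs to a log file; `prepare_output` and `log_urls` are hypothetical helpers, and a temporary directory stands in for a real location such as E:\ so the example runs anywhere.

```python
import os
import tempfile


def prepare_output(root, image_dir="Picture3", log_name="yxdown_url.txt"):
    """Create root/image_dir if needed; return (image directory, log file path)."""
    target = os.path.join(root, image_dir)
    os.makedirs(target, exist_ok=True)  # no error if it already exists
    return target, os.path.join(root, log_name)


def log_urls(log_path, urls):
    """Append one URL per line to the log file."""
    with open(log_path, "a", encoding="utf-8") as log:
        for url in urls:
            log.write(url + "\n")


root = tempfile.mkdtemp()  # stand-in for the real download location
target, log_path = prepare_output(root)
log_urls(log_path, ["http://i-2.yxdown.com/2015/3/18/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg"])
print(os.path.isdir(target))
```

With `exist_ok=True` the call is safe to repeat on every run, which removes the "No such file or directory" failure mentioned in the code comments.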

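I hope this post is helpful. In short, it covers two things: how to analyze a page's source and extract specific URLs with regular expressions, and how to download images with Python. If anything here falls short, please bear with me!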

(By: Eastmount, 2015-3-20, 5 p.m. http://blog.csdn.net/eastmount/)
