Python爬小草1024圖片，蓋達爾的誘惑（urllib.request）

阿新 • • 發佈：2019-01-16

.com 設置最重要的 hand 開始 string color 能夠切片

項目說明：

Python版本：3.7.2

模塊：urllib.request，re，os，ssl

目標地址：http://小草.com/

第二個爬蟲項目，設備轉移到了Mac上，Mac上的Pycharm有坑，環境變量必須要配置好，解釋器要選對，不然模塊加載不出來

技術分享圖片

項目實現：

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
#__author__ = ‘vic‘
##導入模塊
import urllib.request,re,os

小草圖片下載時後有ssl證書驗證，我們全局跳過驗證

ssl._create_default_https_context = ssl._create_unverified_context

一、設置代理

小草服務器在海外，需要繞過GFW，代理軟件選擇的是ssX-NG，偏好設置查看監聽地址

技術分享圖片

Path = ‘/Users/vic/Pictures/‘
##設置代理,http和https都用的是http監聽，也可以改為sock5
proxy = urllib.request.ProxyHandler({‘http‘:‘http://127.0.0.1:1087/‘,‘https‘:‘https://127.0.0.1:1087‘})
##創建支持處理HTTP請求的opener對象
opener = urllib.request.build_opener(proxy)
##安裝代理到全局環境
urllib.request.install_opener(opener)
 
##定義請求頭,Upgrade-Insecure-Requests表示能夠處理https
header = {‘Upgrade-Insecure-Requests‘:‘1‘,"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15"}

二、獲取源代碼

def getcontent(url):
    req = urllib.request.Request(url,headers = header)
     
##同requests一樣，request轉為response
    res = urllib.request.urlopen(req)
    content = res.read()
    ##內存及時關閉
    res.close()
    return content

三、獲取列表鏈接

##鏈接最後為相應頁碼
url = ‘http://小草.com/thread0806.php?fid=16&search=&page=‘

分析文章鏈接，也就是http://小草.com/直接加上後綴即可，把所有顏色的鏈接全部扒下來

技術分享圖片

但是有個公告欄只在第一頁有，所以我想到了在第一頁把list切片

def geturl_list(url,i):
    ##列表鏈接+頁碼
    article_url = url + str(i)
    ##轉為字符串
    content = str(getcontent(article_url))
    ##創建正則模式對象,匹配全文鏈接
    pattern = re.compile(r‘<a href="htm_data.{0,30}html" target="_blank" id="">.*?‘)
    ##取出所有匹配內容
    com_list = pattern.findall(content)
    ##如果是第一頁，把公告欄鏈接切片
    if i == 1:
        com_list = com_list[7:]
    ##鏈接正則
    pattern_url = re.compile(r‘a href="(.*?)"‘)
    ##取出所有鏈接後綴
    url_list = pattern_url.findall(str(com_list))
    return url_list

四、獲取圖集信息

先找標題

技術分享圖片

這個簡單，re直接找title就好了

技術分享圖片

然後是圖片地址，圖片的後綴大多是JPG和少量的GIF，但是Python的格式好像太嚴格了？所以圖片格式分別大小寫，圖床地址全是https協議的，最重要的是有大圖片小圖片鏈接，大圖片下載是盜鏈，我解決不了，所以可以等差求奇數鏈接

def getTitle_Imgurl(url):
    content = getcontent(url)
    ##內容轉gbk
    string = content.decode(‘gbk‘, ‘replace‘)
    #print(string)
    m = re.findall("<title>.*</title>", string)
    ##切片去掉標題兩邊的標簽
    title = m[0][7:-35]
    ##圖片地址匹配正則，gif文件太大，我只要jpg格式的
    pattern = re.compile(r‘(https:[^\s]*?(JPG|jpg))‘)
    ##取出圖片地址,返回tuple添加到list裏，tuple結構為（網址，格式類型）
    Imgurl_list = pattern.findall(str(content))
    return title,Imgurl_list

五、下載函數

rllib.request.urlretrieve（）

下載也有坑，這個遠程下載在PC上好像可以直接使用，但是在mac上單文件鏈接可以下載，但是放進程序了死活下不下來，而且還慢，所以還是選擇傳統的

def downImg(url,path,count):
    try:
        req = urllib.request.Request(url, headers=header)
        res = urllib.request.urlopen(req)
        content = res.read()
        with open(path +  ‘/‘ + str(count) + ‘.jpg‘, ‘wb‘) as file:
            file.write(content)
            file.close()
    except:
        print(‘ERROR‘)

六、主函數

def main():
    ##1到20頁列表
    for i in range(1,20):
        ##第一頁文章列表
        url_list = geturl_list(url,i)
        ##文章地址拼接,list從0開始
        for t in range(0,len(url_list) - 1):
            artical_url = ‘http://小草.com/‘ + url_list[t]
            print(artical_url)
            ##取標題，圖片地址list
            title, Imgurl_list = getTitle_Imgurl(artical_url)
            ##創建文件夾
            Img_Path = Path + title
            if not os.path.isdir(Img_Path):
                os.mkdir(Img_Path)
                ##循環圖片地址,小圖片和大圖片通過取奇數解決,大圖片下載會得到盜鏈
                for num in range(1,len(Imgurl_list) - 1,2):
                    Imgurl = Imgurl_list[num][0]
                    downImg(Imgurl,Img_Path,num)
            else:
                print(‘已下載跳過‘)

七、全部代碼

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
#__author__ = ‘vic‘
import urllib.request,re,os,ssl
ssl._create_default_https_context = ssl._create_unverified_context
url=‘http://小草.com/thread0806.php?fid=16&search=&page=‘
Path = ‘/Users/vic/Pictures/‘
proxy = urllib.request.ProxyHandler({‘http‘:‘http://127.0.0.1:1087/‘,‘https‘:‘https://127.0.0.1:1087‘})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)
header = {‘Upgrade-Insecure-Requests‘:‘1‘,"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15"}

def getcontent(url):
    req = urllib.request.Request(url,headers = header)
    res = urllib.request.urlopen(req)
    content = res.read()#.decode(‘gbk‘,‘replace‘) 
    res.close()
    return content
def geturl_list(url,i):
    article_url = url + str(i)
    content = str(getcontent(article_url))
    pattern = re.compile(r‘<a href="htm_data.{0,30}html" target="_blank" id="">.*?‘)
    com_list = pattern.findall(content)
    if i == 1:
        com_list = com_list[7:]
    pattern_url = re.compile(r‘a href="(.*?)"‘)
    url_list = pattern_url.findall(str(com_list))
    return url_list
def getTitle_Imgurl(url):
    content = getcontent(url)
    string = content.decode(‘gbk‘, ‘replace‘)
    m = re.findall("<title>.*</title>", string)
    title = m[0][7:-35]
    pattern = re.compile(r‘(https:[^\s]*?(JPG))‘)
    Imgurl_list = pattern.findall(str(content))
    return title,Imgurl_list
def downImg(url,path,count):
    try:
        req = urllib.request.Request(url, headers=header)
        res = urllib.request.urlopen(req)
        content = res.read()
        with open(path +  ‘/‘ + str(count) + ‘.jpg‘, ‘wb‘) as file:
            file.write(content)
            file.close()
    except:
        print(‘ERROR‘)
def main():
    for i in range(1,20):
        url_list = geturl_list(url,i)
        for t in range(0,len(url_list) - 1):
            artical_url = ‘http://小草com/‘ + url_list[t]
            print(artical_url)
            title, Imgurl_list = getTitle_Imgurl(artical_url)
            Img_Path = Path + title
            if not os.path.isdir(Img_Path):
                os.mkdir(Img_Path)
                for num in range(1,len(Imgurl_list) - 1,2):
                    Imgurl = Imgurl_list[num][0]
                    downImg(Imgurl,Img_Path,num)
            else:
                print(‘已下載跳過‘)
if __name__ == ‘__main__‘:
    if not os.path.isdir(Path):
        os.mkdir(Path)
    main()

八、項目成果

技術分享圖片

文件名也是成等差了，有點尷尬，就這樣吧。

最後總的來說爬蟲，BeautifulSoup要比正則好用的多，requests也要比urllib.request簡單，搞了一晚上，等兩天再爬其他的

Python爬小草1024圖片，蓋達爾的誘惑（urllib.request）

.com 設置最重要的 hand 開始 string color 能夠切片項目說明： Python版本：3.7.2 模塊：urllib.request，re，os，ssl 目標地址：http://小草.com/ 第二個爬蟲項目，設備轉移到了Mac上，Mac上的Pych

Python爬小草1024圖片，蓋達爾的誘惑（urllib.request）

Python爬小草1024圖片，蓋達爾的誘惑（urllib.request）

利用Python爬取攝影網站圖片，切勿商用

Python爬取全站妹子圖片，差點硬碟走火了！

Cocos creator製作微信小遊戲儲存圖片，音訊檔案到本地（手機，瀏覽器）

微信小程式實現圖片懶載入的懶辦法（思路參考）

Python突破12306最後一道防線，實現自動搶票（附原始碼）

Python爬蟲之爬取內涵吧段子（urllib.request）

python爬取百度圖片---釋出exe小計編碼是個大坑

python爬取電影原始碼，小編以後看電影再也不用VIP了（有程式碼）

python，爬蟲爬取網頁的圖片，基礎改善版

python爬取微博圖片數據存到Mysql中遇到的各種坑python Mysql存儲圖片

python爬取百度圖片代碼

我用 Python 爬取微信好友，最後發現一個大秘密

Python爬取全書網小說，免費看小說

分手後，小夥怒用Python爬取上萬空姐照片，贏取校花選舉大賽！

把python爬出來的數據，用pymysql插入數據庫中

Python爬取抖音APP，竟然只需要十行程式碼

Python 爬取百度圖片的高清原圖

Python爬取網頁的圖片資料

【python】重新命名移動圖片，rename file 不管是什麼格式的檔案

Python爬小草1024圖片，蓋達爾的誘惑（urllib.request）

相關推薦