Brother Cong Teaches You Python: How to Scrape Pictures of Beautiful Women

阿新 • Published: 2018-11-07

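Today's topic is Python. Right now Python is most popular in artificial intelligence and data analysis, and data analysis is what we will talk about here. So what is data analysis?

Put simply: you take data you already have, analyze it, and draw a conclusion. That is data analysis.

Today I, Brother Cong, will use a simple crawler example to teach you how to scrape pictures of beautiful women. Before that, though, I have a few odds and ends to get through.

This tutorial assumes some Python basics, an understanding of the TCP/IP protocols, and some experience with browser debugging or packet capture.

Of course, the most important thing is a willingness to learn and a drive to get better.

Desire works too. I once read a book called Sapiens: A Brief History of Humankind (《人類簡史》). I did not read it very deeply, but it left me with a bold thesis: the reason humans evolved and made it to the present comes down to one word, "desire". That may have been going on for millions of years; then again, experts' conclusions are not always reliable, and this world is full of unknowns.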

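If you are not yet familiar with Python, I recommend Liao Xuefeng's Python tutorial: https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000

It combines worked examples with theory.

My suggested strategy for working through it is reading plus practice.

Reading plus practice suits people who already have some programming background, for example C/C++, or PHP, which is hailed as the most powerful language in the world. A programming background helps a great deal. Even more important, though, is interest. Some master once said that interest is the best teacher. I think that if you want to go far on the technical path, interest is a key factor.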

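You can look at this interest from several angles.

Maybe you genuinely love the language from the bottom of your heart. Maybe you have had enough of PHP's perverse syntax and find Python refreshingly approachable.

Or maybe you have had enough of the many hard-to-tame features of C, for instance that procedural (structured) programming feels less practical than object-oriented programming. "Structured": the word alone is off-putting. What fun is a structure? An object is far more tangible. Just hearing the word "object" feels great.

Or maybe you are interested in something specific. To put it bluntly: you are really into pictures of beautiful women and cannot sleep without looking at some every day. I had a classmate like that. Rather than painstakingly searching everywhere each day, why not write a crawler, scrape pictures in bulk, upload them to Baidu Cloud or another storage service, and look whenever you want? See, if someone like that redirected this interest into learning, I won't promise a great career, but an annual salary of a million would be no dream.

Now to the main topic (time to paste code, bang bang bang; a little humor, since some master named YOU-something once said that a person with no sense of humor is a frightening thing):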

test001.py (single process)

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os
import sys
'''
# Python 2 only; the original note says this is needed on Android
reload(sys)
sys.setdefaultencoding('utf-8')
'''

if os.name == 'nt':
    print('You are on Windows')
else:
    print('You are on a non-Windows platform (e.g. Linux)')

# HTTP request headers
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
all_url = 'http://www.mzitu.com'
start_html = requests.get(all_url, headers=header)

# Where the pictures are saved
path = 'D:/test/'

# Find the highest page number of the listing
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text  # second-to-last link; the last one is presumably "next page"


same_url = 'http://www.mzitu.com/page/'
for n in range(1, int(max_page) + 1):
    ul = same_url + str(n)
    start_html = requests.get(ul, headers=header)
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
    for a in all_a:
        title = a.get_text()  # link text, used as the album title
        if title != '':
            print("Preparing to scrape: " + title)

            # Windows cannot create a directory whose name contains '?'
            folder = path + title.strip().replace('?', '')
            if os.path.exists(folder):
                flag = 1  # directory already exists
            else:
                os.makedirs(folder)
                flag = 0
            os.chdir(folder)
            href = a['href']
            html = requests.get(href, headers=header)
            mess = BeautifulSoup(html.text, "html.parser")
            pic_max = mess.find_all('span')
            pic_max = pic_max[10].text  # number of pages in this album (one picture per page)
            if flag == 1 and len(os.listdir(folder)) >= int(pic_max):
                print('Already saved, skipping')
                continue
            for num in range(1, int(pic_max) + 1):
                pic = href + '/' + str(num)
                html = requests.get(pic, headers=header)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                html = requests.get(pic_url['src'], headers=header)
                file_name = pic_url['src'].split('/')[-1]
                # Written into the album folder we chdir'd into above
                with open(file_name, 'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page', n, 'done')
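A side note on the page-count lookup. Both page[-2] and pic_max[10] in the script depend on an element sitting at an exact position in the page, so a small layout change breaks them. Here is a rough sketch of a less position-dependent way to get the highest page number: collect every link with the page-numbers class and take the largest numeric label. To keep the sketch runnable without any third-party install it uses the standard library's html.parser instead of BeautifulSoup, which is my substitution and not what the article's scripts do. The SAMPLE markup and the names PageNumberParser and find_max_page are made up for illustration; the real site's HTML may look different.

```python
from html.parser import HTMLParser


class PageNumberParser(HTMLParser):
    """Collect the text of every <a> tag whose class list contains 'page-numbers'."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "page-numbers" in classes:
            self._inside = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.labels[-1] += data


def find_max_page(html):
    """Return the largest numeric pagination label, or 1 if there is none."""
    parser = PageNumberParser()
    parser.feed(html)
    numbers = [int(label.strip()) for label in parser.labels
               if label.strip().isdecimal()]
    return max(numbers) if numbers else 1


# Hypothetical pagination markup, loosely modelled on what the script expects.
SAMPLE = """
<nav>
  <span class="page-numbers current">1</span>
  <a class="page-numbers" href="/page/2/">2</a>
  <a class="page-numbers" href="/page/3/">3</a>
  <span class="page-numbers dots">...</span>
  <a class="page-numbers" href="/page/207/">207</a>
  <a class="next page-numbers" href="/page/2/">Next</a>
</nav>
"""

print(find_max_page(SAMPLE))  # 207
```

The "Next" link is skipped because its label is not a number, so there is no need to count from the end of the list.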

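If the script fails with an error saying the requests module is not installed, you can install it with pip install requests. The scripts also import bs4, which is installed with pip install beautifulsoup4. pip is a package manager, and in that respect it resembles npm in Node.js. Put another way, pip works much like installing software with sudo apt-get install on Ubuntu. Exactly how they differ is not the point here. What I want to pass on is a piece of IT philosophy: however much technology changes, if you grasp its essence you can meet all that change with a constant approach.

Of course, meeting change with constancy does not mean you stop learning. Learning is something you have to do your whole life. A boy growing into a man is a kind of learning too.

Learning is everywhere; work that out for yourselves.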
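Back to the missing-module error for a moment. Before running any of the scripts you can check which third-party modules are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is my own, not something from the scripts.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "requests" and "bs4" are the two third-party imports used by the scripts.
    missing = missing_packages(["requests", "bs4"])
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Try: pip install requests beautifulsoup4")
    else:
        print("All dependencies are importable.")
```

Note that the import name and the pip package name can differ: the module is bs4, the package is beautifulsoup4.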

test002.py (multi-process)

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os
from multiprocessing import Pool
import sys


def find_MaxPage():
    # Uses the module-level `header` defined in the __main__ block below
    all_url = 'http://www.mzitu.com'
    start_html = requests.get(all_url, headers=header)
    # Find the highest page number of the listing
    soup = BeautifulSoup(start_html.text, "html.parser")
    page = soup.find_all('a', class_='page-numbers')
    max_page = page[-2].text
    return max_page

def Download(href, header, title, path):
    html = requests.get(href, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    pic_max = soup.find_all('span')
    pic_max = pic_max[10].text  # number of pages in this album (one picture per page)
    folder = path + title.strip().replace('?', '')
    if os.path.exists(folder) and len(os.listdir(folder)) >= int(pic_max):
        print('Already complete, skipping ' + title)
        return 1
    print("Start scraping: " + title)
    # exist_ok=True: a partially downloaded album already has its folder,
    # and a plain makedirs would raise inside the worker
    os.makedirs(folder, exist_ok=True)
    os.chdir(folder)
    for num in range(1, int(pic_max) + 1):
        pic = href + '/' + str(num)
        #print(pic)
        html = requests.get(pic, headers=header)
        mess = BeautifulSoup(html.text, "html.parser")
        pic_url = mess.find('img', alt=title)
        html = requests.get(pic_url['src'], headers=header)
        file_name = pic_url['src'].split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(html.content)
    print('Done: ' + title)

def download(href, header, title):
    # Debug helper (not called below): print the page count of one album
    html = requests.get(href, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    pic_max = soup.find_all('span')
    #for j in pic_max:
        #print(j.text)
    #print(len(pic_max))
    pic_max = pic_max[10].text  # number of pages in this album
    print(pic_max)


'''
# Python 2 only; the original note says this is needed on Android
reload(sys)
sys.setdefaultencoding('utf-8')
'''


if __name__ == '__main__':
    if os.name == 'nt':
        print('You are on Windows')
    else:
        print('You are on a non-Windows platform (e.g. Linux)')

    # HTTP request headers
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
    path = 'D:/test/'
    max_page = find_MaxPage()
    same_url = 'http://www.mzitu.com/page/'

    # Number of worker processes in the pool
    pool = Pool(5)
    for n in range(1, int(max_page) + 1):
        each_url = same_url + str(n)
        start_html = requests.get(each_url, headers=header)
        soup = BeautifulSoup(start_html.text, "html.parser")
        all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
        for a in all_a:
            title = a.get_text()  # link text, used as the album title
            if title != '':
                href = a['href']
                pool.apply_async(Download, args=(href, header, title, path))
    pool.close()
    pool.join()
    print('All pictures downloaded')
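One thing to watch in the multi-process script: it throws away the value returned by pool.apply_async, so if a worker raises an exception nothing is printed and the album is silently skipped. A possible way around this, sketched below, is to keep each AsyncResult and call get() on it, which re-raises the worker's exception in the parent. To keep the sketch runnable anywhere I use multiprocessing.pool.ThreadPool, which offers the same apply_async/get interface as Pool but runs threads in the current process; with a real Pool the task function must be defined at module level and the pool created under the __main__ guard. fake_download and run_all are hypothetical stand-ins, not part of the scripts.

```python
from multiprocessing.pool import ThreadPool


def fake_download(title):
    """Hypothetical stand-in for Download(); fails on purpose for one title."""
    if title == "bad":
        raise ValueError("simulated failure for " + title)
    return title.upper()


def run_all(titles, workers=5):
    """Submit every title, wait for all of them, and return (results, errors)."""
    results, errors = {}, {}
    with ThreadPool(workers) as pool:
        # Keep the AsyncResult handles instead of discarding them.
        pending = {title: pool.apply_async(fake_download, args=(title,))
                   for title in titles}
        for title, handle in pending.items():
            try:
                # get() blocks until the task ends and re-raises its exception.
                results[title] = handle.get(timeout=30)
            except Exception as exc:
                errors[title] = exc
    return results, errors


results, errors = run_all(["a", "bad", "c"])
print(results)          # {'a': 'A', 'c': 'C'}
print(sorted(errors))   # ['bad']
```

Since downloading is mostly waiting on the network, a thread pool may well be enough here, but that is a judgment call and not something the article measures.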

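After the first script finishes you will be puzzled: why can't any of the downloaded pictures be opened? The files are right there, yet you cannot see anything, which is frustrating.

Then you notice there is a second script, so you run it, only to find that apart from running in a single process versus multiple processes, the two scripts are no different.

The reason is simple: you have overlooked one thing, hotlink protection. You can think of hotlink protection as a kind of anti-crawler measure: typically the image server looks at the Referer header of each request and refuses to serve the real picture unless the request appears to come from the site's own pages. Crawlers were hugely popular a few years ago, and quite a few people reached financial freedom through them. But as more and more people learned to write crawlers, websites, not being fools, stopped letting themselves be led by the nose; some anti-hotlinking strategy is a must.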
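To make the idea concrete, here is a toy model of a Referer-based hotlink check, written with the standard library only. It is an assumption about how such a check commonly works, not a description of what this particular site does: the server accepts a request only if the host in its Referer header is on an allow list. The names headers_for, server_would_allow and ALLOWED are invented for this sketch.

```python
from urllib.parse import urlsplit


def headers_for(referer_url, user_agent="Mozilla/5.0"):
    """Build request headers that include a Referer."""
    return {"User-Agent": user_agent, "Referer": referer_url}


def server_would_allow(headers, allowed_hosts):
    """Toy hotlink check: accept only requests whose Referer host is allowed."""
    referer = headers.get("Referer", "")
    host = urlsplit(referer).hostname or ""
    return host in allowed_hosts


# Assumed allow list; the real server's rule is unknown.
ALLOWED = {"www.mzitu.com"}

with_referer = headers_for("http://www.mzitu.com/1234/1")
without_referer = {"User-Agent": "Mozilla/5.0"}

print(server_would_allow(with_referer, ALLOWED))     # True
print(server_would_allow(without_referer, ALLOWED))  # False
```

The first two scripts send only a User-Agent, which corresponds to the second case: the request goes through, but what comes back is not the picture you wanted.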

Run the last script below and, once it finishes, you can enjoy the results to your heart's content.

test003.py

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os

all_url = 'http://www.mzitu.com'


# HTTP request headers used for the site's HTML pages
Hostreferer = {
    'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer':'http://www.mzitu.com'
}
# Headers used for the image files; this Referer is what gets past the hotlink protection
Picreferer = {
    'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer':'http://i.meizitu.net'
}

start_html = requests.get(all_url, headers=Hostreferer)

# Where the pictures are saved
path = 'D:/test/'

# Find the highest page number of the listing
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text


same_url = 'http://www.mzitu.com/page/'
for n in range(1, int(max_page) + 1):
    ul = same_url + str(n)
    start_html = requests.get(ul, headers=Hostreferer)
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
    for a in all_a:
        title = a.get_text()  # link text, used as the album title
        if title != '':
            print("Preparing to scrape: " + title)

            # Windows cannot create a directory whose name contains '?'
            folder = path + title.strip().replace('?', '')
            if os.path.exists(folder):
                flag = 1  # directory already exists
            else:
                os.makedirs(folder)
                flag = 0
            os.chdir(folder)
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")
            pic_max = mess.find_all('span')
            pic_max = pic_max[10].text  # number of pages in this album (one picture per page)
            if flag == 1 and len(os.listdir(folder)) >= int(pic_max):
                print('Already saved, skipping')
                continue
            for num in range(1, int(pic_max) + 1):
                pic = href + '/' + str(num)
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                print(pic_url['src'])
                #exit(0)
                # Fetch the image itself with the image Referer
                html = requests.get(pic_url['src'], headers=Picreferer)
                file_name = pic_url['src'].split('/')[-1]
                with open(file_name, 'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page', n, 'done')
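A small robustness note on paths. The scripts strip only ? from the album title before using it as a directory name, and they take the last /-separated piece of the image URL as the file name. Windows rejects several other characters in names as well, and a URL may carry a query string. Below is a sketch of two helpers that cover those cases; safe_dirname and filename_from_url are names I made up, and the sample URL is illustrative only. The sketch does not handle reserved Windows device names such as CON.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit

# Characters Windows does not allow in file or directory names,
# plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_dirname(title):
    """Remove characters that are invalid in Windows names; never return ''."""
    cleaned = _FORBIDDEN.sub("", title).strip()
    # Windows also rejects names that end with a dot or a space.
    cleaned = cleaned.rstrip(". ")
    return cleaned or "untitled"


def filename_from_url(url):
    """Take the file name from the URL path, ignoring any query string."""
    path = urlsplit(url).path
    return posixpath.basename(unquote(path)) or "index"


print(safe_dirname('What? A "title": 1/2 '))
print(filename_from_url("http://i.meizitu.net/2018/11/07a01.jpg?v=2"))
```

With these, the folder variable in the script could be built as path + safe_dirname(title), though the alt=title lookup must keep using the original, unmodified title.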

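The final result is shown in the figure:

Summary: one last point. The result is not what matters most; what matters most is what you learned along the way.

In a word, learn with a clear goal and you will learn faster. The figure above is just one example. I hope it sparks enthusiasm for learning among IT folks and makes everyone's path in IT smoother. If it does, then I, Brother Cong, will consider it worth the effort.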
