
Python: Scraping Meizitu (Single-Threaded and Multi-Threaded Versions)

1. Reference Article

    Python Crawler: Scraping Meizi Images

    The code in the article above is explained very clearly, and my overall approach follows it. The code in this post only adds some exception handling and tidies up the log output; I am writing it up mainly as a note for future reference. The changes are as follows:

1. Exception handling: the added checks are called out with comments in the code below.

2. The multi-threaded version uses the multiprocessing library, so freeze_support() must be called at the start of main; otherwise creating worker processes can fail after the script is packaged into an exe.

3. The multi-threaded version adds a command-line option for choosing the number of workers (see the sketch after this list).
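A minimal sketch of the pattern behind changes 2 and 3, before the full script: freeze_support() runs first inside the __main__ guard, and the pool size comes from an optional command-line argument. Here worker() is just a placeholder task, not part of the original script:

import sys
from multiprocessing import Pool, freeze_support

def worker(n):
    return n * n  # placeholder task

if __name__ == '__main__':
    freeze_support()  # a no-op in normal runs; required when frozen into a Windows exe
    count = int(sys.argv[1]) if len(sys.argv) >= 2 else 1
    with Pool(count) as pool:
        print(pool.map(worker, range(8)))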

2. Single-Threaded Version

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os

all_url = 'http://www.mzitu.com'

# HTTP request headers
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# This Referer defeats the image host's hotlink protection
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

start_html = requests.get(all_url, headers=Hostreferer)

# Save location
path = os.getcwd() + '/mzitu/'

# Find the highest list-page number
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text

same_url = 'http://www.mzitu.com/page/'
for n in range(1, int(max_page) + 1):  # iterate over the list pages (page numbers start at 1)
    ul = same_url + str(n)
    start_html = requests.get(ul, headers=Hostreferer)
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
    for a in all_a:  # every gallery on this page
        title = a.get_text()  # extract the gallery title
        if title != '':
            print("Preparing to scrape: " + title)

            # Windows cannot create a directory whose name contains '?'
            dir_name = path + title.strip().replace('?', '')
            if os.path.exists(dir_name):
                flag = 1
            else:
                os.makedirs(dir_name)
                flag = 0
            os.chdir(dir_name)
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")
            pic_max = mess.find_all('span')
            pic_max = pic_max[10].text  # number of photos in the gallery
            if flag == 1 and len(os.listdir(dir_name)) >= int(pic_max):
                print('Already saved, skipping')
                continue
            for num in range(1, int(pic_max) + 1):  # every photo in the gallery
                pic = href + '/' + str(num)
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                # added exception handling: some <img> tags lack a src attribute,
                # which would otherwise raise an error, so filter them out
                if pic_url is None or 'src' not in pic_url.attrs:
                    continue
                print(pic_url['src'])
                html = requests.get(pic_url['src'], headers=Picreferer)
                file_name = pic_url['src'].split('/')[-1]
                f = open(file_name, 'wb')
                f.write(html.content)
                f.close()
            print('Done')
    print('Page', n, 'done')

3. Multi-Threaded Version

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os
from multiprocessing import Pool
from multiprocessing import freeze_support
import sys

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36',
    'Referer': 'http://www.mzitu.com'
}
# This Referer defeats the image host's hotlink protection
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

def find_MaxPage():
    all_url = 'http://www.mzitu.com'
    start_html = requests.get(all_url, headers=header)
    # Find the highest list-page number
    soup = BeautifulSoup(start_html.text, "html.parser")
    page = soup.find_all('a', class_='page-numbers')
    max_page = page[-2].text
    return max_page

def Download(href, title, path):
    html = requests.get(href, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    pic_max = soup.find_all('span')
    pic_max = pic_max[10].text  # number of photos in this gallery
    # Windows cannot create a directory whose name contains '?'
    dir_name = path + title.strip().replace('?', '')
    if (os.path.exists(dir_name)
            and len(os.listdir(dir_name)) >= int(pic_max)):
        print('Already downloaded, moving on to the next gallery: ' + title)
        return 1
    print(f"Found {pic_max} photos, preparing: " + title)
    os.makedirs(dir_name, exist_ok=True)  # exist_ok avoids FileExistsError when resuming a partial gallery
    os.chdir(dir_name)
    for num in range(1, int(pic_max) + 1):
        pic = href + '/' + str(num)
        html = requests.get(pic, headers=header)
        mess = BeautifulSoup(html.text, "html.parser")
        pic_url = mess.find('img', alt=title)
        # added exception handling: some <img> tags lack a src attribute,
        # which would otherwise raise an error, so filter them out
        if pic_url is None or 'src' not in pic_url.attrs:
            continue
        print(f"{title}: {pic_url['src']}")
        # use the anti-hotlink header for the image request itself
        html = requests.get(pic_url['src'], headers=Picreferer)
        file_name = pic_url['src'].split('/')[-1]
        f = open(file_name, 'wb')
        f.write(html.content)
        f.close()
    print('Gallery ready, enjoy: ' + title)

if __name__ == '__main__':
    freeze_support()  # prevents process creation from failing when packaged as an exe

    # number of worker processes in the pool
    count = 1
    if len(sys.argv) >= 2:
        count = int(sys.argv[1])

    pool = Pool(count)
    print(f'Initialized {count} download workers')

    path = os.getcwd() + '/mzitu_mutil/'
    max_page = find_MaxPage()  # number of list pages, i.e. how many folders will be created
    print(f'Found {max_page} pages, please wait for the downloads to finish')
    same_url = 'http://www.mzitu.com/page/'

    for n in range(1, int(max_page) + 1):
        each_url = same_url + str(n)
        start_html = requests.get(each_url, headers=header)  # request one list page of galleries
        soup = BeautifulSoup(start_html.text, "html.parser")
        all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
        for a in all_a:  # iterate over the galleries on this page
            title = a.get_text()  # extract the gallery title
            if title != '':
                href = a['href']  # link to the gallery
                pool.apply_async(Download, args=(href, title, path))
    pool.close()
    pool.join()
    print('All galleries are ready, enjoy')
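One caveat with apply_async: if Download() raises inside a worker, the exception is silently discarded unless the AsyncResult is checked. A small, hedged adjustment using the standard error_callback parameter of Pool.apply_async would surface those failures in the main process (log_error is a hypothetical name; this replaces the pool.apply_async call in the script above):

def log_error(exc):
    # called in the main process whenever a worker task raises
    print(f'Download failed: {exc}')

pool.apply_async(Download, args=(href, title, path), error_callback=log_error)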

4. Resource Download

  Download link: Python Scraping Meizitu (Single-Threaded and Multi-Threaded Versions)

Reprint notice: unless otherwise stated, all articles on this site are original and copyrighted. Please credit 朝十晚八 when reprinting.