學會用python網路爬蟲爬取鬥圖網的表情包,聊微信再也不怕鬥圖了
阿新 • • 發佈:2019-01-06
最近總是有人跟我鬥圖,想了想17年中旬時在網上看過一篇關於爬取鬥圖網表情包的py程式碼,但是剛想爬的時候發現網頁結構發生了變化,而且鬥圖網還插入了很多廣告,變化其實挺大的,所以臨時寫了一個爬蟲,簡單的爬取了鬥圖網的表情包。從這連結上看,page表示的是第幾頁,我只爬取了500多頁(很奇怪白天明明看到一共有一千多頁的,為啥晚上就只有548頁?),純屬娛樂,表情包夠用就行。
重點還是在於解析網頁,頁面上每一欄都是一組圖,這組圖有一個連結指向,所以我只要提取到這個連結,再開啟這個連結,然後在新的網頁上提取表情圖片,下載下來就行了。解析網頁使用了python的xpath,剩下的就是數學思維了,迴圈巢狀和判斷什麼的。
原始碼截圖如下(使用的是python3):
# coding=utf8
# NOTE(review): the original carried a Python 2 `reload(sys)` /
# `sys.setdefaultencoding('utf-8')` hack here. Under Python 3 (which this
# script targets) the default encoding is already utf-8, so the guard never
# fired — and `reload` is not a builtin in Python 3, so the block would have
# raised NameError had it ever run. It was dead/broken code and is removed.
import sys
import time
from urllib import request

import requests
from bs4 import BeautifulSoup
from lxml import etree
# Browser-like request headers so doutula.com serves the normal page markup.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
    "Host": "www.doutula.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
def get_link(page):
    """Return the article URLs listed on page *page* of doutula's list view.

    Articles appear under two anchor class variants (promoted "tg-article"
    entries and regular ones); both are collected, promoted links first,
    matching the original ordering.
    """
    url = "http://www.doutula.com/article/list/?page=" + str(page)
    response = requests.get(url, headers=headers)
    selector = etree.HTML(response.text)
    # xpath() already returns plain lists of hrefs — concatenate them directly
    # instead of copying element by element.
    link_list = selector.xpath('//a[@class="list-group-item random_list tg-article"]/@href')
    link_list += selector.xpath('//a[@class="list-group-item random_list"]/@href')
    return link_list
j = 1  # global running counter used to number the saved image files


def get_img(link_list):
    """Download every expression image from each article page in *link_list*.

    Images are saved to F:\\image\\<j>.<ext> (directory must already exist —
    TODO confirm), where <j> is the global counter and <ext> is gif/png/jpg
    based on the URL suffix.

    Fixes over the original:
    - ``//table[%d]`` matches tables that are the i-th table child of *their
      parent*, not the i-th table in the document; ``(//table)[%d]`` is the
      correct document-positional form.
    - The collected URL list was accumulated across pages and the download
      loop ran inside the per-page loop, so earlier pages' images were
      re-downloaded for every later page; downloads now happen once per page.
    - Dropped a try/except around xpath() whose increment sat inside the try
      (an exception would have looped forever on the same index).
    """
    global j
    for url in link_list:
        response = requests.get(url, headers=headers)
        time.sleep(1)  # throttle: be polite to the server
        html = response.text
        selector = etree.HTML(html)
        # BeautifulSoup is used only to count the tables on the page.
        table_count = len(BeautifulSoup(html, "lxml").find_all("table"))
        for i in range(1, table_count + 1):
            for image_link in selector.xpath("(//table)[%d]/tbody[1]/tr[1]/td[1]/a/img/@src" % i):
                print(image_link)
                if image_link.endswith("gif"):
                    ext = "gif"
                elif image_link.endswith("png"):
                    ext = "png"
                else:
                    ext = "jpg"
                request.urlretrieve(image_link, 'F:\\image\\%s.%s' % (j, ext))
                time.sleep(1)  # throttle between downloads
                j += 1
# Crawl list pages 1 through 4; a failure on one page is reported and skipped
# so the remaining pages are still processed.
for page_number in range(1, 5):
    try:
        links = get_link(page_number)
        get_img(links)
    except Exception as error:
        print(str(error))
後面覺得直接使用數字名字的效果不太好,還是需要給圖片命個名字,這樣好搜尋自己需要什麼樣子的表情。
所以修改了原始碼。先上效果圖:
原始碼如下:
# coding=utf8
# NOTE(review): the original carried a Python 2 `reload(sys)` /
# `sys.setdefaultencoding('utf-8')` hack here. Under Python 3 (which this
# script targets) the default encoding is already utf-8, so the guard never
# fired — and `reload` is not a builtin in Python 3, so the block would have
# raised NameError had it ever run. It was dead/broken code and is removed.
import sys
import time
from urllib import request

import requests
from bs4 import BeautifulSoup
from lxml import etree
# Browser-like request headers so doutula.com serves the normal page markup.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
    "Host": "www.doutula.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
def get_link(page):
    """Return the article URLs listed on page *page* of doutula's list view.

    Articles appear under two anchor class variants (promoted "tg-article"
    entries and regular ones); both are collected, promoted links first,
    matching the original ordering.
    """
    url = "http://www.doutula.com/article/list/?page=" + str(page)
    response = requests.get(url, headers=headers)
    selector = etree.HTML(response.text)
    # xpath() already returns plain lists of hrefs — concatenate them directly
    # instead of copying element by element.
    link_list = selector.xpath('//a[@class="list-group-item random_list tg-article"]/@href')
    link_list += selector.xpath('//a[@class="list-group-item random_list"]/@href')
    return link_list
j = 1  # global running counter, used in the progress message


def get_img(link_list):
    """Download every expression image from each article page in *link_list*,
    naming each file after the image's ``alt`` text.

    Images are saved to F:\\image\\<alt>.<ext> (directory must already exist —
    TODO confirm; alt text containing characters illegal in Windows filenames
    will still make urlretrieve fail — NOTE(review): consider sanitizing).

    Fixes over the original:
    - ``//table[%d]`` matches tables that are the i-th table child of *their
      parent*, not the i-th table in the document; ``(//table)[%d]`` is the
      correct document-positional form.
    - ``img_url.index(img)`` / ``img.index(image_link)`` return the *first*
      occurrence, so duplicate pages or duplicate image URLs were paired with
      the wrong alt names (and each lookup was O(n)). Pairing src with alt via
      ``zip`` per table removes both problems, and guards against src/alt
      lists of different lengths.
    - URL lists were accumulated across pages with the download loop inside
      the per-page loop, re-downloading earlier pages' images for every later
      page; downloads now happen once per page.
    """
    global j
    for url in link_list:
        response = requests.get(url, headers=headers)
        time.sleep(1)  # throttle: be polite to the server
        html = response.text
        selector = etree.HTML(html)
        # BeautifulSoup is used only to count the tables on the page.
        table_count = len(BeautifulSoup(html, "lxml").find_all("table"))
        for i in range(1, table_count + 1):
            srcs = selector.xpath("(//table)[%d]/tbody[1]/tr[1]/td[1]/a/img/@src" % i)
            alts = selector.xpath("(//table)[%d]/tbody[1]/tr[1]/td[1]/a/img/@alt" % i)
            for image_link, image_name in zip(srcs, alts):
                print("下載第%d張表情包:" % j + image_link)
                if image_link.endswith("gif"):
                    ext = "gif"
                elif image_link.endswith("png"):
                    ext = "png"
                else:
                    ext = "jpg"
                request.urlretrieve(image_link, 'F:\\image\\%s.%s' % (image_name, ext))
                time.sleep(1)  # throttle between downloads
                j += 1
# Crawl list pages 4 through 10; a failure on one page is reported and skipped
# so the remaining pages are still processed.
for page_number in range(4, 11):
    try:
        links = get_link(page_number)
        get_img(links)
    except Exception as error:
        print(str(error))