Crawler 1.1: Scraping images from Doutula (on the open function and urlretrieve)
Posted by axin · 2019-01-07
This post just records a tiny bit of progress I made as a newbie ٩꒰▽ ꒱۶⁼³₌₃
Please don't laugh at me ⁄(⁄ ⁄•⁄ω⁄•⁄ ⁄)⁄
I'll just paste the code directly; I won't get very technical.
1. Create a project with:
scrapy startproject <project_name>
Example:
scrapy startproject myproject
On success the project is created, and the folder layout looks like this:
myproject/
    scrapy.cfg            # the project's configuration file
    myproject/            # the project's Python module; your code goes here
        __init__.py
        items.py          # the project's item definitions
        pipelines.py      # the project's pipelines
        settings.py       # the project's settings
        spiders/          # directory for spider code
            __init__.py
            ...
The listing above is copied from https://www.cnblogs.com/pachongshou/p/6125858.html
2. Define the target — items.py
import scrapy


class Doutu2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()
    name = scrapy.Field()
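A scrapy.Item supports dict-style key access, which is how the spider below fills it in. Here is a tiny sketch of that access pattern, using a plain dict to stand in for Doutu2Item so it runs without Scrapy installed (the URL and name are hypothetical):

```python
# Plain dict standing in for Doutu2Item; the real spider does
# item = Doutu2Item() and then assigns by key exactly like this.
item = {}
item["img_url"] = "http://doutula.com/example.gif"  # hypothetical URL
item["name"] = "example meme"
print(sorted(item.keys()))
```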
3. Build the spider — crawl first, then extract — spider.py
# -*- coding: utf-8 -*-
import scrapy
from ..items import Doutu2Item
from pyquery import PyQuery as pq
import os
import requests
from urllib import request
import re


class DoutuSpider(scrapy.Spider):
    name = 'doutu'
    allowed_domains = ['doutula.com']
    start_urls = ['http://doutula.com/photo/list/?page={}'.format(i) for i in range(1, 3)]

    def parse(self, response):
        jpy = pq(response.text)  # I use PyQuery here
        Zurl = jpy('#pic-detail > div > div.col-sm-9 > div.random_picture > ul > li > div > div>a').items()
        i = 0
        for it in Zurl:  # iterate over Zurl
            print(it.text())
            # instantiate an item object for storage
            item = Doutu2Item()
            # PyQuery reads attributes with attr()
            # below we grab the url for both gifs and jpgs
            item['img_url'] = it('img').attr('data-original')
            item['name'] = it('p').text()
            if not item['img_url']:
                item['img_url'] = it('img').eq(1).attr('data-original')
            print(item['img_url'])
            i += 1
            # if os.path.exists('鬥圖'):
            #     print('folder already exists')
            # else:
            #     os.makedirs('鬥圖')
            #     print('folder created')
            if not os.path.exists('doutu'):
                print('creating folder: {}'.format('doutu'))
                os.mkdir('doutu')
            if not os.path.exists('pic'):
                print('creating folder: {}'.format('pic'))
                os.mkdir('pic')
            # regex to replace special characters in the name
            rstr = r"[\/\\\:\*\?\"\<\>\|]"  # '/ \ : * ? " < > |'
            new_title = re.sub(rstr, "_", item['name'])  # replace them with underscores
            # first way to save: I fumbled the file path a few times,
            # so a relative path is the safer choice
            with open('pic/%s.jpg' % new_title, 'wb') as f:
                f.write(requests.get(item['img_url']).content)
            # second way to save
            try:
                request.urlretrieve(item['img_url'], 'doutu\%s.gif' % new_title)
            except:
                pass
            print(i)
            print('__________________________________________________')
            yield item
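The two save methods in the spider differ only in who writes the file: with open() you fetch the bytes yourself and write them, while urlretrieve() downloads and writes in one call. A minimal sketch of the contrast, using a local file:// URL in place of the image URL (and urlopen in place of requests.get) so it runs without network access:

```python
import os
import tempfile
from urllib import request

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "fake_img.gif")
with open(src, "wb") as f:
    f.write(b"GIF89a demo bytes")  # pretend these are image bytes
url = "file:///" + src.replace("\\", "/").lstrip("/")

# method 1: fetch the bytes yourself, then open() the target and write them
with request.urlopen(url) as resp:
    data = resp.read()
dst1 = os.path.join(tmp, "copy1.gif")
with open(dst1, "wb") as f:
    f.write(data)

# method 2: urlretrieve does the fetch and the file write in one call
dst2 = os.path.join(tmp, "copy2.gif")
request.urlretrieve(url, dst2)

with open(dst1, "rb") as f1, open(dst2, "rb") as f2:
    print(f1.read() == f2.read())  # both copies hold the same bytes
```

Either way works; urlretrieve is shorter, but reading the bytes yourself lets you check the response before writing anything to disk.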
4. Process the items the spider extracts — pipelines.py
from scrapy.exceptions import DropItem
from scrapy import log
import json
from pymongo import MongoClient
from scrapy.conf import settings


class Doutu2Pipeline(object):
    # the initializer:
    # set up the database connection here
    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        print("entered the database")
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem('Missing {}'.format(data))
        if valid:
            log.msg('saved to the database', level=log.DEBUG, spider=spider)
        return item
The __init__ method runs as soon as an object of the class is created, so it is the place to do whatever initialization you want. Note that the name begins and ends with double underscores.
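A tiny illustration of that timing, with a hypothetical class that is not part of the project:

```python
class Greeter:
    def __init__(self, name):
        # executes immediately when Greeter(...) is instantiated
        self.name = name
        print("initialized with", name)

g = Greeter("doutu")  # __init__ runs here, before any other method is called
print(g.name)
```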
5. Configuration — settings.py
# Scrapy settings for doutu2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'doutu2'
SPIDER_MODULES = ['doutu2.spiders']
NEWSPIDER_MODULE = 'doutu2.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
'doutu2.middlewares.Doutu2SpiderMiddleware': 543,
}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'doutu2.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'doutu2.pipelines.Doutu2Pipeline': 300,
}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "test"
MONGODB_COLLECTION = "doutu2"
6. Middleware for tweaking requests and responses — middlewares.py
I haven't really used this part yet.