用selenium以外的方法實現爬取海報時尚網熱門圖片
阿新 • 發佈:2018-12-25
廢話不多說, 直接上程式碼!!!
import datetime
import json
import os
import re
import time
import urllib.parse
from urllib.request import urlretrieve

import requests

"""
介面連線
http://pic.haibao.com/ajax/image:getHotImageList.json?stamp=Thu%20Dec%2013%202018%2008:45:30%20GMT+0800%20(%E4%B8%AD%E5%9B%BD%E6%A0%87%E5%87%86%E6%97%B6%E9%97%B4)
Analysis of the endpoint url shows the real url is the base url plus the
current date-time appended as a url-encoded parameter.
"""

# Build the timestamp parameter, e.g. "Thu Dec 13 2018 08:45:30 GMT".
# strftime can emit the fields in the required order directly, which
# replaces the original's fragile fixed-index string slicing.
GMT_FORMAT = "%a %b %d %Y %H:%M:%S GMT"
date_time = datetime.datetime.utcnow().strftime(GMT_FORMAT)

url_str = "http://pic.haibao.com/ajax/image:getHotImageList.json?param={}"
param = urllib.parse.quote(date_time + " " + "(中國標準時間)")
url = url_str.format(param)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}


class Imgspider(object):
    """Crawler for the "hot image" list of pic.haibao.com.

    Repeatedly POSTs the list endpoint, extracts image urls from the
    returned HTML fragment, and downloads each image into a per-page
    directory ``Img<page>``.
    """

    def __init__(self):
        # Shared module-level endpoint url and request headers.
        self.headers = headers
        self.url = url

    def _fetch_page(self, skip):
        """POST one list page.

        :param skip: pagination cursor expected by the endpoint.
        :return: (image_urls, hasMore, next_skip) where ``hasMore`` is 1
                 while more pages remain and ``next_skip`` is the cursor
                 for the following request.
        """
        response_text = requests.post(
            url=self.url, data={"skip": skip}, headers=self.headers
        ).text
        # The JSON payload wraps an HTML fragment; the real image urls
        # live in the lazy-load attribute data-original="...".
        result = json.loads(response_text)["result"]
        pattern = re.compile(r'data-original="(.*?)"')
        img_urls = pattern.findall(result["html"])
        return img_urls, result["hasMore"], result["skip"]

    def _download_images(self, img_urls, page):
        """Download every url in ``img_urls`` into directory ``Img<page>``.

        Failures on individual images are printed and skipped so one bad
        url does not abort the whole page (best-effort, as before).
        """
        img_dir = "Img{}".format(page)
        # BUG FIX: the original checked os.path.exists("img_dir") — a
        # string literal — so the check never saw the directory it had
        # created and os.mkdir raised on a second visit to the same page.
        if not os.path.exists(img_dir):
            os.mkdir(img_dir)
        num = 1
        for img_url in img_urls:
            try:
                time.sleep(0.5)  # be polite to the server
                print("開始下載:::::第{}張圖片".format(num))
                urlretrieve(img_url, "Img{}/{}.jpg".format(page, num))
                print("結束下載:::::第{}張圖片".format(num))
                time.sleep(0.5)
                num += 1
            except Exception as e:
                # best-effort: report and continue with the next image
                print(e)

    def first_page(self):
        """Crawl page 1 (the endpoint expects skip=75 for the first call).

        :return: (hasMore, skip) pagination state for the next page.
        """
        print("開始爬取::::::第1頁")
        img_urls, hasmore, skip = self._fetch_page(75)
        self._download_images(img_urls, 1)
        print("結束爬取::::::第1頁")
        return hasmore, skip

    def run(self):
        """Crawl page 1, then keep fetching pages while hasMore == 1."""
        hasmore, skip = self.first_page()
        print(hasmore, skip)
        page = 2
        while hasmore == 1:
            print("開始爬取::::::第{}頁".format(page))
            print(skip)
            img_urls, hasmore, skip = self._fetch_page(skip)
            print(skip)
            self._download_images(img_urls, page)
            print("結束下載::::::第{}頁".format(page))
            page += 1


img = Imgspider()
img.run()