Python Crawler: Scraping Toutiao Street-Snap Images (1.1)

阿新 • Published: 2019-01-14


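The example below scrapes image information from Toutiao. It only takes the images (the large thumbnails) from the JSON returned by the search-result list. That result does not include the full image list of each linked detail page, so the scrape is incomplete. I will improve it later when I have time.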

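1. The request that fetches Toutiao street-snap (街拍) images is a GET against the search endpoint; its URL and parameters appear in the program below.

2. Stepping through with a debugger shows the request parameters and the corresponding response data.

3. In the response, the important part is data, whose entries carry fields such as group_id, image_list, and large_image_url.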
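The request and response described in the three steps above can be sketched without touching the network. This is only an illustration: the helper names `build_search_url` and `iter_image_groups` are made up for this sketch, the parameters are a subset of the ones the main program sends, and the `sample` payload is hand-written to mimic the field names mentioned above rather than captured from the real service.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://www.toutiao.com/search_content/'


def build_search_url(offset, keyword='街拍', count=20):
    """Return the full GET URL for one page of search results."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': 1,
        'pd': 'synthesis',
    }
    # urlencode percent-encodes the non-ASCII keyword as UTF-8.
    return SEARCH_URL + '?' + urlencode(params)


def iter_image_groups(payload):
    """Yield (group_id, image URLs) for entries that carry images."""
    for entry in payload.get('data') or []:
        if not entry.get('has_image'):
            continue  # video groups and other entries have no image list
        urls = [img['url'] for img in entry.get('image_list', [])]
        yield entry['group_id'], urls


# Hypothetical payload: only the field names come from the article.
sample = {
    'data': [
        {'group_id': '1', 'has_image': True, 'title': 'a',
         'image_list': [{'url': '//p3.example/list/abc'}],
         'large_image_url': 'http://p3.example/large/pgc-image/abc'},
        {'group_id': '2', 'has_image': False},
        {'title': 'entry without a has_image key'},
    ]
}

print(build_search_url(20))
print(list(iter_image_groups(sample)))
```

Using `entry.get('has_image')` covers both the missing-key case and the false case in one check, which is the same filter the main program applies before reading `image_list`.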

The main program follows.

It saves the scraped images to local disk, then stores the image-group and image records in a MySQL database.

# Scrape Toutiao street-snap (街拍) data: save the images under a local directory
# and record that directory and the image metadata in a MySQL database.
import os
import time
import urllib.request

import pymysql
from requests import Request, Session


class TouTiaoDeep:
    def __init__(self):
        self.url = 'https://www.toutiao.com/search_content/'
        self.imagePath = 'D:/toutiao/images/'
        self.headers = {
            'Accept': 'application/json, text/javascript',
            'Accept-Encoding': 'gzip, deflate, br',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Host': 'www.toutiao.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
            'X-Requested-With': 'XMLHttpRequest'
        }
        self.param = {
            'offset': 0,
            'format': 'json',
            'keyword': '街拍',
            'autoload': 'true',
            'count': 20,
            'cur_tab': 1,
            'from': 'search_tab',
            'pd': 'synthesis'
        }
        self.filePath = "D:/toutiaoImages"
        # {rows: [{title: '', pathName: '', images: [{name: '', desc: '', date: '', downloadUrl: ''}, ...]}, ...]}
        self.imgDict = {}

    def getImgDict(self, offset):
        self.param['offset'] = offset  # paging offset
        session = Session()
        req = Request(method='GET', url=self.url, params=self.param, headers=self.headers)
        prep = session.prepare_request(req)
        res = session.send(prep)
        if res.status_code == 200:
            payload = res.json()
            for entry in payload['data']:
                # The results also contain video groups; keep only entries with images.
                if entry.get('has_image'):
                    large = entry['large_image_url']
                    yield {
                        'group_id': entry['group_id'],
                        'groupTitle': entry['title'],
                        'groupImages': entry['image_list'],
                        'total': len(entry['image_list']),
                        'abstract': entry['abstract'],
                        # Example: http://p3-tt.bytecdn.cn/large/pgc-image/2dc7e3cd2e0c46f69ee67c11c13ff58e
                        # The last segment is the image id; everything before it is the
                        # large-image base URL, which differs from group to group.
                        'large_image_url': large[:large.rindex('/')]
                    }

    def imagesDownLoad(self, offset):
        # Current time formatted as "%Y-%m-%d %H:%M:%S".
        strTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

        for item in self.getImgDict(offset):
            print(item['groupTitle'])
            # One storage folder per group, named after the group id.
            groupDir = self.imagePath + item['group_id']
            os.makedirs(groupDir, exist_ok=True)
            for i in item['groupImages']:
                # The image id is the last path segment of the list URL.
                imageId = i['url'][i['url'].rindex('/') + 1:]
                imgURL = item['large_image_url'] + '/' + imageId  # full image URL
                print(imgURL)
                try:
                    a = urllib.request.urlopen(imgURL)
                except Exception:
                    # Some images live under a path such as
                    # http://p1.pstatp.com/origin/pgc-image/7290e8fcfdbc4a458d8ed7a6c1581283
                    # (the "p1" can be swapped for "p" followed by any number).
                    # Note: around page 20 of the crawl, image paths start failing to resolve.
                    a = urllib.request.urlopen('http://p1.pstatp.com/origin/pgc-image/' + imageId)
                try:
                    with open(groupDir + '/' + imageId + '.jpg', 'wb') as f:
                        f.write(a.read())
                    # Persist the image record.
                    rows_1 = {
                        'imageId': imageId,
                        'imagesource': imgURL,
                        'imageName': imageId + '.jpg',
                        'imageDesc': 'none',
                        'groupid': item['group_id']
                    }
                    self.imageInfPersistent(rows_1)
                except Exception:
                    print('File download failed')
            # Persist the image-group record.
            rows_2 = {
                'groupid': item['group_id'],
                'grouptitle': item['groupTitle'],
                'groupdesc': item['abstract'],
                'path': 'toutiao/images/' + item['group_id'],
                'createTime': strTime
            }
            self.imgGroupPersistent(rows_2)

    # Insert one row into a MySQL table.
    def mysqlPersistent(self, tableName, data):
        db = pymysql.connect(host='localhost', user='root', password='admin', port=3306, db='test')
        cursor = db.cursor()
        try:
            columns = ','.join(data.keys())
            values = ','.join(['%s'] * len(data))
            sql = 'insert into {table}({keys}) VALUES ({values})'.format(table=tableName, keys=columns, values=values)
            cursor.execute(sql, tuple(data.values()))
            db.commit()
        except Exception:
            db.rollback()
        finally:
            db.close()

    # Persist an image-group record.
    def imgGroupPersistent(self, groupDict):
        # Group table: group id, title, description, local storage path, creation time.
        self.mysqlPersistent('imageGroup', groupDict)

    # Persist an image record.
    def imageInfPersistent(self, imageInfDict):
        # Image table: image id, source URL, file name, description, owning group id.
        self.mysqlPersistent('imageInfo', imageInfDict)

    # Create the tables.
    def createImgTable(self):
        sql_imgGroup = 'create table imageGroup(groupid varchar(50) primary key,grouptitle varchar(200),groupdesc text,path varchar(500),createTime varchar(50))'
        sql_imgInf = 'create table imageInfo(imageId varchar(50) primary key,imagesource varchar(200),imageName varchar(100),imageDesc text,groupid varchar(50))'
        db = pymysql.connect(host='localhost', user='root', password='admin', port=3306, db='test')
        cursor = db.cursor()
        try:
            cursor.execute(sql_imgGroup)
            cursor.execute(sql_imgInf)
        except Exception:
            print('Failed to create tables!')
        finally:
            cursor.close()
            db.close()

    # Drop the tables.
    def dropImgTables(self):
        sql_dropImageGroup = 'drop table if exists imageGroup'
        sql_dropImageInfo = 'drop table if exists imageInfo'
        db = pymysql.connect(host='localhost', user='root', password='admin', port=3306, db='test')
        cursor = db.cursor()
        try:
            cursor.execute(sql_dropImageGroup)
            cursor.execute(sql_dropImageInfo)
        except Exception:
            print('Failed to drop tables!')
        finally:
            cursor.close()
            db.close()


if __name__ == '__main__':
    deep = TouTiaoDeep()
    deep.dropImgTables()   # drop the tables
    deep.createImgTable()  # create the tables
    for i in range(0, 10 * 20, 10):
        deep.imagesDownLoad(i)
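The program's mysqlPersistent method builds its INSERT statement from the keys of a dict and passes the values separately as query parameters. The sketch below isolates that step so it can be checked without a database. The function name `build_insert` and the identifier check are additions made for this illustration, not part of the original program; the check is there because `%s` placeholders protect values only, while table and column names are pasted into the SQL text as-is.

```python
import re

# Plain identifiers only: letters, digits, underscore, not starting with a digit.
_IDENTIFIER = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def build_insert(table, row):
    """Return (sql, params) for a parameterized INSERT built from a dict."""
    for name in [table, *row.keys()]:
        if not _IDENTIFIER.match(name):
            raise ValueError('unsafe SQL identifier: %r' % (name,))
    columns = ', '.join(row.keys())
    placeholders = ', '.join(['%s'] * len(row))
    sql = 'INSERT INTO {} ({}) VALUES ({})'.format(table, columns, placeholders)
    # dicts keep insertion order, so keys() and values() line up.
    return sql, tuple(row.values())


sql, params = build_insert('imageInfo', {'imageId': 'abc', 'groupid': '1'})
print(sql)
print(params)
```

The resulting pair can be handed to a pymysql cursor as `cursor.execute(sql, params)`, which is the same call shape the main program uses.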

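Result after running. Note: because the spliced image URLs are not always correct, the crawl raises exceptions when an image address turns out to be wrong.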
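One way to soften the wrong-URL failures is to try several candidate addresses in turn and skip the image if none loads. The sketch below is an assumption-laden illustration, not the author's method: the helper names are invented, the `shards` default of `(1, 3)` is an arbitrary pick based on the remark in the code that "p1" can be replaced by "p" plus any number, and the `fetch` argument exists only so the logic can be exercised without network access.

```python
import urllib.request


def image_id(url):
    """Return the last path segment of an image URL."""
    return url.rstrip('/').rsplit('/', 1)[-1]


def candidate_urls(large_image_base, list_url, shards=(1, 3)):
    """List the URLs to try, most likely first.

    large_image_base is the group's large-image URL with the id removed.
    shards is a hypothetical choice of pN hosts to fall back to.
    """
    img = image_id(list_url)
    urls = [large_image_base.rstrip('/') + '/' + img]
    urls += ['http://p{}.pstatp.com/origin/pgc-image/{}'.format(n, img)
             for n in shards]
    return urls


def http_fetch(url):
    """Download one URL and return its bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def fetch_first(urls, fetch=http_fetch):
    """Return (url, content) for the first URL that loads, else None."""
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            continue
    return None
```

A caller would write the returned bytes to disk when the result is not None and log the image id otherwise, so one bad address no longer stops the whole page.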
