Multi-threaded, Highly Fault-Tolerant Scraping of Toutiao Street-Snap Photos

阿新 • Published: 2019-04-04


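This script analyzes Toutiao's Ajax requests and uses regular expressions to scrape Toutiao's street-snap (街拍) photos with Python 3, in parallel and with a high tolerance for failures. It saves the results to MongoDB and downloads the images. The parallelism comes from multiprocessing.Pool, so the workers are processes rather than threads.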

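Toutiao's content pages have been redesigned since earlier versions: an image page may now be an Ajax-style gallery page or a plain HTML content page.

The script therefore uses two regular expressions and picks between them depending on which one matches.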
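The two-pattern idea can be sketched in isolation as follows. This is a minimal illustration, not the script's actual patterns: `ARTICLE_RE`, `GALLERY_RE`, and `extract_title` are hypothetical names, and the patterns are shortened versions that only capture the title.

```python
import re

# Hypothetical, simplified patterns; the ones in the full script are longer.
ARTICLE_RE = re.compile(r"articleInfo:.*?title:\s'(.*?)'", re.S)
GALLERY_RE = re.compile(r"galleryInfo.*?title:\s'(.*?)'", re.S)


def extract_title(html):
    """Try the article layout first, then fall back to the gallery layout."""
    for name, pattern in (("article", ARTICLE_RE), ("gallery", GALLERY_RE)):
        match = pattern.search(html)
        if match:
            return name, match.group(1)
    # Neither layout matched; the caller can skip this page.
    return None, None
```

Trying the patterns in a fixed order and returning a "no match" value keeps an unrecognized page layout from raising an exception, which is what lets the crawl continue.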

#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author:Aiker
@file:toutiao.py
@time:9:35 PM
"""
import json
import os
import re
from json import JSONDecodeError
from multiprocessing import Pool
from urllib.parse import urlencode
from hashlib import md5
import pymongo
import requests
from requests.exceptions import RequestException

MONGO_URL = 'localhost:27017'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'
# connect=False defers the connection until first use, so each worker process opens its own
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

def get_url(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed', url)
        return None

def get_page_index(offset, keyword):
    data = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': '1124216535987'
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(data)  # encode the dict as a URL query string
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

def parse_page_index(html):
    try:
        data = json.loads(html)  # parse the JSON text into a dict
        if data and 'data' in data.keys():
            # print(data.keys())  # debugging: list all keys
            for item in data.get('data'):
                if 'article_url' in item:  # skip items without a URL so no None is yielded
                    yield item.get('article_url')  # makes this function a generator
    except JSONDecodeError:
        pass
    except TypeError:
        # html was None (failed request) or 'data' was null
        pass

def get_page_detail(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the detail page', url)
        return None

def parse_page_detail(html, url):
    # Layout 1: article page, with the title and HTML content inside articleInfo
    pattern = re.compile(r"articleInfo:.*?title:\s'(.*?)',.*?content:\s'(.*?)'.*?groupId", re.S)
    result = re.findall(pattern, html)
    if result:
        title, content = result[0]
        # Assumption: the embedded content is HTML-escaped, so each image URL ends at &quot;
        pattern = re.compile(r"(http://.*?)&quot;", re.S)
        images = re.findall(pattern, content)
        for image in images:
            download_image(image, title)
        return {
            'title': title,
            'url': url,
            'images': images
        }
    else:
        # Layout 2: gallery page, with the image data passed to JSON.parse as a string
        pattern = re.compile(r"BASE_DATA.galleryInfo.*?title:\s'(.*?)'.*?gallery: JSON.parse\(\"(.*)\"\)", re.S)
        result = re.findall(pattern, html)
        if result:
            title, content = result[0]
            try:
                # strip the backslashes that escape the embedded JSON string
                data = json.loads(content.replace('\\', ''))
            except JSONDecodeError:
                return None
            if data and 'sub_images' in data.keys():
                sub_images = data.get('sub_images')
                images = [item.get('url') for item in sub_images]
                for image in images:
                    download_image(image, title)
                return {
                    'title': title,
                    'url': url,
                    'images': images
                }

def save_to_mongo(result):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False

def download_image(url, title):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content, title)
        return None
    except RequestException:
        print('Error requesting the image', url)
        return None

def save_image(content, title):
    try:
        if title:
            # strip characters from the title that would break directory creation
            title = re.sub('[:?！!：？]', '', title)
        base_dir = 'z:\\toutiao\\'
        folder = base_dir + title
        # exist_ok avoids an error when several workers create the same folder at once
        os.makedirs(folder, exist_ok=True)
        # name the file by the MD5 of its bytes so duplicate images are written only once
        file_path = '{0}/{1}.{2}'.format(folder, md5(content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(content)
    except OSError:
        pass

def main(offset):
    html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(html):
        print(url)
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html, url)
            if result:
                save_to_mongo(result)

if __name__ == '__main__':
    # offsets 20, 40, ..., 400; each worker process handles one page of search results
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()
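
The script above parallelizes with multiprocessing.Pool, which starts worker processes. Since the work is mostly network-bound, a thread-based variant is also plausible. The sketch below is an assumption about how that could look, not the author's code: `crawl_offset` is a hypothetical stand-in for `main(offset)` that fails on one offset on purpose, and `run_all` is an illustrative helper name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_offset(offset):
    """Hypothetical stand-in for main(offset); fails on one offset on purpose."""
    if offset == 40:
        raise ValueError("simulated failure")
    return offset // 20


def run_all(offsets, worker=crawl_offset, max_workers=4):
    """Run worker over all offsets in threads; keep results and failures apart."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, off): off for off in offsets}
        for future in as_completed(futures):
            off = futures[future]
            try:
                results[off] = future.result()
            except Exception as exc:  # one bad page should not stop the others
                failures[off] = exc
    return results, failures
```

Collecting exceptions per offset, instead of letting the first one propagate as `pool.map` does, means a single failing page leaves the remaining offsets unaffected and the failed ones can be retried later.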

The images are downloaded and the records are saved to MongoDB.
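
The image-saving step can be tried on its own, without network or database access. This is a minimal sketch under stated assumptions: `save_bytes` is a hypothetical helper, the character set it strips is one reasonable choice for Windows paths, and the `'untitled'` fallback is an addition for illustration.

```python
import os
import re
from hashlib import md5


def save_bytes(content, title, base_dir):
    """Write content to base_dir/<clean title>/<md5>.jpg; return (path, written)."""
    # Drop characters that are not allowed in Windows file names (assumed set).
    clean = re.sub(r'[\\/:*?"<>|！：？]', '', title).strip() or 'untitled'
    folder = os.path.join(base_dir, clean)
    os.makedirs(folder, exist_ok=True)  # safe even if the folder already exists
    path = os.path.join(folder, md5(content).hexdigest() + '.jpg')
    if os.path.exists(path):
        return path, False  # same bytes were already saved
    with open(path, 'wb') as f:
        f.write(content)
    return path, True
```

Because the file name is the MD5 digest of the image bytes, saving the same image twice yields the same path, and the second call skips the write.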
