Python: Scraping Qiushibaike Hot Images with Multiple Coroutines

阿新 • Published: 2019-02-19


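Today a regular expression failed to solve the problem at hand, so I used the bs4 library to do the matching instead. After repeated testing the problem was finally solved, and I came away with a better understanding of the bs4.BeautifulSoup module.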

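Crawling workflow

Preparation:

Analyze the URL of Qiushibaike's hot-images section. Because the crawl has to page through the listing, it is necessary to work out how the URL changes with the page number.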
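As a minimal sketch of that URL analysis, assuming the list pages follow the pattern used in the code further down (the base URL plus a page number starting at 1). The helper name build_page_urls is hypothetical and only serves the illustration:

```python
BASE_URL = 'https://www.qiushibaike.com/imgrank/page/'


def build_page_urls(base_url, depth):
    """Return the list-page URLs for pages 1..depth."""
    return [base_url + str(page) for page in range(1, depth + 1)]


if __name__ == '__main__':
    # Print the first three page URLs to check the pattern by eye.
    for page_url in build_page_urls(BASE_URL, 3):
        print(page_url)
```

Building the URLs up front keeps the page-number logic in one place, which makes it easy to hand each URL to a separate coroutine later.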

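Steps:

1. Fetch the page content (urllib.request). # Qiushibaike uses anti-crawler techniques, so add headers to disguise the request as a browser

2. Parse the page content and extract the image links (from bs4 import BeautifulSoup)

3. Download each image through its link (urllib.request) and save it locally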
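Steps 1 and 3 can be sketched with the standard library alone, without touching the network. This is an illustration under stated assumptions, not the code of this post: the helper names (build_request, absolute_image_url, local_path), the shortened User-Agent string and the image host pic.example.com are all made up for the example.

```python
import os
import urllib.request
from urllib.parse import urljoin, urlparse

# Browser-like User-Agent; the exact string is an assumption for this sketch.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36')


def build_request(url):
    """Step 1: wrap the URL in a Request that carries a browser-like header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def absolute_image_url(link, page_url='https://www.qiushibaike.com/'):
    """Step 3: resolve a protocol-relative link (//host/a.jpg) against the page URL."""
    return urljoin(page_url, link)


def local_path(index, link, folder='./img/'):
    """Step 3: number the file and keep the extension found in the link."""
    extension = os.path.splitext(urlparse(link).path)[1]
    return folder + str(index) + extension


if __name__ == '__main__':
    request = build_request('https://www.qiushibaike.com/imgrank/page/1')
    # urllib stores header names capitalized, hence 'User-agent'.
    print(request.get_header('User-agent'))
    # The image host below is hypothetical.
    link = absolute_image_url('//pic.example.com/medium/abc.jpg')
    print(link, local_path(1, link))
```

Passing the Request object to urllib.request.urlopen() would send the header with that single request, as an alternative to installing a global opener.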

Note:

The code below carries detailed comments explaining each step of the crawl.

from gevent import monkey

# Patch blocking standard-library I/O as early as possible,
# so that network calls yield to other coroutines
monkey.patch_all()

import os
import urllib.request

import bs4
import gevent
from bs4 import BeautifulSoup


def get_html_text(url, raw_html_text, depth):
    # Fetch the raw HTML of every list page.
    # Qiushibaike has an anti-crawler mechanism, so set a request header
    # that disguises the crawler as a browser.
    hd = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0')

    # Create an opener object and attach the header
    opener = urllib.request.build_opener()
    opener.addheaders = [hd]

    # Install the opener globally so that urlopen() uses it
    urllib.request.install_opener(opener)

    # Page through the listing and collect the HTML text
    for i in range(depth):
        # Based on the URL analysis, build the URL of each page (pages start at 1)
        url_real = url + str(i + 1)
        try:
            html_data = urllib.request.urlopen(url_real).read().decode('utf-8', 'ignore')
            raw_html_text.append(html_data)
        except Exception as result:
            print('Error:', result)

    print('Finished fetching pages...')
    return raw_html_text


def parser_html_text(raw_html_text, done_img):
    # Walk through every fetched page
    for html_text in raw_html_text:
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(html_text, 'html.parser')
        # soup.find_all('div', 'thumb') finds every <div> tag whose class is "thumb".
        # According to the page source, the image URL is stored in the src
        # attribute of an <img> tag nested inside that <div>.
        for tag in soup.find_all('div', 'thumb'):
            # Walk through everything nested inside the <div>
            for img in tag.descendants:
                # Not every descendant is a tag (some are text nodes),
                # so check the type before looking at the tag name
                if isinstance(img, bs4.element.Tag) and img.name == 'img':
                    link = img.get('src')
                    # Skip <img> tags that have no src attribute
                    if link:
                        done_img.append(link)
    print('Finished parsing pages...')
    return done_img


def save_crawler_data(done_img):
    # Save the images locally; './' means the current directory
    path = './img/'
    # Create the target directory if it does not exist yet
    os.makedirs(path, exist_ok=True)
    # enumerate(list) yields each index together with its element
    for i, j in enumerate(done_img):
        # The scraped links lack the leading 'https:', so prepend it
        if j.startswith('//'):
            j = 'https:' + j
        # Download the image with urllib.request.urlopen()
        try:
            img_data = urllib.request.urlopen(j).read()
            # Keep the extension from the link so the file opens as an image
            path_real = path + str(i + 1) + os.path.splitext(j)[1]
            with open(path_real, 'wb') as f:
                f.write(img_data)
        except Exception as result:
            # Skip images that fail to download
            print('Skipped:', j, result)
            continue
    print('Finished saving images')


def main():
    url = 'https://www.qiushibaike.com/imgrank/page/'
    depth = 20
    raw_html_text = []
    done_img = []
    # Each stage consumes the output of the previous one, so the stages run
    # in order, each in its own coroutine. Spawning all three at once, after
    # already running them, would repeat the work and append duplicates.
    gevent.spawn(get_html_text, url, raw_html_text, depth).join()
    gevent.spawn(parser_html_text, raw_html_text, done_img).join()
    gevent.spawn(save_crawler_data, done_img).join()


if __name__ == '__main__':
    main()
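The three stages above depend on one another, so coroutines pay off mainly inside a stage, for example by fetching several pages at the same time. The following is a rough sketch of that idea rather than the approach of this post. It assumes gevent is installed; crawl_pages and fake_fetch are hypothetical names, and fake_fetch only simulates a download so that the example runs without network access.

```python
import gevent
from gevent.pool import Pool


def crawl_pages(page_urls, fetch, concurrency=5):
    """Run fetch(url) for every URL, at most `concurrency` greenlets at a time.

    `fetch` is any callable that takes a URL and returns the page text.
    The results are returned in the same order as `page_urls`.
    """
    pool = Pool(concurrency)
    return pool.map(fetch, page_urls)


def fake_fetch(url):
    """Stand-in for a real download: wait cooperatively, then return some text."""
    gevent.sleep(0.01)  # yields to other greenlets, like a patched network call
    return '<html>' + url + '</html>'


if __name__ == '__main__':
    urls = ['https://www.qiushibaike.com/imgrank/page/' + str(n) for n in range(1, 4)]
    for text in crawl_pages(urls, fake_fetch):
        print(len(text))
```

With a real fetch function built on urllib.request, monkey.patch_all() has to run first, as in the script above; otherwise urlopen() blocks and the greenlets run one after another. The pool size caps how many requests hit the site at once.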
