使用python的requests、xpath和多執行緒爬取糗事百科的段子

阿新 • • 發佈：2018-12-13

程式碼主要使用的python中的requests模組、xpath功能和threading多執行緒爬取了糗事百科中段子的內容、圖片和閱讀數、段子作者的性別，年齡和頭像。

# author: aspiring
import requests
from lxml import etree
import json
import threading
from queue import Queue

class QiubaiSpider:
    def __init__(self):
        self.url_temp = "https://www.qiushibaike.com/8hr/page/{}/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
        self.url_queue = Queue()
        self.html_queue = Queue()
        self.content_queue = Queue()

    def get_url_list(self):
        # return [self.url_temp.format(i) for i in range(1,14)]
        for i in range(1,14):
            self.url_queue.put(self.url_temp.format(i))

    def parse_url(self):
        while True:
            url = self.url_queue.get()
            print(url)
            response = requests.get(url,headers=self.headers)
            # return response.content.decode()
            self.html_queue.put(response.content.decode())
            self.url_queue.task_done()

    def get_content_list(self): # 提取資料
        while True:
            html_str = self.html_queue.get()
            html = etree.HTML(html_str)
            div_list = html.xpath("//div[@id='content-left']/div") # 分組
            content_list = []
            for div in div_list:
                item = {}
                item["content"] = div.xpath(".//div[@class='content']/span/text()")
                item["content"] = [i.replace("\n","") for i in item["content"]]
                item["author_gender"] = div.xpath(".//div[contains(@class,'articleGender')]/@class")
                item["author_gender"] = item["author_gender"][0].split(" ")[-1].replace("Icon", "") if len(item["author_gender"])>0 else None
                item["author_age"] = div.xpath(".//div[contains(@class,'articleGender')]/text()")
                item["author_age"] = item["author_age"][0] if len(item["author_age"]) >0 else None
                item["content_img"] = div.xpath(".//div[@class='thumb']/a/img/@src")
                item["content_img"] = "https:"+item["content_img"][0] if len(item["content_img"]) >0 else None
                item["author_img"] = div.xpath(".//div[@class='author clearfix']//img/@src")
                item["author_img"] = "https:"+item["author_img"][0] if len(item["author_img"])>0 else None
                item["stats_vote"] = div.xpath(".//span[@class='stats-vote']/i/text()")
                item["stats_vote"] = item["stats_vote"][0] if len(item["stats_vote"])>0 else None
                content_list.append(item)
            # return content_list
            self.content_queue.put(content_list)
            self.html_queue.task_done()

    def save_content_list(self): # 儲存
        while True:
            content_list = self.content_queue.get()
            with open("qiubai_queue.txt", "a", encoding="utf-8") as f:
                for content in content_list:
                    f.write(json.dumps(content, ensure_ascii=False, indent=2))
                    f.write("\n")
            print("儲存成功")
            self.content_queue.task_done()



    def run(self): # 實現主要邏輯
        thread_list = []
        #1.url_list
        t_url = threading.Thread(target=self.get_url_list)
        thread_list.append(t_url)
        #2.遍歷，傳送請求，獲取響應
        for i in range(20):
            t_parse = threading.Thread(target=self.parse_url)
            thread_list.append(t_parse)
        #3.提取資料
        for i in range(2):
            t_html = threading.Thread(target=self.get_content_list)
            thread_list.append(t_html)
        #4.儲存
        t_save = threading.Thread(target=self.save_content_list)
        thread_list.append(t_save)

        for t in thread_list:
            t.setDaemon(True) # 把子執行緒設定為守護執行緒，該執行緒不重要 主執行緒結束，子執行緒結束
            t.start()

        for q in [self.url_queue,self.html_queue,self.content_queue]:
            q.join() # 讓主執行緒等待阻塞，等待對列的任務完成之後再完成

        print("主執行緒結束")

if __name__ == '__main__':
    qiubai_spider = QiubaiSpider()
    qiubai_spider.run()

使用python的requests、xpath和多執行緒爬取糗事百科的段子

程式碼主要使用的python中的requests模組、xpath功能和threading多執行緒爬取了糗事百科中段子的內容、圖片和閱讀數、段子作者的性別，年齡和頭像。 # author: aspiring import requests from lxml import

使用threading,queue,fake_useragent,requests ,lxml,多執行緒爬取嗅事百科13頁文字資料,爬蟲案例

#author:huangtao # coding=utf-8 #多執行緒庫 from threading import Thread #佇列庫 from queue import Queue #請求庫 from fake_useragent import UserAgent

案例_(多線線程)爬取糗事百科

false 內容圖片 nbsp strip 5.0 mpat 交流 strong 1 # 使用了線程庫 2 import threading 3 # 隊列 4 from queue import Queue 5 # 解析庫 6 from lxml

NO.33——XPath選擇器爬取糗事百科段子

程式碼實戰： # -*- coding:utf-8 -*- import urllib import requests import re import chardet from lxml import etree page = 2 url = 'ht

python—多協程爬取糗事百科熱圖

wow64 monk 根據 list 網址 real span 本地 uil 今天在使用正則表達式時未能解決實際問題，於是使用bs4庫完成匹配，通過反復測試，最終解決了實際的問題，加深了對bs4.BeautifulSoup模塊的理解。爬取流程前奏：分析糗事百科熱圖板塊

Python爬蟲-爬取糗事百科段子

hasattr com ima .net header rfi star reason images 閑來無事，學學python爬蟲。在正式學爬蟲前，簡單學習了下HTML和CSS，了解了網頁的基本結構後，更加快速入門。 1.獲取糗事百科url http://www.qiu

Python 爬取糗事百科段子

爬蟲 Python 百科段子直接上代碼 #!/usr/bin/env python # -*- coding: utf-8 -*- import re import urllib.request def gettext(url,page): headers=("User-Agen

Python :爬取糗事百科段子

原始碼： import urllib import random def JokeSet(Url,UserAgent) ''' Url ：動態url網址 UserAgent :動態請求頭 ''' #設定請求頭 Headers ={ "User-Agent" : UserAgent

用BeautifulSoup爬取糗事百科段子

from bs4 import BeautifulSoup import lxml import requests import html import time import html5lib import re def crawl_joke_list_usebs4(pag

java佇列、棧和多執行緒結合使用的例子

剛來公司幾天，無意中聽到其他的開發組有用到佇列這個知識點，我沒用過也沒了解過，今天花時間補補這塊知識，整理了網上的一些資料。佇列其實所指生活中排隊的現象，去商場購物，付款時需要排隊，買飯時需要排隊，好多事情都是需要排隊，排在第一位的則先處理，結束後，後面的人都

《技術男征服美女HR》—Fiber、Coroutine和多執行緒那些事

![img](https://img.nopassby.com/20201204/11l6_c24aad10870a28296f28558b39e6c155_900x436.jpeg) ## 1、起點我叫小白，坐在這間屬於華夏國超一流網際網路公司企鵝巴巴的小會議室裡，等著技術面試官的到來。令我感到不

spider----利用多執行緒爬取51job案例

程式碼如下 import json from threading import Thread from threading import Lock from queue import Queue import requests from bs4 import BeautifulSoup i

Jsoup簡單例子2.0——多執行緒爬取網頁內的郵箱

上一篇文章講了利用Jsoup爬取貼吧帖子裡的郵箱，雖然爬取成功了，但我對效率有所追求。10頁的帖子爬取了兩百多個郵箱，最快用時8秒，一般需要9秒。在思考了一下怎麼提升效率後，決定採用多執行緒的方式爬取網頁內的郵箱。廢話不多說，直接上程式碼。引入Jsoup的jar包此處省略，沒有的可以檢視上篇文

【Python3爬蟲-爬圖片】多執行緒爬取中國國家地理全站美圖，多圖可以提高你的審美哦

宣告：爬蟲為學習使用，請各位同學務必不要對當放網站或i伺服器造成傷害。務必不要寫死迴圈。 - 思路：古鎮——古鎮列表（迴圈獲取古鎮詳情href）——xx古鎮詳情（獲取所有img的src） - 1. 單分類爬： from bs4 import BeautifulSo

Python爬蟲從入門到精通(3): BeautifulSoup用法總結及多執行緒爬蟲爬取糗事百科

本文是Python爬蟲從入門到精通系列的第3篇。我們將總結BeautifulSoup這個解析庫以及常用的find和select方法。我們還會利用requests庫和BeauitfulSoup來爬取糗事百科上的段子, 並對比下單執行緒爬蟲和多執行緒爬蟲的爬取效率。什麼是

Python爬蟲入門教程 10-100 圖蟲網多執行緒爬取

寫在前面經歷了一頓噼裡啪啦的操作之後，終於我把部落格寫到了第10篇，後面，慢慢的會涉及到更多的爬蟲模組，有人問scrapy 啥時候開始用，這個我預計要在30篇以後了吧，後面的套路依舊慢節奏的，所以莫著急了，100篇呢，預計4~5個月寫完，常見的反反爬後面也會寫的，還有fuck login類的內容。

java redis多執行緒爬取國美商品資訊

前面那篇爬蟲文章用的是單執行緒沒有用到其它一些比較提高效率的工具比較遺憾，所以今天做了一個比較全面的爬蟲。首先謝謝 @[天不生我萬古長](https://www.jianshu.com/u/e34019621ee9)這位小夥伴的留言，不然還真有點懶了。因為上班所以也只能利用

Python爬蟲教程：圖蟲網多執行緒爬取

我們這次也玩點以前沒寫過的，使用python中的queue，也就是佇列下面是我從別人那順來的一些解釋，基本爬蟲初期也就用到這麼多 Python學習資料或者需要程式碼、視訊加Python學習群：960410445 1. 初始化： classQueue.Queue(maxsize)FIFO

python多執行緒爬取網頁

#-*- encoding:utf8 -*- ''' Created on 2018年12月25日 @author: Administrator ''' from multiprocessing.dummy import Pool as pl import csv import requests fr

Python爬蟲入門教程 13-100 鬥圖啦表情包多執行緒爬取

寫在前面今天在CSDN部落格，發現好多人寫爬蟲都在爬取一個叫做鬥圖啦的網站，裡面很多表情包，然後瞅了瞅，各種實現方式都有，今天我給你實現一個多執行緒版本的。關鍵技術點 aiohttp ，你可以看一下我前面的文章，然後在學習一下。網站就不分析了，無非就是找到規律，拼接URL，匹配關鍵點，然後爬取。擼

使用python的requests、xpath和多執行緒爬取糗事百科的段子

相關推薦