Batch-scraping novels with Python

阿新 • Published: 2018-12-31

Batch-scrape novels from Biquge (筆趣閣) using BeautifulSoup.

from bs4 import BeautifulSoup
import urllib.request
import re
import os
import threading
import time
# Scrape a novel with a crawler

base_url = 'http://www.qu.la' # Biquge home page URL

class myThread(threading.Thread):   # subclass of threading.Thread
    def __init__(self, threadID, counter, start_page):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.counter = counter
        self.start_page = start_page
        # Fetch the book's index page up front so bookname is available for progress output
        self.bookname, self.url, self.first_url = get_book_by_id(self.counter, self.start_page)
    def run(self):                   # the thread's work goes here; run() is invoked once start() is called
        get_chapter_content(self.bookname, self.url, self.first_url)

def get_book_by_id(counter, start_page):
    # The book ID is start_page + counter
    url = base_url + '/book/' + str(counter + start_page) + '/'
    html_res = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_res, 'html.parser')
    info = soup.select('#wrapper .box_con #maininfo #info')[0]
    bookname = info.contents[1].string
    writer = info.find('p').string
    latest = info.find_all('p')[2].string # last updated
    newest = info.find_all('p')[3].get_text() # latest chapter (text only, without the HTML tags)
    intro = soup.select('#wrapper .box_con #maininfo #intro')[0].text
    introduction = "{0}\n{1}\n{2}\n{3}\n{4}\n".format(bookname, writer, latest, newest, intro)
    # Close the file here so the header is flushed before get_chapter_content appends to it
    with open("{}.txt.download".format(bookname), 'w', encoding='utf-8') as fw:
        fw.write(introduction)
    # Find the href of the first chapter to start downloading from
    contents = soup.select('#wrapper .box_con #list dl dt')
    start = None
    for content in contents:
        # Match the <dt> heading that contains '一' or '正文' (main text); the last match wins
        if '一' in str(content) or '正文' in str(content):
            start = content
    if start is None:
        raise ValueError("no chapter list heading found at " + url)
    first_href = start.find_next_sibling('dd').contents[1]['href']
    first_url = base_url + first_href
    return bookname, url, first_url


def get_chapter_content(bookname, url, chapter_url):
    with open("{0}.txt.download".format(bookname), 'a', encoding='utf-8') as fa:
        while True:
            try:
                html_ret = urllib.request.urlopen(chapter_url, timeout=15).read()
            except Exception:
                time.sleep(1)  # pause briefly, then retry the same chapter
                continue
            soup = BeautifulSoup(html_ret, 'html.parser')
            chapter = soup.select('#wrapper .content_read .box_con .bookname')[0]
            chapter_url = chapter.find_all('a')[2]['href']  # href of the link followed to reach the next chapter
            chapter_name = chapter.h1.string
            chapter_content = soup.select('#wrapper .content_read .box_con #content')[0].text
            # Collapse each run of whitespace into a line break plus a tab indent
            chapter_content = re.sub(r'\s+', '\r\n\t', chapter_content).strip('\r\n')
            fa.write(chapter_name)
            fa.write(chapter_content)
            if chapter_url == "./":  # the link points back to the index: no more chapters
                break
            chapter_url = url + chapter_url
    # The file is closed at this point, so it can be renamed safely
    os.rename('{}.txt.download'.format(bookname), '{}.txt'.format(bookname))
    print("{}.txt download complete".format(bookname))


# Batch-download txt files for the 10 consecutive book IDs starting at start_page
def get_txts(start_page):
    threads = []
    print("Current start page: " + str(start_page))
    print("=============== Creating download tasks ====================")
    for i in range(10):
        # get_book_by_id adds start_page to the counter, so pass only the offset here
        thread_one = myThread(start_page + i, i, start_page)
        thread_one.start()
        threads.append(thread_one)
    print("================ Download tasks created ================")
    print("================ Waiting for downloads to finish ================")
    task_num = len(threads)
    while True:
        os.system('cls' if os.name == 'nt' else 'clear')
        print('============{0:0>8}-{1:0>8} '.format(start_page, start_page + 10) + "downloading===========")
        run_task = 0
        for thread in threads:
            if thread.is_alive():  # isAlive() was removed in Python 3.9
                run_task += 1
                print('{} downloading'.format(thread.bookname))
            else:
                print('{} download complete'.format(thread.bookname))
        print("Total tasks: " + str(task_num) + "  Completed tasks: " + str(task_num - run_task))
        if run_task == 0:
            break
        time.sleep(1)
    os.system('cls' if os.name == 'nt' else 'clear')
    print("All download tasks completed")
    time.sleep(2)

if __name__ == "__main__":
    get_txts(20)
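
The script above builds chapter URLs by string concatenation and retries a failed request indefinitely. As a minimal sketch of one possible alternative, the two helpers below resolve the "next" href with `urllib.parse.urljoin` and cap the number of retries. The names `next_chapter_url` and `fetch_with_retry`, and the injected `fetch` callable, are hypothetical and not part of the original script; the stop condition assumes, as the script does, that the last chapter links back to the book's index page.

```python
import time
from urllib.parse import urljoin


def next_chapter_url(index_url, href):
    """Resolve a chapter href against the book's index URL.

    Returns None when the href resolves to the index page itself,
    which the original script treats as "no more chapters" ("./").
    """
    resolved = urljoin(index_url, href)
    if resolved == index_url:
        return None
    return resolved


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url), retrying on OSError up to `attempts` times.

    `fetch` is any callable that returns the page body; it is injected
    so the helper can be exercised without network access.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            last_error = exc
            time.sleep(delay * (attempt + 1))  # wait a little longer after each failure
    raise last_error
```

With `urljoin`, relative hrefs such as `1234.html` and site-absolute hrefs such as `/book/20/1235.html` both resolve correctly against `http://www.qu.la/book/20/`, so the same helper could cover both concatenation cases in the script. A real `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=15).read()`.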

Result of a run (screenshot not preserved).
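
The polling loop in `get_txts` checks each thread once per second. A rough sketch of the same fan-out using the standard library's `concurrent.futures` is shown below, assuming a per-book download function is available. `download_all` and `download_one` are hypothetical names introduced for illustration only; in the script above, `download_one` would have to wrap `get_book_by_id` and `get_chapter_content`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(book_ids, download_one, max_workers=10):
    """Run download_one(book_id) for every ID on a thread pool.

    Returns a dict mapping each book ID to the function's return value,
    or to the exception it raised, so one failure does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its book ID so results can be labelled
        futures = {pool.submit(download_one, book_id): book_id for book_id in book_ids}
        for future in as_completed(futures):
            book_id = futures[future]
            try:
                results[book_id] = future.result()
            except Exception as exc:
                results[book_id] = exc
    return results


if __name__ == "__main__":
    # Stand-in for a real download function; it only builds a file name
    def fake_download(book_id):
        return "book-{}.txt".format(book_id)

    for book_id, outcome in sorted(download_all(range(20, 30), fake_download).items()):
        print(book_id, outcome)
```

The pool caps the number of concurrent requests through `max_workers`, and `as_completed` yields each task as soon as it finishes, which removes the need for a sleep-and-poll loop.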
