Python 3 Crawler, Part 5: Scraping Website Data and Writing It to Excel

阿新 • Published: 2019-02-05

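This article explains how to write data from web pages into an Excel spreadsheet. Since I enjoy reading novels, we'll use the novel listings on 筆趣閣 (Biquge) as the example and collect the key information for each novel: title, word count, author, and URL.
From the source-code analysis in the earlier crawler examples in this series, we know that the novel title sits in the page's only h1 tag, so h1.get_text() returns it. The author is in the meta tag with property="og:novel:author", so html.find_all('meta', property="og:novel:author") returns a list containing that information. The other fields can be obtained in the same way.
The libraries used here are BeautifulSoup for parsing HTML, xlrd for reading Excel files, xlwt for writing them, and xlutils for copying a workbook.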
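
As a minimal sketch of the title and author extraction described above, the snippet below parses a small hand-written HTML string instead of a live page. The sample markup only imitates the layout described here and is an assumption, as are the helper name `extract_novel_info` and the placeholder title and author values.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the page layout described above (assumption,
# not copied from the real site).
SAMPLE_HTML = """
<html><head>
<meta property="og:novel:author" content="Some Author"/>
</head><body>
<h1>Some Novel</h1>
<div id="info"><p>Author: Some Author</p></div>
</body></html>
"""


def extract_novel_info(html):
    """Return (title, author) for a page laid out like SAMPLE_HTML."""
    tree = BeautifulSoup(html, 'html.parser')
    # The title is the text of the page's only h1 tag
    title = tree.h1.get_text().strip()
    # find_all returns a list; the author is in the first match's content attribute
    metas = tree.find_all('meta', property='og:novel:author')
    author = metas[0]['content'] if metas else ''
    return title, author


title, author = extract_novel_info(SAMPLE_HTML)
print(title, author)
```

Returning an empty string when the meta tag is missing is one possible choice; the full script further down instead falls back to placeholder values when anything fails.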

Code:

# coding: utf-8
from urllib import request

import xlrd
import xlwt
from bs4 import BeautifulSoup
from xlutils.copy import copy
# Base URL; a numeric book ID is appended to it, e.g. http://www.biqiuge.com/book/37708/
url = 'http://www.biqiuge.com/book/'

def getHtmlTree(url):
    webPage = request.urlopen(url)
    htmlCode = webPage.read()
    htmlTree = BeautifulSoup(htmlCode,'html.parser')
    return htmlTree
# xlsName = r'2.xls'
# Check whether the page exists and extract (author, word count, title)
def adjustExist(url):
    try:
        htmlTree = getHtmlTree(url)
        title = htmlTree.h1.get_text()
        author = htmlTree.find_all('meta', property="og:novel:author")
        author = author[0]['content']
        # The word count is taken from the info block: the text running from '共' to '字'
        txtSize = htmlTree.find('div', id='info')
        txtSize = txtSize.find_all('p')
        txtSize = str(txtSize)
        flag1 = txtSize.find('共')
        flag2 = txtSize.find('字')
        if -1 == flag1 or -1 == flag2:
            txtSize = ''
        else:
            txtSize = txtSize[flag1:flag2+1]
        # An h1 equal to this string is treated as the site's error page
        if u'出現錯誤！-筆趣閣' == title:
            print(url + '    does not exist!')
        else:
            print(url)
    except Exception:
        # Network or parsing failure: fall back to placeholder values
        author = 'fbl'
        txtSize = '0 bytes'
        title = 'Unknown'
    return (author, txtSize, title)
def main():
    reWriteFlag = False
    start_url = 6000
    end_url = 30000
    if start_url > end_url:
        (end_url,start_url) = (start_url,end_url)
    # start_url = 40000
    # end_url = 40001
    # Header row (index, title, word count, author, URL); used when reWriteFlag is True
    init = [u'序號',u'小說名',u'字數',u'作者',u'路徑']
    # workbook = xlwt.Workbook(encoding = 'utf-8')
    # data_sheet = workbook.add_sheet(u'筆趣閣小說')
    fileName = u'筆趣閣.xls'
    workbook = xlrd.open_workbook(fileName,formatting_info=True)
    # newBook = copy(workbook)
    # data_sheet = newBook.get_sheet(u'筆趣閣小說')
    if reWriteFlag:
        # old_sheet = workbook.sheet_by_name(u'筆趣閣小說')
        newBook = copy(workbook)
        data_sheet = newBook.get_sheet(u'筆趣閣小說')
        for i in range(len(init)):
            data_sheet.write(0,i,init[i])
        newBook.save(fileName)
    for j in range(start_url,end_url):
        workbook = xlrd.open_workbook(fileName,formatting_info=True)
        table = workbook.sheets()[0]
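        try:
            cell_value = table.cell(j,0).value
            if cell_value != '':
                # Row j was filled in by an earlier run, so skip it
                print(cell_value)
                continue
        except IndexError:
            # Row j lies beyond the end of the sheet, so it has not been written yet
            print('NLL')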
        url_tmp = url + str(j)
        (author,size,title) = adjustExist(url_tmp)
        tmp = [j,title,size,author,url_tmp]
        newBook = copy(workbook)
        data_sheet = newBook.get_sheet(u'筆趣閣小說')
        # data_sheet = newBook.sheet_by_name(u'筆趣閣小說')
        # print(cell_value)
        for k in range(len(tmp)):
            data_sheet.write(j,k,tmp[k])
        newBook.save(fileName)
if __name__ == '__main__':
    main()

Screenshot of the result:
[Image]

Excel's Text to Columns feature can then be used to split out the word count as a key data field:
[Image]
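
The same word-count extraction could also be done in Python before the row is written, rather than afterwards in Excel. The sketch below is only an illustration: it assumes the stored text has the form 共<digits>字, which matches how the script slices the text between those two characters but has not been checked against the real pages, and `parse_word_count` is a hypothetical helper name.

```python
import re


def parse_word_count(size_text):
    """Return the integer between '共' and '字', or None if the pattern is absent."""
    match = re.search(r'共\s*(\d+)\s*字', size_text)
    return int(match.group(1)) if match else None


print(parse_word_count('共123456字'))
```

Storing the count as an integer rather than as text also lets the column be sorted and filtered numerically in Excel.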
If you need this data, you can download it from my resources page; the file is named 筆趣閣小說資料彙總.xls
