Scraping every novel on a web page with Python

阿新 • Published: 2018-12-03

Python 2.7.15
An exercise in using BeautifulSoup.
If you are not familiar with bs, read the BeautifulSoup documentation first.

1. Look for the pattern in the URLs

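Because the goal is to scrape every novel on the page, we need not only the URL of the page itself but also the URLs of the links on it. These usually follow a pattern; if they do not, use a regex or bs to pull out a list of links and iterate over it.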
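
When the link URLs follow no obvious pattern, "pull out a list and iterate" could look roughly like the sketch below. It is written for Python 3 with only the standard library, whereas the article itself uses Python 2.7 and BeautifulSoup. The sample HTML, the base URL `http://example.com/author/1/`, and the helper name `extract_links` are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical page fragment; a real page would come from an HTTP response.
SAMPLE_HTML = '''
<a href="/book/1/">First</a>
<a href="/book/2/">Second</a>
<a href="http://example.com/about">About</a>
'''

def extract_links(html, base_url):
    """Return absolute URLs for every href found in html, in document order."""
    hrefs = re.findall(r'<a\s[^>]*href="([^"]*)"', html)
    # urljoin resolves relative paths against base_url and leaves
    # absolute URLs unchanged.
    return [urljoin(base_url, h) for h in hrefs]

links = extract_links(SAMPLE_HTML, "http://example.com/author/1/")
for link in links:
    print(link)
```

Resolving with `urljoin` avoids gluing the host and the path together by hand, which matters once a page mixes relative and absolute links.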

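I picked a site that hosts the collected works of Keigo Higashino (東野圭吾); the author page used in the code below is http://www.shunong.com/author/311/.
That page shows the list of works, and clicking either a cover image or a title opens that novel's page.
Now let's look at the page source.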

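The first novel in the list is 《家信》. Its entry sits in an a tag whose class is common, and the value of that tag's href attribute is the novel's URL.

We need this value for every novel. The code is as follows: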

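# Python 2.7; book() is defined in the next section.
import urllib2
from bs4 import BeautifulSoup

url = "http://www.shunong.com/author/311/"
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
# Every novel link is an <a> tag with class "common".
a = soup.find_all("a", class_="common")
n = 1
for b in a:
    # Each novel appears twice in the source; keep every second match.
    if n % 2 == 0:
        add = b.get('href').encode('utf-8')   # relative URL of the novel
        name = b.get_text().encode('utf-8')   # title of the novel
        print name
        book(add, name)
    n += 1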

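The if is there because the same information appears twice in the page source, so we keep only one copy.
Always call encode on the values taken from the list: bs returns Unicode strings, and writing them to a file unencoded can produce garbled text or an encoding error.
This gives us the URL of every novel on the page. The book function then handles each title and its URL.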
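
Keeping every second match works only while each link appears exactly twice and in a fixed order. A less position-dependent alternative is to collapse repeats by href. The sketch below assumes that the cover-image link and the title link share the same href and that the image link has no text; that is my reading of the page, not something the article confirms. The `(href, title)` pairs and the name `unique_books` are hypothetical.

```python
def unique_books(pairs):
    """Collapse repeated hrefs, keeping the first non-empty title for each."""
    titles = {}
    order = []
    for href, title in pairs:
        if href not in titles:
            titles[href] = title
            order.append(href)
        elif not titles[href] and title:
            # An earlier match had no text (e.g. an image link); use this title.
            titles[href] = title
    return [(href, titles[href]) for href in order]

# Hypothetical matches: each book appears once with empty text, once with a title.
pairs = [
    ("/book/1/", ""), ("/book/1/", "Book One"),
    ("/book/2/", ""), ("/book/2/", "Book Two"),
]
books = unique_books(pairs)
print(books)
```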

2. Scrape the chapters

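Again we look for the pattern in the URLs, taking 《家信》 as the example.
The page source lists every chapter together with its URL.
This part is the book function mentioned above; here I used a regex.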

def book(add, name):
    # Needs "import re" and "import urllib2" at the top of the file.
    url = "http://www.shunong.com" + add
    print url
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read()
    # Capture the href of every <li><a href="..." title="...">...</a></li> item.
    pattern = re.compile('<li><a href="(.*?)" title=".*?</a></li>')
    items = re.findall(pattern, content)
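
To see what the pattern captures, here is a small self-contained run against a made-up fragment. The `<li>` items below are hypothetical and only mimic the shape the regex expects. The sketch is written for Python 3.

```python
import re

# Same pattern as in book(): capture the href of each <li><a ... title=...> item.
pattern = re.compile('<li><a href="(.*?)" title=".*?</a></li>')

# Hypothetical chapter list, not copied from the real site.
sample = (
    '<ul>'
    '<li><a href="/book/1/1.html" title="Chapter 1">Chapter 1</a></li>'
    '<li><a href="/book/1/2.html" title="Chapter 2">Chapter 2</a></li>'
    '</ul>'
)

chapter_links = pattern.findall(sample)
print(chapter_links)
```

Note that in Python 3 `response.read()` returns bytes, so the downloaded content would have to be decoded before a str pattern like this one can be applied to it.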

With the list in hand, loop over it and call the chapter function on each item, and open a text file to hold this novel.

    n = 0
    # Append to "<novel name>.txt".
    f = open(name.decode('utf-8') + '.txt', 'a')
    for item in items:
        # Skip the first five matches.
        if n > 4:
            books = item
            d = chapter(books)
            print d
            f.write(d)
        n += 1
    f.close()
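
The counter `n` skips the first five matches. Slicing says the same thing more directly, and a `with` block closes the file even if a chapter fails partway through. Here is a sketch for Python 3 in which `fake_chapter`, `save_book`, the item list, and the output path are placeholders of mine rather than part of the original script:

```python
import io
import os
import tempfile

def fake_chapter(link):
    """Stand-in for chapter(); returns text instead of fetching a page."""
    return u"Title of %s\nBody of %s\n" % (link, link)

def save_book(items, path, skip=5):
    """Write the chapters for items[skip:] to path as UTF-8 text."""
    # io.open with an explicit encoding accepts Unicode text directly.
    with io.open(path, "a", encoding="utf-8") as f:
        for link in items[skip:]:
            f.write(fake_chapter(link))

# Five leading entries to skip, then two hypothetical chapter links.
items = ["nav%d" % i for i in range(5)] + ["/c/1.html", "/c/2.html"]
out_path = os.path.join(tempfile.mkdtemp(), "book.txt")
save_book(items, out_path)

with io.open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```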

3. Scrape the content

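Once we have a chapter's URL we can scrape the content of that chapter.
As before, look at the page source first.
The chapter title is in the h1 tag and the chapter text is in a p tag, but the source contains more than one p tag, so we have to pick the right one. Print every p tag, with a separator after each so they are easy to tell apart, then check which one holds the body text. Here it is the second.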
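
The inspection step just described can be tried offline. The sketch below uses Python 3's standard `html.parser` instead of BeautifulSoup so that it runs without extra packages; with bs the loop would run over `soup.find_all('p')` instead. The sample page and the class name `ParagraphCollector` are hypothetical.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element (nested <p> is not handled)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

# Hypothetical chapter page with more than one <p>.
sample = "<h1>Chapter 1</h1><p>site notice</p><p>The story begins.</p><p>footer</p>"

collector = ParagraphCollector()
collector.feed(sample)
for index, text in enumerate(collector.paragraphs):
    print(index, text)
    print("-" * 20)  # separator so each tag is easy to spot
```

In this made-up page the body text turns up at index 1, which corresponds to `b[1]` in the article's code.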

    soup = BeautifulSoup(content, 'html.parser')
    a = soup.find('h1')        # chapter title
    b = soup.find_all('p')     # every <p> on the page
    a = a.get_text(strip=True)
    a = a.encode('utf-8')
    d = a + '\n'
    # The second <p> holds the chapter text.
    c = b[1].get_text(strip=True).encode('utf-8')
    d = d + c
    d = d + '\n'
    return d

Once the content is assembled, return it to the calling function, which writes it to the file.

4. Full code

# coding=utf-8
from bs4 import BeautifulSoup
import urllib2
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def chapter(books):
    # Fetch one chapter page and return "title\ntext\n" as a UTF-8 byte string.
    url = "http://www.shunong.com" + books
    print url
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read()
    soup = BeautifulSoup(content, 'html.parser')
    a = soup.find('h1')        # chapter title
    b = soup.find_all('p')     # every <p> on the page
    a = a.get_text(strip=True)
    a = a.encode('utf-8')
    d = a + '\n'
    # The second <p> holds the chapter text.
    c = b[1].get_text(strip=True).encode('utf-8')
    d = d + c
    d = d + '\n'
    return d

def book(add, name):
    # Fetch one novel's page, then save all of its chapters to "<name>.txt".
    url = "http://www.shunong.com" + add
    print url
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read()
    # Capture the href of every <li><a href="..." title="...">...</a></li> item.
    pattern = re.compile('<li><a href="(.*?)" title=".*?</a></li>')
    items = re.findall(pattern, content)
    n = 0
    f = open(name.decode('utf-8') + '.txt', 'a')
    for item in items:
        # Skip the first five matches.
        if n > 4:
            books = item
            d = chapter(books)
            print d
            f.write(d)
        n += 1
    f.close()


# Entry point: list every novel on the author page and save each one.
url = "http://www.shunong.com/author/311/"
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
a = soup.find_all("a", class_="common")
n = 1
for b in a:
    # Each novel appears twice in the source; keep every second match.
    if n % 2 == 0:
        add = b.get('href').encode('utf-8')
        name = b.get_text().encode('utf-8')
        print name
        book(add, name)
    n += 1
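
The same request setup appears three times in the script. One possible way to factor it out is sketched below for Python 3, where `urllib.request` replaces `urllib2`. The helper names `build_request` and `fetch` and the 10-second timeout are my own choices, not part of the original code.

```python
import urllib.request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
BASE_URL = "http://www.shunong.com"

def build_request(path):
    """Return a Request for BASE_URL + path carrying the User-Agent header."""
    return urllib.request.Request(BASE_URL + path,
                                  headers={"User-Agent": USER_AGENT})

def fetch(path, timeout=10):
    """Download a page and return its raw bytes (performs network access)."""
    with urllib.request.urlopen(build_request(path), timeout=timeout) as resp:
        return resp.read()

# Building the request does not touch the network; fetch() would.
req = build_request("/author/311/")
print(req.full_url)
```

With such a helper, each of the three functions would shrink to a single `fetch(...)` call followed by its own parsing.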
