Python crawler (scraping high-resolution images from Fengniao): empty pages and error handling

__author__ = 'AllenMinD'
import requests,urllib,os
from bs4 import  BeautifulSoup

ans = 1 #number used to name the next downloaded picture

for page in range(0,43):
    flag = 1 #1 if the page has a picture, 0 if it is empty
    if page<10:
        url = 'http://bbs.fengniao.com/forum/pic/slide_101_8903443_8017670'+str(page)+'.html'
    else:
        url = 'http://bbs.fengniao.com/forum/pic/slide_101_8903443_801767'+str(page)+'.html'
    source_code = requests.get(url)
    plain_text = source_code.text

    soup = BeautifulSoup(plain_text,'lxml')

    file_name = ''
    download_link = []
    for pic_tag in soup.find_all('a'):
        if pic_tag.get('href') == '/forum/8903443.html':
            file_name  = pic_tag.get('title')
        if pic_tag.get('class') == ['pictureDownload']:
            if pic_tag.get('href') == '': #if this page is None
                flag = 0
                break
            else:
                download_link.append(pic_tag.get('href'))

    if flag == 0 : #this page is None
        continue

    folder_path = 'D:/spider_things/2016.4.8/' + file_name + '/'

    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

    for item in download_link:
        try:
            urllib.urlretrieve(item,folder_path + str(ans) + '.jpg')
            print 'you have downloaded' , ans , 'pic(s)'
            ans = ans + 1
        except urllib.ContentTooShortError,e: #download was cut short (fewer bytes than Content-Length), skip it
            continue
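
The two URL branches in the loop above amount to zero-padding the page number to two digits. As a minimal sketch, assuming Python 3 (the script itself is Python 2), the same URLs could be built with a format specifier. The names `BASE` and `page_url` are hypothetical helpers introduced only for this illustration:

```python
# Common prefix of every page URL in this gallery (taken from the script above).
BASE = 'http://bbs.fengniao.com/forum/pic/slide_101_8903443_801767'


def page_url(page):
    """Build the URL for one page, padding the page number to two digits."""
    return '{}{:02d}.html'.format(BASE, page)


# Same range as the original loop: pages 0 to 42.
urls = [page_url(page) for page in range(0, 43)]
```

This replaces the `if page<10` check with `{:02d}`, which yields `00` to `09` for single-digit pages and leaves two-digit pages unchanged.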

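This time I was again scraping images from Fengniao, but I ran into two new problems along the way:

1. Empty pages:

In some Fengniao galleries the image page URLs are not numbered consecutively, so some URLs in the range lead to a page with no picture. On such a page the download link's href is an empty string, so an if statement can detect it and skip the page: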

            if pic_tag.get('href') == '': #if this page is None
                flag = 0
                break
.....

    if flag == 0 : #this page is None
        continue
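
The empty-page check can also be wrapped in a small function, which makes it possible to try it on sample HTML without touching the network. The following is only a sketch under a few assumptions: it targets Python 3, it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and the names `DownloadLinkParser`, `extract_download_links`, and the two sample HTML strings are made up for illustration. Unlike the script, it accepts `pictureDownload` as one of several classes rather than requiring it to be the only class.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class includes pictureDownload."""

    def __init__(self):
        super().__init__()
        self.links = []          # non-empty download links found so far
        self.empty_page = False  # set when a download link has an empty href

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if 'pictureDownload' in classes:
            href = attrs.get('href') or ''
            if href == '':
                self.empty_page = True
            else:
                self.links.append(href)


def extract_download_links(html):
    """Return the list of download links, or None if the page is empty."""
    parser = DownloadLinkParser()
    parser.feed(html)
    parser.close()
    if parser.empty_page:
        return None
    return parser.links


# Hypothetical page fragments, not real Fengniao markup.
full_page = '<a class="pictureDownload" href="http://example.com/1.jpg">download</a>'
empty_page = '<a class="pictureDownload" href="">download</a>'

print(extract_download_links(full_page))   # ['http://example.com/1.jpg']
print(extract_download_links(empty_page))  # None
```

Returning `None` plays the role of `flag = 0` in the script: the caller can `continue` to the next page when it sees `None`.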

2. Error handling

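While scraping these images, some downloads failed with urllib.ContentTooShortError. urllib.urlretrieve raises this error when it receives fewer bytes than the Content-Length header announced, usually because the transfer was cut off; in my case this happened with the larger pictures.

When this happens, use try...except to catch the error and skip that picture.

In Python 2, which this script uses, the format of try...except is: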

try:
    ......
except ErrorType,e: #e.g. urllib.ContentTooShortError
    ......
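
For comparison, here is a rough sketch of the same download loop in Python 3, where `urlretrieve` lives in `urllib.request`, `ContentTooShortError` lives in `urllib.error`, and the exception is bound with `as` instead of a comma. The function name `download_all` and its `retrieve` parameter are assumptions added for illustration; the parameter only exists so the loop can be tried with a stand-in function instead of a real download.

```python
import os
from urllib.error import ContentTooShortError
from urllib.request import urlretrieve


def download_all(links, folder, retrieve=urlretrieve):
    """Save each link as 1.jpg, 2.jpg, ... and return (saved, skipped)."""
    os.makedirs(folder, exist_ok=True)
    saved = 0
    skipped = []
    for link in links:
        target = os.path.join(folder, str(saved + 1) + '.jpg')
        try:
            retrieve(link, target)
        except ContentTooShortError as e:
            # Fewer bytes arrived than Content-Length announced: skip this picture.
            print('skipped', link, '-', e)
            skipped.append(link)
            continue
        saved += 1
        print('you have downloaded', saved, 'pic(s)')
    return saved, skipped
```

As in the original script, the counter only advances after a successful download, so the file names stay consecutive even when a picture is skipped.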