Python 3.5: Scraping Movie Data from a Website
By 阿新 • Published: 2017-08-15
First, import a few Python 3 libraries:
import urllib.request
from html.parser import HTMLParser
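One important difference between Python 2 and Python 3 is that Python 2 had two separate libraries, urllib and urllib2, while Python 3 merges them into a single urllib package (with submodules such as urllib.request, urllib.parse, and urllib.error), and some of the function names and call patterns changed slightly. Start by defining a function that wraps the header, the URL, and the request together so that it is easy to call. See the code below: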
def print_movies(url):
    # Pretend to be a browser. This does not help much, since server-side
    # middleware can detect it easily, but the request fails without it,
    # so set it anyway.
    header = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/51.0.2704.103 Safari/537.36')}
    # Python 3's urllib
    req = urllib.request.Request(url, headers=header)
    s = urllib.request.urlopen(req)
    parser = MovieParser()
    parser.feed((s.read()).decode('utf-8'))
    s.close()
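As a small offline sketch of the urllib.request pieces used above, the block below builds a Request with a custom header and inspects it without touching the network. The URL and the shortened User-Agent string are placeholders chosen for illustration, not values from the original code.

```python
import urllib.request

# Placeholder values for illustration only.
url = 'https://example.com/movies'
header = {'User-Agent': 'Mozilla/5.0 (illustrative placeholder)'}

# Building a Request does not open a connection; urlopen() does that later.
req = urllib.request.Request(url, headers=header)

print(req.full_url)      # the URL the request will be sent to
print(req.get_method())  # GET, because no data was attached
# Request stores header names in capitalized form, i.e. 'User-agent'.
print(req.get_header('User-agent'))
```

Inspecting the Request object this way can be handy for checking headers before pointing the script at a real site.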
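Next, override the handle_starttag(self, tag, attrs) method of HTMLParser. The parser then calls your override automatically for every start tag it encounters. The calling conventions are described in detail in the official documentation for HTMLParser.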
class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []

    # Override HTMLParser's built-in handler
    def handle_starttag(self, tag, attrs):
        def _attr(attrlist, attrname):
            # attrs is a list of (name, value) tuples
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None

        # In the page source, the identifying attributes of each li tag
        # (e.g. category) follow its class attribute; test for them here.
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s|%(rate)s|%(director)s|%(actors)s' % movie)
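To see when handle_starttag fires, here is a minimal sketch that feeds a hand-written HTML snippet to a tiny parser. The TagLogger class and the sample markup are made up for illustration; only the attribute names mirror the ones the code above looks for.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples; dict() makes
        # lookups by attribute name easier.
        self.seen.append((tag, dict(attrs)))


# Made-up sample markup, not real page content.
sample = ('<ul><li class="item" data-title="Movie A" data-rate="8.1">'
          '</li><li class="other"></li></ul>')

logger = TagLogger()
logger.feed(sample)

for tag, attrs in logger.seen:
    print(tag, attrs.get('data-title'))
```

Only the first li carries data-title, which is why a check such as tag == 'li' combined with an attribute test is enough to pick out the movie entries.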
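When execution reaches parser.feed((s.read()).decode('utf-8')), it helps to know why the line is written this way. parser is an instance of a subclass of HTMLParser, so it has a feed() method. s.read() returns bytes, but feed() only accepts str, so we append decode('utf-8') to decode the bytes into a str (in UTF-8, a Chinese character typically takes three bytes). That is all it takes to fetch the data. To scrape a different site, change the URL and the condition in handle_starttag. Here is the complete code: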
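The bytes-versus-str point can be checked offline. The sketch below uses a two-character sample string of my own choosing in place of a real response body.

```python
from html.parser import HTMLParser

text = '電影'                 # two Chinese characters (illustrative sample)
raw = text.encode('utf-8')    # bytes, like the value s.read() returns

print(type(raw).__name__, len(raw))   # three bytes per character here
decoded = raw.decode('utf-8')
print(type(decoded).__name__, decoded == text)

# feed() expects str; passing bytes raises TypeError.
parser = HTMLParser()
try:
    parser.feed(raw)
    feed_error = None
except TypeError as exc:
    feed_error = exc
print('feed(bytes) raised:', type(feed_error).__name__)

parser.feed(decoded)          # a str is accepted
```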
import urllib.request
from html.parser import HTMLParser


class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []

    # Override HTMLParser's built-in handler
    def handle_starttag(self, tag, attrs):
        def _attr(attrlist, attrname):
            # attrs is a list of (name, value) tuples
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None

        # In the page source, the identifying attributes of each li tag
        # (e.g. category) follow its class attribute; test for them here.
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s|%(rate)s|%(director)s|%(actors)s' % movie)


def print_movies(url):
    # Pretend to be a browser. This does not help much, since server-side
    # middleware can detect it easily, but the request fails without it,
    # so set it anyway.
    header = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/51.0.2704.103 Safari/537.36')}
    # Python 3's urllib
    req = urllib.request.Request(url, headers=header)
    s = urllib.request.urlopen(req)
    parser = MovieParser()
    parser.feed((s.read()).decode('utf-8'))
    s.close()


if __name__ == '__main__':
    url = 'https://movie.douban.com/'
    # Print the movies found on the page
    print_movies(url)
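As one possible way to adapt the script to another site, the sketch below separates fetching from parsing and turns the tag name and the marker attribute into parameters. AttrCollector, fetch_html, collect, and the sample markup are hypothetical names introduced for illustration, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import urllib.request
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical parser: collects the attributes of tags that carry a marker attribute."""

    def __init__(self, tag, marker):
        super().__init__()
        self.tag = tag        # tag name to look for, e.g. 'li'
        self.marker = marker  # attribute that must be present, e.g. 'data-title'
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and attrs.get(self.marker):
            self.items.append(attrs)


def fetch_html(url, user_agent='Mozilla/5.0', timeout=10):
    """Hypothetical helper: download a page and decode it to str."""
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Use the charset the server declares; assume UTF-8 if it declares none.
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset)


def collect(html, tag, marker):
    """Parse html and return the attribute dicts of the matching tags."""
    parser = AttrCollector(tag, marker)
    parser.feed(html)
    return parser.items


# Offline demonstration on made-up markup (no network access needed).
sample = '<div data-name="A" data-price="3"></div><div class="ad"></div>'
items = collect(sample, 'div', 'data-name')
print(items)
```

Against a live page, the call would look like collect(fetch_html(url), 'li', 'data-title'), with the tag and attribute chosen after inspecting that site's HTML.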
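Running the script prints one line per movie, in the form title|rate|director|actors:
[Image: screenshot of the program output]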