
Python 3.5: Scraping Movie Data from a Website


First, we import a few Python 3 libraries:

from urllib import request
import urllib
from html.parser import HTMLParser

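An important difference between Python 2 and Python 3 is that Python 2 had two libraries, urllib and urllib2, while Python 3 merges them into a single urllib package, and the way some of the functions are called changed slightly as well. We start by defining a function that bundles the header, the URL, and the request together so that it is easy to call. See the code below: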

def print_movies(url):
    # Pretend to be a browser. This does not help much, since middleware can
    # detect it easily, but the request does not work without it, so set it anyway.
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    # Python 3's urllib
    req = urllib.request.Request(url, headers=header)
    s = urllib.request.urlopen(req)
    parser = MovieParser()
    parser.feed((s.read()).decode('utf-8'))
    s.close()
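As a small offline check of the request-building step, the sketch below constructs a Request with a User-Agent header and inspects it without contacting any server. The shortened User-Agent string and the variable names are assumptions made for illustration. One detail worth knowing: Request stores header names capitalized, so the key to pass to get_header() is User-agent.

```python
from urllib import request

# Hypothetical, shortened User-Agent string used only for this illustration.
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Building a Request does not open a connection; only urlopen() would.
req = request.Request('https://movie.douban.com/', headers=header)

# Request normalizes header names with str.capitalize(), so the stored
# key is 'User-agent' rather than 'User-Agent'.
user_agent = req.get_header('User-agent')
print(req.full_url)
print(user_agent)
```

Inspecting the request this way makes it easy to confirm that the header is attached before any traffic is sent.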

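Next, override handle_starttag(self, tag, attrs) from HTMLParser; the parser then calls the overridden version automatically. The official HTMLParser documentation describes in detail how it is called.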

class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []

    # Override HTMLParser's built-in method
    def handle_starttag(self, tag, attrs):
        def _attr(attrlist, attrname):
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None
        # Each li tag carries distinguishing attributes after its class
        # attribute (category, for example); test for them here.
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s|%(rate)s|%(director)s|%(actors)s' % movie)
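To see what handle_starttag actually receives, here is a minimal sketch that feeds a hard-coded HTML fragment to a small parser. The class name TagRecorder and the sample markup are made up for illustration and are not taken from the real page. attrs arrives as a list of (name, value) tuples, so dict(attrs) is one possible alternative to a hand-written lookup helper such as _attr.

```python
from html.parser import HTMLParser

class TagRecorder(HTMLParser):
    """Hypothetical helper that records every start tag it sees."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [('data-rate', '8.5')]
        self.seen.append((tag, attrs))

# Made-up sample markup; the real page structure may differ.
sample = '<ul><li class="item" data-title="Example Movie" data-rate="8.5">x</li></ul>'

recorder = TagRecorder()
recorder.feed(sample)
recorder.close()

tag, attrs = recorder.seen[1]          # the <li> tag follows the <ul> tag
lookup = dict(attrs)                   # turn the tuple list into a dict
print(tag, lookup.get('data-title'), lookup.get('data-rate'))
print(lookup.get('data-director'))     # missing attributes give None
```

Like _attr, dict.get returns None for an attribute that is absent, so the same truthiness check on data-title still works.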

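When execution reaches parser.feed((s.read()).decode('utf-8')), it helps to understand why the call is written this way. First, parser is an instance of a subclass of HTMLParser, so it has a feed() method. When the data is fed in, s.read() returns bytes, but feed() only accepts str, so we append decode('utf-8') to convert the bytes to str (in UTF-8, a Chinese character typically takes three bytes). Fetching the data really is that simple. To scrape a different site, change the URL and the condition in handle_starttag. Here is the complete code: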

from urllib import request
import urllib
from html.parser import HTMLParser

class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []

    # Override HTMLParser's built-in method
    def handle_starttag(self, tag, attrs):
        def _attr(attrlist, attrname):
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None
        # Each li tag carries distinguishing attributes after its class
        # attribute (category, for example); test for them here.
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s|%(rate)s|%(director)s|%(actors)s' % movie)

def print_movies(url):
    # Pretend to be a browser. This does not help much, since middleware can
    # detect it easily, but the request does not work without it, so set it anyway.
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    # Python 3's urllib
    req = urllib.request.Request(url, headers=header)
    s = urllib.request.urlopen(req)
    parser = MovieParser()
    parser.feed((s.read()).decode('utf-8'))
    s.close()


if __name__ == '__main__':
    url = 'https://movie.douban.com/'
    # Print the list of movies
    print_movies(url)

Running the script prints one line for each matching li tag, in the form title|rate|director|actors.
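The bytes-to-str step in parser.feed((s.read()).decode('utf-8')) can also be tried without a network connection. The sketch below stands in for s.read() with locally encoded bytes; the sample markup, the title value, and the TitleCollector class are assumptions for illustration only.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Hypothetical parser that collects data-title values from <li> tags."""
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        title = dict(attrs).get('data-title')
        if tag == 'li' and title:
            self.titles.append(title)

# Stand-in for s.read(): made-up markup encoded to UTF-8 bytes.
raw = '<li data-title="電影">'.encode('utf-8')
chinese_bytes = '電影'.encode('utf-8')   # two characters, three bytes each

collector = TitleCollector()
try:
    collector.feed(raw)                  # bytes are rejected by feed()
    fed_bytes = True
except TypeError:
    fed_bytes = False

collector = TitleCollector()             # start over with a fresh parser
collector.feed(raw.decode('utf-8'))      # str is accepted
collector.close()
print(len(chinese_bytes), fed_bytes, collector.titles)
```

Testing against a fixed string like this is a convenient way to adjust the tag and attribute conditions before pointing the scraper at a live URL.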
