Calling the Douban API from Python to fetch the Top 250 movies and store them in a database
阿新 • Published: 2019-02-17
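Preface: after learning Python for a while, I wanted to tie together what I had learned, and this small program is the result. I'm a beginner, and this is a record of my learning process ^-^
Python is widely used for web crawlers, but a crawler sometimes needs complicated page parsing and regular-expression matching. Beginners tend to hit one problem after another while working through this, and often give up halfway. In fact, many websites now expose a public API, and sometimes the API alone is enough to get the data you want, in a much friendlier way.
Douban API address: https://developers.douban.com/wiki/?title=api_v2
There you can find the various interfaces Douban exposes. This time I picked the one that returns the Top 250 movies: https://api.douban.com/v2/movie/top250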
#!/usr/bin/python
#coding: utf-8
import urllib2
import json
import sqlite3

#url = 'https://api.douban.com/v2/movie/in_theaters'
fname = 'D:/Python/workspace/test0814/Top250'  # temporary location for the downloaded file

conn = sqlite3.connect('DoubanApi.sqlite')
cur = conn.cursor()

# Create the table
cur.executescript('''
DROP TABLE IF EXISTS Top250;
CREATE TABLE Top250 (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    film_name TEXT UNIQUE,
    film_director TEXT,
    film_artist1 TEXT,
    film_artist2 TEXT,
    film_year INTEGER,
    film_rate FLOAT,
    film_id INTEGER
);
''')

def GetData(startnum):
    # Fetch one page of data.
    # Douban returns at most 100 entries per request, so the list has to be
    # pulled in several requests; start is the position the request begins at.
    url = 'https://api.douban.com/v2/movie/top250?start=' + str(startnum) + '&count=100'
    urlfile = urllib2.urlopen(url).read()
    # Save the response to a file; this is not strictly needed,
    # the response could be processed directly.
    fw = open(fname, 'w')
    fw.write(urlfile)
    fw.close()
    print "file download success."

def JsonParse(count):
    # Parse the JSON and add the films to the database.
    fr = open(fname).read()
    fjson = json.loads(fr)
    for film in range(count):
        # Pull the fields we need out of the JSON
        try:
            film_name = fjson['subjects'][film]['title']
            film_rate = fjson['subjects'][film]['rating']['average']
            film_artist1 = fjson['subjects'][film]['casts'][0]['name']
            film_artist2 = fjson['subjects'][film]['casts'][1]['name']
            film_director = fjson['subjects'][film]['directors'][0]['name']
            film_id = fjson['subjects'][film]['id']
            film_year = fjson['subjects'][film]['year']
        except (KeyError, IndexError):
            print "Parse failed."
            continue  # skip this film rather than insert the previous film's values
        # Store the data in the database
        try:
            cur.execute('''INSERT OR REPLACE INTO Top250
                (film_name, film_director, film_artist1, film_artist2, film_year, film_rate, film_id)
                VALUES ( ?, ?, ?, ?, ?, ?, ? )''',
                (film_name, film_director, film_artist1, film_artist2, film_year, film_rate, film_id))
            #conn.commit()
        except sqlite3.Error:
            print "SQL failed."
    conn.commit()
    print "JsonParse done"

# Because of the API's per-request limit the fetch has to run several times.
# This looks a bit clumsy and could be optimized.
GetData(0)
JsonParse(100)
GetData(100)
JsonParse(100)
GetData(200)
JsonParse(50)
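The repeated GetData/JsonParse calls at the end of the script can probably be replaced by a loop. Below is a minimal sketch of that idea, assuming the limit of 100 entries per request mentioned in the script's comments and a list size of 250. It is written for Python 3 (the script above uses Python 2's urllib2), and `page_requests` and `page_url` are hypothetical helper names of mine, not part of the Douban API. It only computes the (start, count) pairs and the URLs and makes no network request.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.douban.com/v2/movie/top250'  # endpoint named in the article
TOTAL = 250       # assumed size of the list
PAGE_SIZE = 100   # assumed per-request limit, taken from the script's comment


def page_requests(total=TOTAL, page_size=PAGE_SIZE):
    """Yield (start, count) pairs that cover `total` items page by page."""
    for start in range(0, total, page_size):
        # The last page may be shorter than a full page.
        yield start, min(page_size, total - start)


def page_url(start, count):
    """Build the request URL for one page (hypothetical helper)."""
    return BASE_URL + '?' + urlencode({'start': start, 'count': count})


if __name__ == '__main__':
    for start, count in page_requests():
        print(start, count, page_url(start, count))
```

With this in place, the six calls at the end of the script could collapse into one loop that calls GetData(start) and then JsonParse(count) for each pair.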
Here is the format of the Douban top250 JSON, which I laid out by hand:
{
  "count": 100,
  "start": 200,
  "total": 250,
  "subjects": [
    {
      "rating": {"max": 10, "average": 8.6, "stars": "45", "min": 0},
      "genres": ["\u72af\u7f6a", "\u5267\u60c5", "\u60ca\u609a"],
      "title": "\u672b\u8def\u72c2\u82b1",
      "casts": [
        {
          "alt": "https:\/\/movie.douban.com\/celebrity\/1054392\/",
          "avatars": {
            "small": "https://img3.doubanio.com\/img\/celebrity\/small\/10323.jpg",
            "large": "https://img3.doubanio.com\/img\/celebrity\/large\/10323.jpg",
            "medium": "https://img3.doubanio.com\/img\/celebrity\/medium\/10323.jpg"
          },
          "name": "\u82cf\u73ca\u00b7\u8428\u5170\u767b",
          "id": "1054392"
        },
        {
          "alt": "https:\/\/movie.douban.com\/celebrity\/1048131\/",
          "avatars": {
            "small": "https://img3.doubanio.com\/img\/celebrity\/small\/28805.jpg",
            "large": "https://img3.doubanio.com\/img\/celebrity\/large\/28805.jpg",
            "medium": "https://img3.doubanio.com\/img\/celebrity\/medium\/28805.jpg"
          },
          "name": "\u54c8\u5a01\u00b7\u51ef\u7279\u5c14",
          "id": "1048131"
        }
      ],
      "collect_count": 111626,
      "original_title": "Thelma & Louise",
      "subtype": "movie",
      "directors": [
        {
          "alt": "https:\/\/movie.douban.com\/celebrity\/1054416\/",
          "avatars": {
            "small": "https://img1.doubanio.com\/img\/celebrity\/small\/588.jpg",
            "large": "https://img1.doubanio.com\/img\/celebrity\/large\/588.jpg",
            "medium": "https://img1.doubanio.com\/img\/celebrity\/medium\/588.jpg"
          },
          "name": "\u96f7\u5fb7\u5229\u00b7\u65af\u79d1\u7279",
          "id": "1054416"
        }
      ],
      "year": "1991",
      "images": {
        "small": "https://img3.doubanio.com\/view\/movie_poster_cover\/ipst\/public\/p794583044.jpg",
        "large": "https://img3.doubanio.com\/view\/movie_poster_cover\/lpst\/public\/p794583044.jpg",
        "medium": "https://img3.doubanio.com\/view\/movie_poster_cover\/spst\/public\/p794583044.jpg"
      },
      "alt": "https:\/\/movie.douban.com\/subject\/1291992\/",
      "id": "1291992"
    }
    (the remaining 99 elements of the list are omitted here)
  ]
}
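To show how the fields in this JSON map onto the columns of the Top250 table, here is a small Python 3 sketch that works on a trimmed copy of the sample above. `name_at` and `extract_film` are hypothetical helpers of mine. Converting `year` and `id` to integers is my own assumption based on this one sample, where both arrive as numeric strings. Unlike indexing `casts[1]` directly, the sketch returns None when a film lists fewer than two cast members.

```python
import json

# Trimmed copy of the sample response above (only the fields used below).
SAMPLE = r'''
{"count": 100, "start": 200, "total": 250, "subjects": [
  {"rating": {"max": 10, "average": 8.6, "stars": "45", "min": 0},
   "title": "\u672b\u8def\u72c2\u82b1",
   "casts": [{"name": "\u82cf\u73ca\u00b7\u8428\u5170\u767b", "id": "1054392"},
             {"name": "\u54c8\u5a01\u00b7\u51ef\u7279\u5c14", "id": "1048131"}],
   "directors": [{"name": "\u96f7\u5fb7\u5229\u00b7\u65af\u79d1\u7279", "id": "1054416"}],
   "year": "1991",
   "id": "1291992"}
]}
'''


def name_at(people, index):
    """Return people[index]['name'], or None when the list is too short."""
    return people[index]['name'] if index < len(people) else None


def extract_film(subject):
    """Flatten one element of 'subjects' into the Top250 table columns."""
    casts = subject.get('casts', [])
    directors = subject.get('directors', [])
    return {
        'film_name': subject['title'],
        'film_director': name_at(directors, 0),
        'film_artist1': name_at(casts, 0),
        'film_artist2': name_at(casts, 1),
        'film_year': int(subject['year']),   # assumption: year is a numeric string
        'film_rate': float(subject['rating']['average']),
        'film_id': int(subject['id']),       # assumption: id is a numeric string
    }


data = json.loads(SAMPLE)
films = [extract_film(subject) for subject in data['subjects']]
print(films[0])
```

Iterating over `data['subjects']` directly also removes the need to pass the number of entries in by hand.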
To keep things simple I used an SQLite database. With the SQLite Manager add-on installed in Firefox, you can browse it directly through a graphical interface.
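If you would rather not install a browser add-on, the table can also be checked from Python itself. The sketch below (Python 3) is one possible way to do that. `top_rated` is a hypothetical helper of mine, and the demo runs against an in-memory database with two invented placeholder rows so that it stays self-contained. To query the real data, I assume you would swap in `sqlite3.connect('DoubanApi.sqlite')`.

```python
import sqlite3


def top_rated(conn, limit=5):
    """Return (film_name, film_year, film_rate) rows, best rating first."""
    cur = conn.execute(
        'SELECT film_name, film_year, film_rate FROM Top250 '
        'ORDER BY film_rate DESC, film_name LIMIT ?', (limit,))
    return cur.fetchall()


# Demo database with the same column names as the article's table.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE Top250 (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    film_name TEXT UNIQUE,
    film_director TEXT,
    film_artist1 TEXT,
    film_artist2 TEXT,
    film_year INTEGER,
    film_rate FLOAT,
    film_id INTEGER)''')

# Placeholder rows invented for this demo; they are not real Douban data.
conn.executemany(
    'INSERT INTO Top250 (film_name, film_year, film_rate, film_id) '
    'VALUES (?, ?, ?, ?)',
    [('Film A', 1991, 8.6, 1), ('Film B', 1994, 9.1, 2)])
conn.commit()

for row in top_rated(conn):
    print(row)
```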
All done. As a beginner, I instantly felt motivated to keep learning.