Using Python to Call the Douban API, Fetch the Top 250 Movies, and Store Them in a Database

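Preface: After studying Python for a while, I wanted to tie together what I had learned in a single project, and this small program is the result. I am a beginner, and this post is a record of my learning process.   ^-^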

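Python is widely used for writing web crawlers, but a crawler often has to do complicated page parsing and regular-expression matching. For a beginner, problems tend to turn up one after another along the way, and it is easy to give up halfway. In fact, many websites now expose a public API, and calling the API is sometimes enough to get the data you want, in a much friendlier form.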

Douban API documentation: https://developers.douban.com/wiki/?title=api_v2
It lists the interfaces that Douban exposes publicly. This time I chose the one that returns the Top 250 movies:
https://api.douban.com/v2/movie/top250
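
The endpoint is paged through two query parameters, start and count, both of which appear in the script that follows. As a small Python 3 sketch, the request URL can be assembled with the standard library instead of string concatenation. The name build_url is a hypothetical helper of my own, not part of the Douban API.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.douban.com/v2/movie/top250'


def build_url(start, count=100):
    """Return the Top 250 URL for one page (hypothetical helper)."""
    # A list of pairs keeps the parameter order fixed: start first, then count.
    query = urlencode([('start', start), ('count', count)])
    return BASE_URL + '?' + query


print(build_url(200, 50))
```

urlencode also takes care of escaping, which matters little for two integers but avoids mistakes if more parameters are added later.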

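#!/usr/bin/python
# coding: utf-8
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 offers the same urlopen() call
import json
import sqlite3

#url = 'https://api.douban.com/v2/movie/in_theaters'
fname = 'D:/Python/workspace/test0814/Top250'  # temporary location for the downloaded file
conn = sqlite3.connect('DoubanApi.sqlite')
cur = conn.cursor()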
cur.executescript('''
DROP TABLE IF EXISTS Top250;

CREATE TABLE Top250 (
    id  INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    film_name    TEXT UNIQUE,
    film_director    TEXT,
    film_artist1    TEXT,
    film_artist2    TEXT,
    film_year    INTEGER,
    film_rate    FLOAT,
    film_id    INTEGER
);
''')        # create the table
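def GetData(startnum):     # fetch one page of results
    # Douban returns at most 100 entries per request, so the list has to be
    # fetched in several requests; start is the offset where each one begins
    url = 'https://api.douban.com/v2/movie/top250?start=' + str(startnum) + '&count=100'
    urlfile = urllib2.urlopen(url).read()
    fw = open(fname, 'wb')     # binary mode, because read() returns raw bytes
    fw.write(urlfile)     # save the response to disk; optional, it could be parsed directly
    fw.close()
    print("file download success.")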
def JsonParse(count):  # parse the JSON and add each film to the database
    with open(fname, 'rb') as f:
        fjson = json.loads(f.read())
    for film in range(count):  # pull the fields we need out of each entry
        try:
            film_name = fjson['subjects'][film]['title']
            film_rate = fjson['subjects'][film]['rating']['average']
            film_artist1 = fjson['subjects'][film]['casts'][0]['name']
            film_artist2 = fjson['subjects'][film]['casts'][1]['name']
            film_director = fjson['subjects'][film]['directors'][0]['name']
            film_id = fjson['subjects'][film]['id']
            film_year = fjson['subjects'][film]['year']
        except (KeyError, IndexError, TypeError):
            # Skip this entry; otherwise values left over from the
            # previous film would be inserted below
            print("Parse failed.")
            continue
        try:   # store the record in the database
            cur.execute('''INSERT OR REPLACE INTO Top250
                (film_name, film_director, film_artist1, film_artist2, film_year, film_rate, film_id)
                VALUES (?, ?, ?, ?, ?, ?, ?)''',
                (film_name, film_director, film_artist1, film_artist2, film_year, film_rate, film_id))
        except sqlite3.Error:
            print("SQL failed.")
    conn.commit()   # one commit per page instead of one per row
    print("JsonParse done")

GetData(0)   # the API's per-request limit means running this several times; it looks clumsy and could be optimized
JsonParse(100)
GetData(100)
JsonParse(100)
GetData(200)
JsonParse(50)
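
The three hard-coded call pairs above can be folded into a loop. Below is a minimal Python 3 sketch of the offset arithmetic, assuming 250 entries in total and a page size of 100. The name page_ranges is a hypothetical helper, and the commented-out lines only indicate where the script's GetData and JsonParse would be called.

```python
def page_ranges(total, page_size):
    """Yield (start, count) pairs that cover `total` items page by page."""
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield start, min(page_size, total - start)


pages = list(page_ranges(250, 100))
print(pages)   # [(0, 100), (100, 100), (200, 50)]

# In the script this would drive the two functions defined earlier:
# for start, count in page_ranges(250, 100):
#     GetData(start)
#     JsonParse(count)
```

The pairs match the manual calls exactly, so the loop changes only how the calls are written, not what is requested.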


Here is the Top 250 JSON response, which I reformatted by hand for readability:

{
	"count": 100, 
	"start": 200, 
	"total": 250, 
	"subjects": 
		[
			{
				"rating": {"max": 10, "average": 8.6, "stars": "45", "min": 0}, 
				"genres": ["\u72af\u7f6a", "\u5267\u60c5", "\u60ca\u609a"], 
				"title": "\u672b\u8def\u72c2\u82b1", 
				"casts": 
					[
						{
							"alt": "https:\/\/movie.douban.com\/celebrity\/1054392\/", 
							"avatars": {"small": "https://img3.doubanio.com\/img\/celebrity\/small\/10323.jpg", 
							"large": "https://img3.doubanio.com\/img\/celebrity\/large\/10323.jpg", 
							"medium": "https://img3.doubanio.com\/img\/celebrity\/medium\/10323.jpg"}, 
							"name": "\u82cf\u73ca\u00b7\u8428\u5170\u767b", 
							"id": "1054392"
						}, 
						{
							"alt": "https:\/\/movie.douban.com\/celebrity\/1048131\/", 
							"avatars": {"small": "https://img3.doubanio.com\/img\/celebrity\/small\/28805.jpg", 
							"large": "https://img3.doubanio.com\/img\/celebrity\/large\/28805.jpg", 
							"medium": "https://img3.doubanio.com\/img\/celebrity\/medium\/28805.jpg"}, 
							"name": "\u54c8\u5a01\u00b7\u51ef\u7279\u5c14", "id": "1048131"
						}
					],
				"collect_count": 111626, 
				"original_title": "Thelma & Louise", 
				"subtype": "movie", 
				"directors": 
					[
						{
							"alt": "https:\/\/movie.douban.com\/celebrity\/1054416\/", 
							"avatars": {"small": "https://img1.doubanio.com\/img\/celebrity\/small\/588.jpg", 
							"large": "https://img1.doubanio.com\/img\/celebrity\/large\/588.jpg", 
							"medium": "https://img1.doubanio.com\/img\/celebrity\/medium\/588.jpg"}, 
							"name": "\u96f7\u5fb7\u5229\u00b7\u65af\u79d1\u7279", 
							"id": "1054416"
						}
					], 
				"year": "1991", 
				"images": {"small": "https://img3.doubanio.com\/view\/movie_poster_cover\/ipst\/public\/p794583044.jpg", 
				"large": "https://img3.doubanio.com\/view\/movie_poster_cover\/lpst\/public\/p794583044.jpg", 
				"medium": "https://img3.doubanio.com\/view\/movie_poster_cover\/spst\/public\/p794583044.jpg"}, 
				"alt": "https:\/\/movie.douban.com\/subject\/1291992\/", 
				"id": "1291992"
			}
			(the remaining 99 elements of the list are omitted here)
		]
}
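
Given that layout, the fields the script stores can be read from one subjects entry roughly as follows. This is a hedged Python 3 sketch: extract_film is a hypothetical helper, the sample entry is trimmed from the response above, and filling a missing director or second cast member with None is my own choice rather than anything the API prescribes.

```python
# One entry trimmed down from the response shown above
sample = {
    'rating': {'max': 10, 'average': 8.6, 'stars': '45', 'min': 0},
    'title': '\u672b\u8def\u72c2\u82b1',
    'casts': [
        {'name': '\u82cf\u73ca\u00b7\u8428\u5170\u767b', 'id': '1054392'},
        {'name': '\u54c8\u5a01\u00b7\u51ef\u7279\u5c14', 'id': '1048131'},
    ],
    'directors': [
        {'name': '\u96f7\u5fb7\u5229\u00b7\u65af\u79d1\u7279', 'id': '1054416'},
    ],
    'year': '1991',
    'id': '1291992',
}


def extract_film(subject):
    """Return the seven stored fields from one `subjects` entry."""
    names = [cast['name'] for cast in subject.get('casts', [])]
    names += [None] * (2 - len(names))   # pad so two cast slots always exist
    directors = subject.get('directors', [])
    return {
        'film_name': subject['title'],
        'film_director': directors[0]['name'] if directors else None,
        'film_artist1': names[0],
        'film_artist2': names[1],
        'film_year': subject['year'],
        'film_rate': subject['rating']['average'],
        'film_id': subject['id'],
    }


film = extract_film(sample)
print(film['film_name'], film['film_year'], film['film_rate'])
```

Padding the cast list means an entry with fewer than two cast members still produces a row instead of being skipped by an IndexError.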

To keep things simple I used an SQLite database. Installing the SQLite Manager add-on in Firefox is enough to browse it through a graphical interface.
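
The stored rows can also be checked from Python itself with the standard sqlite3 module. Below is a minimal Python 3 sketch, assuming the Top250 table layout used by the script. So that it runs on its own, it works on an in-memory database holding a single sample row taken from the JSON above; connecting to DoubanApi.sqlite instead would query the real data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # swap in 'DoubanApi.sqlite' for the real data
cur = conn.cursor()
cur.execute('''CREATE TABLE Top250 (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    film_name TEXT UNIQUE,
    film_director TEXT,
    film_artist1 TEXT,
    film_artist2 TEXT,
    film_year INTEGER,
    film_rate FLOAT,
    film_id INTEGER
)''')

# A single sample row, only so that the query below has something to return
cur.execute(
    'INSERT INTO Top250 (film_name, film_director, film_year, film_rate, film_id) '
    'VALUES (?, ?, ?, ?, ?)',
    ('\u672b\u8def\u72c2\u82b1', '\u96f7\u5fb7\u5229\u00b7\u65af\u79d1\u7279',
     1991, 8.6, 1291992))
conn.commit()

# Highest-rated films first
rows = cur.execute(
    'SELECT film_name, film_year, film_rate FROM Top250 '
    'ORDER BY film_rate DESC LIMIT 10').fetchall()
for name, year, rate in rows:
    print(name, year, rate)
conn.close()
```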


All done. As a beginner, I instantly felt motivated to keep learning.