1. 程式人生 > >到豆瓣爬取電影信息

到豆瓣爬取電影信息

wow64 mov self. use safari 代碼 app itl ike

初學puthon爬蟲,於是自己怕了豆瓣以電影信息,直接上源碼

import re
import requests
from bs4 import BeautifulSoup
import urllib
import os

class movie:

    def __init__(self):
        self.url="https://movie.douban.com/subject/25933890/?tag=%E7%83%AD%E9%97%A8&from=gaia_video"
        self.head={
            User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
, } def getpag(self): req=requests.get(self.url,self.head) html=req.content html=html.decode(utf-8) return html def gettit(self,page): title = r<span property="v:itemreviewed">(.+?)</span> power = r<strong class="ll rating_num" property="v:average">(.+?)</strong>
tit = re.findall(title, page) powe = re.findall(power,page) tit = str(tit) print(tit, \n) print("豆瓣評分:", powe, \n) def getinfo(self,page): soup = BeautifulSoup(page, "lxml") infor = soup.find_all(div, info) for info in infor:
print(info.get_text()) def getping(self,page): soup = BeautifulSoup(page, "lxml") ping = soup.find_all(div, comment) for pin in ping: pname=pin.fin pn=pname.find_all(a).d_all(span,class_=comment-info) for pnam in pname: for p in pn: print(p.get_text()) arg=pin.find_all(p) for ar in arg: print(ar.get_text()) def start(self): page=self.getpag() self.gettit(page) self.getinfo(page) self.getping(page) movie().start()

爬取成功技術分享圖片

我利用的是BeautifulSoup設個庫,這個庫將可以將heml代碼進行按標簽進行分類整理,還可以讀取標簽屬性,詳情可以自己搜索,對於爬蟲來說非常強大

我的代碼理念理念是利用BeautifulSoup,利用for循環一層一層的往下搜索找到自己想要的數據

到豆瓣爬取電影信息