
Python Learning (23): Scraping Maoyan Movies with the requests Library

This article puts the basics covered earlier into practice: combining requests, regular expressions, and cookies to scrape the Maoyan movie ranking.

Writing a basic crawler with requests

The ranking page looks roughly like the screenshot below.

(figure 1: the Maoyan top-100 board page)

The URL is http://maoyan.com/board/4?offset=0. Viewing the page source shows the markup:

(figure 2: the page source)
Each movie's entry has the following HTML structure:

<i class="board-index board-index-3">3</i>
<a href="/films/2641" title="羅馬假日" class="image-link" data-act="boarditem-click" data-val="{movieId:2641}">
  <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
  <img data-src="http://p0.meituan.net/movie/[email protected]_220h_1e_1c" alt="羅馬假日" class="board-img" />
</a>
<div class="board-item-main">
  <div class="board-item-content">
    <div class="movie-item-info">
      <p class="name"><a href="/films/2641" title="羅馬假日" data-act="boarditem-click" data-val="{movieId:2641}">羅馬假日</a></p>
      <p class="star">
        主演:格利高裡·派克,奧黛麗·赫本,埃迪·艾伯特
      </p>

The only parts we actually need are img data-src (the poster URL), title (the movie name), star (the cast), and releasetime (the release date).

From the regex syntax introduced earlier, we can derive this pattern:

compilestr = r'<dd>.*?<i class="board-index.*?<img data-src="(.*?)@.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">.*?(.*?)</p'
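To sanity-check the pattern before hitting the site, you can run it against a minimal hand-made `<dd>` fragment (the fragment, URL, and names below are synthetic, not taken from the live page):

```python
import re

# The same pattern used in this article, split across lines for readability.
pattern = re.compile(
    r'<dd>.*?<i class="board-index.*?<img data-src="(.*?)@.*?title="(.*?)"'
    r'.*?<p class="star">(.*?)</p>.*?<p class="releasetime">.*?(.*?)</p',
    re.S)

dd = ('<dd><i class="board-index board-index-1">1</i>'
      '<img data-src="http://example.com/poster.jpg@160w_220h" class="board-img" />'
      '<p class="name"><a href="/films/1" title="Roman Holiday">Roman Holiday</a></p>'
      '<p class="star"> Starring: Gregory Peck </p>'
      '<p class="releasetime">Release: 1953-08-20</p></dd>')

rows = pattern.findall(dd)
print(rows[0])
# ('http://example.com/poster.jpg', 'Roman Holiday', ' Starring: Gregory Peck ', 'Release: 1953-08-20')
```

Note how the `@` in the pattern trims the image URL's size suffix, and the first `title="` after it belongs to the name paragraph's link.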

'.' matches any character, and in re.S mode it also matches newlines; '*' matches the preceding element zero or more times; a trailing '?' makes the quantifier non-greedy.

So '.*?' lazily matches any stretch of characters. Next, let's write code to print the entries we match.
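A quick standalone illustration of these rules, using toy strings rather than the Maoyan page:

```python
import re

html = '<p>a</p>\n<p>b</p>'

# Non-greedy '.*?' stops at the first '</p>', so each tag matches separately.
print(re.findall(r'<p>(.*?)</p>', html))       # ['a', 'b']

# Greedy '.*' with re.S crosses the newline and runs to the last '</p>'.
print(re.findall(r'<p>(.*)</p>', html, re.S))  # ['a</p>\n<p>b']
```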

import requests
import re

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

if __name__ == "__main__":
    headers = {'User-Agent': USER_AGENT}
    session = requests.Session()  # a Session carries cookies across requests
    req = session.get('http://maoyan.com/board/4?offset=0', headers=headers, timeout=5)
    # capture groups: (poster URL, title, cast line, release date)
    compilestr = r'<dd>.*?<i class="board-index.*?<img data-src="(.*?)@.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">.*?(.*?)</p'
    pattern = re.compile(compilestr, re.S)
    lists = re.findall(pattern, req.content.decode('utf-8'))
    for item in lists:
        print(item[0].strip())
        print(item[1].strip())
        print(item[2].strip())
        print(item[3].strip())
        print('\n')

Run it; the result looks like this:

(figure 3: the scraped entries printed to the console)
We're getting data, but only for this one page. Next, look at the pattern for pages two and three: clicking page 2 changes the URL to http://maoyan.com/board/4?offset=10, and clicking page 3 changes it to http://maoyan.com/board/4?offset=20. So the offset grows by 10 per page, and we can compute it to fetch any page. With a small change, the program can crawl n pages:
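The offset arithmetic can be sketched as a small helper (the function name here is illustrative):

```python
# Page i (0-based) lives at offset i * 10 on the Maoyan board.
def page_urls(pages):
    return ['http://maoyan.com/board/4?offset=' + str(i * 10) for i in range(pages)]

print(page_urls(3))
# ['http://maoyan.com/board/4?offset=0',
#  'http://maoyan.com/board/4?offset=10',
#  'http://maoyan.com/board/4?offset=20']
```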

import requests
import re
import time

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

class MaoYanScrapy(object):
    def __init__(self, pages=1):
        self.m_session = requests.Session()
        self.m_headers = {'User-Agent': USER_AGENT}
        self.m_compilestr = r'<dd>.*?<i class="board-index.*?<img data-src="(.*?)@.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">.*?(.*?)</p'
        self.m_pattern = re.compile(self.m_compilestr, re.S)
        self.m_pages = pages

    def getPageData(self):
        try:
            for i in range(self.m_pages):
                # each page advances the offset by 10
                httpstr = 'http://maoyan.com/board/4?offset=' + str(i * 10)
                req = self.m_session.get(httpstr, headers=self.m_headers, timeout=5)
                lists = re.findall(self.m_pattern, req.content.decode('utf-8'))
                time.sleep(1)  # pause between requests to avoid hammering the site
                for item in lists:
                    img = item[0]       # poster URL
                    print(img.strip() + '\n')
                    name = item[1]      # movie title
                    print(name.strip() + '\n')
                    actor = item[2]     # cast line
                    print(actor.strip() + '\n')
                    filmtime = item[3]  # release date
                    print(filmtime.strip() + '\n')
        except Exception as e:
            print('get error:', e)

if __name__ == "__main__":
    maoyanscrapy = MaoYanScrapy()
    maoyanscrapy.getPageData()

Run it: the output is the same as before, but the number of pages is now a parameter.

Let's finish by also downloading and saving each movie's poster. This exercises more of the basics: creating directories, joining paths, and writing files.
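Those file-handling steps can be tried in isolation first. This sketch writes into a temporary directory instead of the script's own folder, and the movie name is made up:

```python
import os
import tempfile

base = tempfile.mkdtemp()                       # stand-in for the script's directory
movie_dir = os.path.join(base, 'RomanHoliday')  # one folder per movie
if not os.path.exists(movie_dir):
    os.makedirs(movie_dir)

info_path = os.path.join(movie_dir, 'RomanHoliday.txt')
if os.path.exists(info_path):                   # overwrite the info file on re-runs
    os.remove(info_path)
with open(info_path, 'w') as f:
    f.write('title: RomanHoliday\n')

print(os.path.exists(info_path))  # True
```

The real script below follows the same pattern, using the directory of the script itself as the base path.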

import requests
import re
import time
import os

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

class MaoYanScrapy(object):
    def __init__(self, pages=1):
        self.m_session = requests.Session()
        self.m_headers = {'User-Agent': USER_AGENT}
        self.m_compilestr = r'<dd>.*?<i class="board-index.*?<img data-src="(.*?)@.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">.*?(.*?)</p'
        self.m_pattern = re.compile(self.m_compilestr, re.S)
        self.m_pages = pages
        # directory this script lives in; movie folders are created beneath it
        self.dirpath = os.path.split(os.path.abspath(__file__))[0]

    def getPageData(self):
        try:
            for i in range(self.m_pages):
                # each page advances the offset by 10
                httpstr = 'http://maoyan.com/board/4?offset=' + str(i * 10)
                req = self.m_session.get(httpstr, headers=self.m_headers, timeout=5)
                lists = re.findall(self.m_pattern, req.content.decode('utf-8'))
                time.sleep(1)
                for item in lists:
                    img = item[0]
                    print(img.strip() + '\n')
                    name = item[1]
                    # one folder per movie
                    dirpath = os.path.join(self.dirpath, name)
                    if not os.path.exists(dirpath):
                        os.makedirs(dirpath)
                    print(name.strip() + '\n')
                    actor = item[2]
                    print(actor.strip() + '\n')
                    filmtime = item[3]
                    print(filmtime.strip() + '\n')
                    # write the text fields into <name>/<name>.txt
                    txtname = os.path.join(dirpath, name + '.txt')
                    if os.path.exists(txtname):
                        os.remove(txtname)
                    with open(txtname, 'w') as f:
                        f.write(img.strip() + '\n')
                        f.write(name.strip() + '\n')
                        f.write(actor.strip() + '\n')
                        f.write(filmtime.strip() + '\n')
                    # download the poster, keeping its original file extension
                    picname = os.path.join(dirpath, name + '.' + img.split('.')[-1])
                    if os.path.exists(picname):
                        os.remove(picname)
                    req = self.m_session.get(img, headers=self.m_headers, timeout=5)
                    time.sleep(1)
                    with open(picname, 'wb') as f:
                        f.write(req.content)
        except Exception as e:
            print('get error:', e)

if __name__ == "__main__":
    maoyanscrapy = MaoYanScrapy()
    maoyanscrapy.getPageData()

Run it, and several new folders appear in the script's directory.

(figure 4: the generated movie folders)

Open one, and inside are the saved poster and info file.

(figure 5: a folder's contents)
And that's it: a working crawler built by combining regular expressions with requests.
Thanks for following my official account.