
Scraping the Maoyan Movie Rankings with XPath

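I have been learning XPath recently. While looking for material online, I came across a project that beginners often use for practice: scraping the top 100 entries of the Maoyan movie board. Most of the write-ups closely resemble the version by Cui Qingcai (崔慶才) and are basically copied from it. So I wrote my own program with XPath. It scrapes the same Maoyan board and collects the same information, but offers an alternative solution.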

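Honestly, for matching information in web pages I still recommend XPath. Regular expressions can do the job too, but the patterns get verbose, and one small slip means nothing matches. Beginners in particular, who are not yet familiar with regular expressions, often cannot even find the mistake, which can easily put them off. I mostly use regular expressions for processing files, where they work wonders.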
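To make the comparison concrete, here is a small sketch that pulls the same movie title out of an HTML fragment twice, once with XPath and once with a regular expression. The fragment is a hand-written approximation of one entry on the board, not a copy of the real page, and the regex is just one possible pattern.

```python
import re
from lxml import etree

# Hand-written fragment imitating one <dd> entry on the board page (an assumption).
SAMPLE = """
<dl>
  <dd>
    <i class="board-index">1</i>
    <p class="name"><a href="#" title="霸王别姬">霸王别姬</a></p>
  </dd>
</dl>
"""

# XPath: describe where the node sits in the document tree.
tree = etree.HTML(SAMPLE)
names_xpath = tree.xpath("//p[@class='name']/a/text()")

# Regex: describe the raw characters around the text, attributes included.
pattern = re.compile(r'<p class="name"><a[^>]*>(.*?)</a>', re.S)
names_regex = pattern.findall(SAMPLE)

print(names_xpath)
print(names_regex)
```

Both approaches find the title here, but the regex depends on the exact spelling of the markup (attribute order, spacing between tags), while the XPath expression depends only on the tree structure.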

Here is the full program.

import requests
from requests.exceptions import RequestException
from lxml import etree
import csv
import re


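def get_page(url):
    """
    Fetch the HTML source of a page.
    :param url: URL of the page to download
    :return: the response body as text, or None if the request fails
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    }
    try:
        # The timeout keeps the script from hanging on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None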


def parse_page(text):
    """
    Parse the page source and extract the movie fields.
    :param text: HTML source of one board page
    :return: iterator of (ranking, name, actors, release time, score) tuples
    """
    html = etree.HTML(text)
    movie_name = html.xpath("//p[@class='name']/a/text()")
    actor = html.xpath("//p[@class='star']/text()")
    # Strip the whitespace and newlines around the actor string
    actor = [re.sub(r'\s+', '', item) for item in actor]
    time = html.xpath("//p[@class='releasetime']/text()")
    # The score is split across two <i> tags: integer part and fraction part
    grade1 = html.xpath("//p[@class='score']/i[@class='integer']/text()")
    grade2 = html.xpath("//p[@class='score']/i[@class='fraction']/text()")
    score = [a + b for a, b in zip(grade1, grade2)]
    # The ranking number is the <i> that is a direct child of each <dd>
    ranking = html.xpath("//dd/i/text()")
    return zip(ranking, movie_name, actor, time, score)


def change_page(number):
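    """
    Build the URL of a board page.
    :param number: offset of the first entry on the page (0, 10, ..., 90)
    :return: URL of that page
    """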
    base_url = 'https://maoyan.com/board/4'
    url = base_url + '?offset=%s' % number
    return url


def save_to_csv(result, filename):
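    """
    Append one row to the CSV file.
    :param result: one row of fields (a tuple)
    :param filename: path of the output CSV file
    :return:
    """
    # newline='' avoids blank lines between rows on Windows;
    # utf-8 keeps the Chinese titles intact
    with open(filename, 'a', newline='', encoding='utf-8') as csvfile: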
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(result)


def main():
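    """
    Main function: walk through the ten pages and save every row.
    :return:
    """
    for i in range(0, 100, 10):
        url = change_page(i)
        text = get_page(url)
        if text is None:
            # Request failed or returned a non-200 status; skip this page
            continue
        result = parse_page(text)
        for j in result:
            save_to_csv(j, filename='message.csv')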


if __name__ == '__main__':
    main()
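
One possible refinement of the saving step, shown only as a sketch: open the CSV file once, write a header row, and then write all rows in one call, instead of reopening the file for every movie. The helper name `save_all_to_csv`, the column names, and the sample rows are illustrative assumptions and are not part of the program above.

```python
import csv
import os
import tempfile

# Column names are an assumption; they mirror the order returned by parse_page.
HEADER = ['ranking', 'name', 'actors', 'release_time', 'score']


def save_all_to_csv(rows, filename):
    """Write a header plus all rows in one pass (hypothetical helper)."""
    # newline='' lets the csv module control line endings itself
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(HEADER)
        writer.writerows(rows)


# Made-up rows standing in for the tuples produced by parse_page.
sample_rows = [
    ('1', 'Movie A', 'Starring: X,Y', 'Release: 1993-01-01', '9.5'),
    ('2', 'Movie B', 'Starring: Z', 'Release: 1994-09-10', '9.4'),
]

# Write to a temporary directory so the example does not touch message.csv
path = os.path.join(tempfile.mkdtemp(), 'message.csv')
save_all_to_csv(sample_rows, path)

# Read the file back to check what was written
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))
print(read_back)
```

Opening in `'w'` mode also means that rerunning the script replaces the old file rather than appending duplicate rows, which may or may not be what you want.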
