參考連結:https://blog.csdn.net/BF02jgtRS00XKtCx/article/details/83663400

因貓眼網站有些更新,參考連結中的部分程式碼執行報錯,特修改一下

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import re
from multiprocessing.pool import Pool

import requests
from bs4 import BeautifulSoup
from lxml import etree
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one board page and return its HTML text.

    Args:
        url: Full URL of a Maoyan board page.

    Returns:
        The response body as text on HTTP 200, otherwise ``None``
        (including on any network error).
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'
    }
    try:
        # A timeout keeps a stalled connection from hanging the pool
        # worker forever (the original call had none).
        response = requests.get(url, headers=headers, timeout=10)
    except RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None


# 獲取封面大圖
# Get the full-size cover image URL
def get_thumb(url):
    """Strip Maoyan's thumbnail-size suffix from a poster URL.

    Thumbnails look like ``.../[email protected]_220h_1e_1c``; everything from
    the first ``@`` onward is a resize directive, and removing it yields
    the full-size image.

    Args:
        url: Poster image URL, possibly carrying an ``@...`` suffix.

    Returns:
        The URL up to (excluding) the first ``@``. A URL without ``@``
        is returned unchanged — the original ``re.search(...).group(1)``
        raised AttributeError on that case.
    """
    base, _sep, _suffix = url.partition('@')
    return base
    # http://p0.meituan.net/movie/[email protected]_220h_1e_1c
    # 去掉@160w_220h_1e_1c就是大圖


# 提取上映時間函式
# Extract the release date from the "releasetime" text
def get_release_time(data):
    """Return the text before the first ``(`` (or the whole string).

    Args:
        data: Release string such as ``'2018-12-14(USA)'`` or a bare
            date with no parenthesised region.

    Returns:
        The date portion; ``'未知'`` ("unknown") if nothing matched.
    """
    # Lazy group stops at either an opening paren or end-of-string.
    match = re.search(r'(.*?)(\(|$)', data)
    return '未知' if match is None else match.group(1)


# 提取國家/地區函式
# Extract the country/region from the "releasetime" text
def get_release_area(data):
    """Return the region inside the parentheses of a release string.

    Args:
        data: Release string such as ``'2018-12-14(USA)'``.

    Returns:
        The parenthesised region; ``'未知'`` ("unknown") when the
        string carries no parentheses.
    """
    # Greedy leading .* anchors the match on the LAST '(' in the string.
    match = re.search(r'.*\((.*)\)', data)
    return match.group(1) if match is not None else '未知'


# 使用正則表示式的寫法
def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)  # re.S表示匹配任意字元,如果不加,則無法匹配換行符
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'thumb': get_thumb(item[1]),  # 定義get_thumb()方法進一步處理網址
            'name': item[2],
            'star': item[3].strip()[3:],
            # 'time': item[4].strip()[5:],
            # 用一個方法分別提取time裡的日期和地區
            'time': get_release_time(item[4].strip()[5:]),
            'area': get_release_area(item[4].strip()[5:]),
            'score': item[5].strip() + item[6].strip()
            # 評分score由整數+小數兩部分組成
        }


# lxml結合xpath提取
# Variant 2: parse with lxml + XPath
def parse_one_page2(html):
    """Parse one board page with lxml/XPath; yield one dict per movie.

    Yields the same schema as the other parse_one_page* variants
    (index, thumb, name, star, time, area, score), so the rows are
    accepted by write_to_file3().
    """
    doc = etree.HTML(html)
    for item in doc.xpath('/html/body/div[4]//div//dd'):
        # The release line feeds both the date and the region extractors;
        # [5:] drops the 5-character label prefix.
        release = item.xpath('.//p[@class="releasetime"]/text()')[0].strip()[5:]
        yield {
            'index': item.xpath('./i/text()')[0],
            'thumb': get_thumb(str(item.xpath('./a/img[2]/@data-src')[0].strip())),
            'name': item.xpath('./div/div/div[1]/p[1]/a/@title')[0],
            'star': item.xpath('.//p[@class="star"]/text()')[0].strip()[3:],
            # Was misspelled 'realease_time', which broke the CSV schema
            # expected by write_to_file3 (its fieldnames use 'time').
            'time': get_release_time(release),
            'area': get_release_area(release),
            # The score markup splits the rating into integer + fraction.
            'score': item.xpath('./div/div/div[2]/p/i[1]/text()')[0]
                     + item.xpath('./div/div/div[2]/p/i[2]/text()')[0],
        }

# 使用BeautifulSoup結合css選擇器
# Variant 3: parse with BeautifulSoup CSS selectors
def parse_one_page3(html):
    """Parse one board page via CSS selectors; yield one dict per movie.

    Assumes the board page always lists exactly 10 movies.
    """
    soup = BeautifulSoup(html, 'lxml')
    for i in range(10):
        # The release line feeds both the date and the region extractors;
        # [5:] drops the 5-character label prefix.
        release = soup.select('.releasetime')[i].string.strip()[5:]
        yield {
            'index': soup.select('i.board-index')[i].string,
            'thumb': get_thumb(soup.select('.board-img')[i]['data-src']),
            'name': soup.select('.name a')[i].string,
            'star': soup.select('.star')[i].string.strip()[3:],
            'time': get_release_time(release),
            'area': get_release_area(release),
            # The score markup splits the rating into integer + fraction.
            'score': soup.select('.integer')[i].string + soup.select('.fraction')[i].string,
        }

# Beautiful Soup + find_all函式提取
def parse_one_page4(html):
    soup = BeautifulSoup(html, 'lxml')
    items = range(10)
    for item in items:
        yield {
            'index': soup.find_all(class_='board-index')[item].string,
            'thumb': get_thumb(soup.find_all(class_='board-img')[item].attrs['data-src']),
            'name': soup.find_all(name='p', attrs={'class': 'name'})[item].string,
            'star': soup.find_all(name='p', attrs={'class': 'star'})[item].string.strip()[3:],
            'time': get_release_time(soup.find_all(class_='releasetime')[item].string.strip()[5:]),
            'area': get_release_area(soup.find_all(class_='releasetime')[item].string.strip()[5:]),
            'score': soup.find_all(name='i', attrs={'class': 'integer'})[item].string +
                     soup.find_all(name='i', attrs={'class': 'fraction'})[item].string
        }


# 資料儲存到csv
def write_to_file3(item):
    with open('貓眼top100.csv', 'a', encoding='utf_8_sig', newline='') as f:
        # 'a'為追加模式(新增)
        # utf_8_sig格式匯出csv不亂碼
        fieldnames = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']
        w = csv.DictWriter(f, fieldnames=fieldnames)
        # w.writeheader()
        w.writerow(item)


# 下載封面圖片
def download_thumb(name, url, num):
    try:
        response = requests.get(url)
        with open('封面圖/' + name + '.jpg', 'wb') as f:
            f.write(response.content)
            print('第%s部電影封面下載完畢' % num)
            print('------')
    except RequestException as e:
        print(e)
        pass
    # 不能是w,否則會報錯,因為圖片是二進位制資料所以要用wb


def main(offset):
    """Scrape one page of the top-100 board and persist its movies.

    Fetches the page at the given paging offset, parses it with the
    find_all variant, then writes each movie to CSV and downloads its
    cover image.

    Args:
        offset: Paging offset (0, 10, 20, ...) for the board URL.
    """
    page_url = 'http://maoyan.com/board/4?offset=' + str(offset)
    page_html = get_one_page(page_url)
    for movie in parse_one_page4(page_html):
        write_to_file3(movie)
        download_thumb(movie['name'], movie['thumb'], movie['index'])


if __name__ == '__main__':
    # One page of 10 movies per task; offsets 0, 10, ..., 90 cover the
    # whole top-100 board across the process pool.
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])
    # Close and join so the interpreter exits cleanly; the original
    # also ended with a stray '.' that was a syntax error.
    pool.close()
    pool.join()