Scraping NetEase Cloud Classroom Lesson Titles and Durations with Python + Selenium

If you repost this article, please credit the source: https://blog.csdn.net/jpch89/article/details/84142555


Software installation


Target page

https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1

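  • I first fetched the page with a plain HTTP request, but the returned source contains none of the lesson information, which means the page loads its content dynamically with JavaScript.
  • The browser's developer tools show that the page requests the following address to get the lesson details:
    https://study.163.com/dwr/call/plaincall/PlanNewBean.getPlanCourseDetail.dwr?1542346982156
  • In the Preview pane, the lesson information appears as Unicode escape sequences.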
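  • Requesting that address directly returns an error, as expected. Rather than work out exactly which request headers and parameters it needs, I went straight to Selenium: only one page has to be scraped, so performance is not a concern.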
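
The "Unicode" text seen in the Preview pane is most likely ordinary `\uXXXX` escaping of the Chinese lesson names. As a minimal sketch, assuming the payload uses escapes of that form (the sample string below is made up, not copied from the real response), such text can be turned back into readable characters like this:

```python
def decode_unicode_escapes(escaped: str) -> str:
    """Turn literal \\uXXXX sequences into the characters they stand for."""
    # The escaped text is pure ASCII, so encode it to bytes first and let
    # the built-in 'unicode_escape' codec interpret the backslash sequences.
    return escaped.encode('ascii').decode('unicode_escape')


# Hypothetical sample: two escaped Chinese characters.
sample = r'\u8bfe\u65f6'
decoded = decode_unicode_escapes(sample)
print(decoded)
```

This only helps if the response can be fetched at all, which is why the rest of the post lets a real browser do the request instead.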

Code

Notes

  • study163seleniumff.py is the main script
  • helper.py is a helper module that lives in the same directory as the main script
  • geckodriver.exe must be placed under the relative path ../drivers/
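
A relative path such as ../drivers/ is resolved against the current working directory, so the driver is only found when the script is launched from its own folder. A minimal sketch, assuming you would rather anchor the path to the script's location; the helper name `geckodriver_path` is hypothetical, and the `.exe` suffix is only added on Windows:

```python
import sys
from pathlib import Path


def geckodriver_path(script_file: str, platform: str = sys.platform) -> Path:
    """Build <script dir>/../drivers/geckodriver(.exe) from a script path."""
    name = 'geckodriver.exe' if platform.startswith('win') else 'geckodriver'
    # .parent is the script's folder; its parent holds the drivers/ directory.
    return Path(script_file).resolve().parent.parent / 'drivers' / name


# In the main script this would be called as geckodriver_path(__file__).
# The path below is a made-up example location.
example = geckodriver_path('/project/src/study163seleniumff.py', platform='linux')
print(example)
```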

study163seleniumff.py

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from lxml import etree
import csv
from helper import Chapter, Lesson

# Fetch the data
url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'
options = Options()
options.add_argument('-headless')  # run Firefox without a visible window
# Selenium 3 style keyword arguments, as used when this was written (2018);
# Selenium 4 passes options=... and a Service object instead.
driver = Firefox(
    executable_path='../drivers/geckodriver',
    firefox_options=options)
driver.get(url)
text = driver.page_source
driver.quit()

# Parse the data
html = etree.HTML(text)
chapters = html.xpath('//div[@class="chapter"]')
# Column headers: chapter no., chapter name, lesson no., lesson name, lesson length
TABLEHEAD = ['章節號', '章節名', '課時號', '課時名', '課時長']
rows = []
for chapter_el in chapters:
    chapter = Chapter(chapter_el)
    lessons = chapter.get_lessons()
    for lesson_el in lessons:
        lesson = Lesson(lesson_el)
        chapter_info = chapter.chapter_info
        lesson_info = lesson.lesson_info
        # Join both tuples and pair each value with its column header
        values = (*chapter_info, *lesson_info)
        row = dict(zip(TABLEHEAD, values))
        rows.append(row)

# Save the data
with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerows(rows)
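
Each CSV row in the script is built by joining the chapter tuple and the lesson tuple and zipping the result with the header list. A small sketch of just that step, with made-up values (the English headers and the sample tuples are placeholders, not data from the course page):

```python
import csv
import io

TABLEHEAD = ['chapter_no', 'chapter_name', 'lesson_no', 'lesson_name', 'lesson_len']

# Placeholder tuples standing in for chapter.chapter_info / lesson.lesson_info.
chapter_info = ('Chapter 1', 'Getting started')
lesson_info = ('Lesson 1', 'Introduction', '05:30')

# Unpack both tuples into one 5-tuple, then pair each value with its header.
values = (*chapter_info, *lesson_info)
row = dict(zip(TABLEHEAD, values))

# DictWriter emits the columns in the order given by TABLEHEAD.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, TABLEHEAD)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

One thing to keep in mind with this pattern: `zip` stops at the shorter sequence, so a parse that returned fewer values would leave the trailing columns empty instead of raising an error.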

helper.py

class Chapter:
    def __init__(self, chapter):
        self.chapter = chapter
        self._chapter_info = None

    def parse_all(self):
        # Chapter number
        chapter_num = self.chapter.xpath(
            './/span[contains(@class, "chaptertitle")]/text()')[0]
        # Drop the trailing colon from the chapter number
        chapter_num = chapter_num[:-1]
        # Chapter name
        chapter_name = self.chapter.xpath(
            './/span[contains(@class, "chaptername")]/text()')[0]
        return chapter_num, chapter_name

    @property
    def chapter_info(self):
        # Parse once and reuse the cached tuple on later accesses
        if self._chapter_info is None:
            self._chapter_info = self.parse_all()
        return self._chapter_info
    
    def get_lessons(self):
        return self.chapter.xpath(
            './/div[@data-lesson]')


class Lesson:
    def __init__(self, lesson):
        self.lesson = lesson
        self._lesson_info = None

    @property
    def lesson_info(self):
        # Lesson number
        lesson_num = self.lesson.xpath(
            './/span[contains(@class, "ks")]/text()')[0]
        # Lesson name
        lesson_name = self.lesson.xpath(
            './/span[@title]/@title')[0]
        # Lesson length
        lesson_len = self.lesson.xpath(
            './/span[contains(@class, "kstime")]/text()')[0]
        self._lesson_info = lesson_num, lesson_name, lesson_len
        return self._lesson_info
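
One detail worth knowing about the XPath above: `contains(@class, "ks")` is a substring test, so it also matches an element whose class is `kstime`, and the `[0]` index then picks whichever match comes first in document order. The plain-Python sketch below only illustrates the difference between substring and whole-token matching; the class strings are hypothetical, and the expression in `TOKEN_XPATH` is the commonly used whole-token idiom, not something taken from the original code:

```python
def substring_match(class_attr: str, needle: str) -> bool:
    """Mimic XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def token_match(class_attr: str, token: str) -> bool:
    """Match only when `token` is one of the space-separated class names."""
    return token in class_attr.split()


# Whole-token test written in XPath 1.0 (shown for reference, not executed here).
TOKEN_XPATH = './/span[contains(concat(" ", normalize-space(@class), " "), " ks ")]'

for class_attr in ('ks', 'kstime', 'f-fl ks'):  # hypothetical class values
    print(class_attr, substring_match(class_attr, 'ks'), token_match(class_attr, 'ks'))
```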


Final result

The result is saved as courseinfo.csv, in the same directory as the main script.
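
The file is written with the `utf-8-sig` encoding, which prepends a byte order mark so that spreadsheet programs such as Excel recognize the file as UTF-8. A minimal sketch of a round trip, using a temporary file and a placeholder row rather than the real courseinfo.csv:

```python
import codecs
import csv
import os
import tempfile

header = ['lesson_no', 'lesson_name']  # placeholder columns
rows = [{'lesson_no': 'Lesson 1', 'lesson_name': 'Introduction'}]

path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, header)
    writer.writeheader()
    writer.writerows(rows)

# The raw bytes start with the UTF-8 byte order mark.
with open(path, 'rb') as f:
    starts_with_bom = f.read(3) == codecs.BOM_UTF8

# Reading with utf-8-sig strips the mark again, so the first header stays clean.
with open(path, encoding='utf-8-sig', newline='') as f:
    read_back = list(csv.DictReader(f))

print(starts_with_bom, read_back)
```

Reading the same file with plain `utf-8` would leave the mark glued to the first column name, which is an easy mistake to make when post-processing the CSV.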


Completed on 2018.11.16