Python + Selenium: Scraping Lesson Titles and Durations from NetEase Cloud Classroom
Posted by 阿新 • Published: 2019-01-01
Please credit the source when reposting: https://blog.csdn.net/jpch89/article/details/84142555
Software installation
selenium
pip install selenium
geckodriver
https://github.com/mozilla/geckodriver/releases/
Target page
https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1
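- A plain HTTP request for the page returned source code with no lesson information anywhere in it, which means the page loads its content dynamically with JavaScript.
- The browser's developer tools show that the page requests the following address to fetch the lesson details: https://study.163.com/dwr/call/plaincall/PlanNewBean.getPlanCourseDetail.dwr?1542346982156
- In the Preview pane, the information for each lesson appears in Unicode-encoded form.
- Requesting that address directly fails, unsurprisingly. Rather than work out exactly which request headers and parameters it expects, I went straight to Selenium. Only one page needs scraping, so performance is not a concern.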
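The "Unicode-encoded" text seen in the Preview pane is presumably written as `\uXXXX` escape sequences. A minimal sketch of turning such escapes back into readable characters with the standard library follows. The sample string is made up for illustration and is not taken from the actual response, and the function only handles four-digit escapes, not surrogate pairs.

```python
import re


def decode_unicode_escapes(text):
    """Replace every \\uXXXX escape in text with the character it encodes."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


# Hypothetical fragment in the style of an escaped response body.
sample = r'name="\u8bfe\u65f6"'
print(decode_unicode_escapes(sample))
```

Text without any escapes passes through unchanged, so the function is safe to apply to a whole response body.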
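Code
Notes
- study163seleniumff.py is the main script
- helper.py is a helper module, located in the same directory as the main script
- geckodriver.exe must be placed under the relative path ../drivers/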
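A missing driver binary is an easy mistake with this layout, so it can help to check for it before launching Firefox. The sketch below is one possible check, not part of the original scripts; the function name `find_geckodriver` is hypothetical, and it assumes the `drivers` directory sits next to the directory that holds the main script.

```python
from pathlib import Path


def find_geckodriver(script_dir):
    """Return the geckodriver path under <script_dir>/../drivers/, or None."""
    drivers_dir = Path(script_dir).resolve().parent / 'drivers'
    # On Windows the binary is geckodriver.exe; elsewhere it has no extension.
    for name in ('geckodriver.exe', 'geckodriver'):
        candidate = drivers_dir / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == '__main__':
    driver_path = find_geckodriver(Path.cwd())
    print(driver_path or 'geckodriver not found under ../drivers/')
```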
study163seleniumff.py
import csv

from lxml import etree
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

from helper import Chapter, Lesson

# Fetch the data
url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'
options = Options()
options.add_argument('-headless')  # run Firefox without a window
# Selenium 3 style call (current when this was written in 2018);
# recent Selenium 4 releases no longer accept these two keyword arguments.
driver = Firefox(
    executable_path='../drivers/geckodriver',
    firefox_options=options)
driver.get(url)
text = driver.page_source
driver.quit()

# Parse the data
html = etree.HTML(text)
chapters = html.xpath('//div[@class="chapter"]')
# CSV column headers: chapter number, chapter name,
# lesson number, lesson name, lesson duration
TABLEHEAD = ['章節號', '章節名', '課時號', '課時名', '課時長']
rows = []
for chapter_el in chapters:
    chapter = Chapter(chapter_el)
    chapter_info = chapter.chapter_info
    for lesson_el in chapter.get_lessons():
        lesson = Lesson(lesson_el)
        lesson_info = lesson.lesson_info
        # One row per lesson: chapter fields followed by lesson fields
        values = (*chapter_info, *lesson_info)
        row = dict(zip(TABLEHEAD, values))
        rows.append(row)

# Save the data
with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerows(rows)
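The script above targets the Selenium 3 API that was current in 2018. Later Selenium 4 releases no longer accept the `executable_path` and `firefox_options` keyword arguments; the driver path is passed through a `Service` object instead. What follows is a sketch of the same fetch step under Selenium 4, not the author's code. The function name `fetch_rendered_html` is hypothetical, the explicit wait is an added assumption to avoid reading `page_source` before the JavaScript-rendered chapter list exists, and the 15-second timeout is arbitrary.

```python
def fetch_rendered_html(url, driver_path='../drivers/geckodriver', timeout=15):
    """Load url in headless Firefox and return the page source once
    at least one chapter element is present in the DOM."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium.webdriver import Firefox
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument('-headless')
    driver = Firefox(
        service=Service(executable_path=driver_path),
        options=options)
    try:
        driver.get(url)
        # Block until the chapter list has been rendered, or raise
        # TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.XPATH, '//div[@class="chapter"]')))
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

With this in place, `text = fetch_rendered_html(url)` would stand in for the driver setup, `get`, `page_source`, and `quit` lines of the main script.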
helper.py
class Chapter:
    def __init__(self, chapter):
        self.chapter = chapter
        self._chapter_info = None

    def parse_all(self):
        # Chapter number
        chapter_num = self.chapter.xpath(
            './/span[contains(@class, "chaptertitle")]/text()')[0]
        # Strip the trailing colon from the chapter number
        chapter_num = chapter_num[:-1]
        # Chapter name
        chapter_name = self.chapter.xpath(
            './/span[contains(@class, "chaptername")]/text()')[0]
        return chapter_num, chapter_name

    @property
    def chapter_info(self):
        # Parse on first access and reuse the cached result afterwards
        if self._chapter_info is None:
            self._chapter_info = self.parse_all()
        return self._chapter_info

    def get_lessons(self):
        return self.chapter.xpath(
            './/div[@data-lesson]')


class Lesson:
    def __init__(self, lesson):
        self.lesson = lesson
        self._lesson_info = None

    @property
    def lesson_info(self):
        # Lesson number
        lesson_num = self.lesson.xpath(
            './/span[contains(@class, "ks")]/text()')[0]
        # Lesson name
        lesson_name = self.lesson.xpath(
            './/span[@title]/@title')[0]
        # Lesson duration
        lesson_len = self.lesson.xpath(
            './/span[contains(@class, "kstime")]/text()')[0]
        self._lesson_info = lesson_num, lesson_name, lesson_len
        return self._lesson_info
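To see what these XPath expressions pick out without launching a browser, they can be run against a small hand-written fragment. The markup below is hypothetical: it is shaped to match the selectors used in helper.py, not copied from the real page, and the chapter and lesson values are placeholders.

```python
from lxml import etree

# Hypothetical markup shaped to match the selectors in helper.py.
SAMPLE = '''
<div class="chapter">
  <span class="chaptertitle">Chapter 1:</span>
  <span class="chaptername">Getting started</span>
  <div data-lesson="1">
    <span class="ks">Lesson 1</span>
    <span title="Welcome">Welcome</span>
    <span class="kstime">05:30</span>
  </div>
  <div data-lesson="2">
    <span class="ks">Lesson 2</span>
    <span title="Setup">Setup</span>
    <span class="kstime">12:04</span>
  </div>
</div>
'''

html = etree.HTML(SAMPLE)
rows = []
for chapter in html.xpath('//div[@class="chapter"]'):
    # [:-1] drops the trailing colon, as in Chapter.parse_all
    chapter_num = chapter.xpath(
        './/span[contains(@class, "chaptertitle")]/text()')[0][:-1]
    chapter_name = chapter.xpath(
        './/span[contains(@class, "chaptername")]/text()')[0]
    for lesson in chapter.xpath('.//div[@data-lesson]'):
        lesson_num = lesson.xpath(
            './/span[contains(@class, "ks")]/text()')[0]
        lesson_name = lesson.xpath('.//span[@title]/@title')[0]
        lesson_len = lesson.xpath(
            './/span[contains(@class, "kstime")]/text()')[0]
        rows.append(
            (chapter_num, chapter_name, lesson_num, lesson_name, lesson_len))

for row in rows:
    print(row)
```

One detail worth knowing: `contains(@class, "ks")` is a substring test, so it also matches the `kstime` span. The lesson number is still correct here only because `[0]` takes the first match in document order, and in this fragment the `ks` span comes first.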
Final result
The final result is saved as courseinfo.csv, in the same directory as the main script.
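Because the file is opened with `utf-8-sig`, it starts with a byte order mark, which is commonly what lets spreadsheet programs such as Excel recognise the file as UTF-8. The sketch below walks through the write and read-back round trip on a temporary file; the column names and the single row are placeholder values, not real course data.

```python
import csv
import os
import tempfile

# Placeholder header and row, standing in for TABLEHEAD and the scraped rows.
HEAD = ['chapter_num', 'chapter_name', 'lesson_num', 'lesson_name', 'lesson_len']
values = ('Chapter 1', 'Intro', 'Lesson 1', 'Welcome', '05:30')
rows = [dict(zip(HEAD, values))]

path = os.path.join(tempfile.mkdtemp(), 'courseinfo.csv')

# Write exactly as the main script does.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, HEAD)
    writer.writeheader()
    writer.writerows(rows)

# The first three bytes of the file are the UTF-8 byte order mark.
with open(path, 'rb') as f:
    has_bom = f.read(3) == b'\xef\xbb\xbf'

# Reading with utf-8-sig strips the mark again, so the header stays clean.
with open(path, encoding='utf-8-sig', newline='') as f:
    read_back = list(csv.DictReader(f))

print(has_bom, read_back == rows)
```

Reading the same file with plain `utf-8` would leave the mark glued to the first column name, which is why the reader uses `utf-8-sig` as well.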
Completed on 2018.11.16