
A First Attempt at a Python Crawler for Downloading PDFs

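I recently finished studying Boyd's Convex Optimization and came away deeply impressed by him. At the end of his lecture slides I discovered that there is a follow-up course, Stanford's ee364b; the Convex Optimization book only covers ee364a. Since ee364b does not offer its complete slides as a single download, I wrote a crawler to fetch and save the slides one by one. The material in ee364b is more specialized and goes deeper, and I suspect it is rarely needed in practice. The crawler code is posted below. Fortunately the course page has a simple structure, so there is not much code. Some of the downloaded files turned out to be blank; checking the website confirmed that the course really did leave them empty, so I left it at that.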

import requests
import re
import os
from bs4 import BeautifulSoup

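def GetPage(url):
    # Fetch the page and return its HTML as text.
    page = requests.get(url, timeout=30)
    page.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return page.text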

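def GetList(html):
    # Collect the href of every link under "lectures/" that points to a PDF.
    soup = BeautifulSoup(html, "html5lib")  # needs the html5lib package installed
    links = soup.find_all(href=re.compile("lectures/"))  # not named `list`, to avoid shadowing the built-in
    pdfs = []
    for link in links:
        href = link.get('href')
        if href.endswith(".pdf"):
            pdfs.append(href)
    return pdfs
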
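def DownloadPdf(pdf, root_url):
    # Save root_url + pdf to disk and return the URL that was fetched.
    # pdf[9:] drops the leading "lectures/" (9 characters) to get the file name.
    path = "C:/Users/Downloads/cvx/" + pdf[9:]
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create the target folder if missing
    urls = root_url + pdf
    r = requests.get(urls, timeout=60)
    r.raise_for_status()
    with open(path, "wb") as f:  # the file is closed even if the write fails
        f.write(r.content)
    return urls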

url = "https://web.stanford.edu/class/ee364b/lectures.html"
root_url = "https://web.stanford.edu/class/ee364b/"
#print(GetList(GetPage(url)))
pdfs = GetList(GetPage(url))
for pdf in pdfs:
    print("Download finished: "+DownloadPdf(pdf, root_url))
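
Since some of the downloaded files looked blank, a quick sanity check on the downloaded bytes can help tell "the server returned something that is not a PDF" apart from "the PDF itself has nothing in it". The helper below is only a sketch: the name `looks_like_pdf` and the `min_size` threshold are made up for illustration, and it cannot detect a valid PDF whose pages are simply empty.

```python
def looks_like_pdf(data: bytes, min_size: int = 1024) -> bool:
    """Return True if `data` plausibly holds a PDF file.

    min_size is an arbitrary illustrative threshold (an assumption),
    not a value taken from the original script.
    """
    # PDF files normally begin with the "%PDF-" header.
    return len(data) >= min_size and data.startswith(b"%PDF-")
```

Inside DownloadPdf, one could call `looks_like_pdf(r.content)` before writing the file and print a warning when it returns False.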

I also plan to download the lecture slides for Stanford's cs224n so I can read them at my own pace, and will adapt this code for that.
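
One way to make that adaptation easier is to turn the course-specific parts (the page URL and the link filter) into parameters. The sketch below is an illustration under assumptions, not the script above: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra installs, it only looks at `<a>` tags, and the names `_LinkCollector`, `collect_pdf_urls`, and `must_contain` are made up here. It resolves relative links with `urljoin` rather than concatenating a root URL by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_pdf_urls(html, page_url, must_contain=""):
    """Return absolute URLs of the PDF links found in `html`.

    page_url     -- URL the HTML came from, used to resolve relative links
    must_contain -- optional substring filter such as "lectures/"
                    (a hypothetical parameter, for illustration only)
    """
    parser = _LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        if href.lower().endswith(".pdf") and must_contain in href:
            absolute = urljoin(page_url, href)
            if absolute not in urls:  # keep page order, drop duplicates
                urls.append(absolute)
    return urls
```

Each returned URL could then be fetched with `requests.get` as in DownloadPdf. For a different course page, only the page URL and the filter string would need to change, assuming that page also links to its slides as plain `.pdf` files.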