Building an automated YouTube downloader in Python: the code

> Per the [savefrom terms](https://en.savefrom.net/terms.html):
> this example and tutorial are for learning and discussion only; all rights belong to **savefrom.net**.
> The final code plus comments comes to roughly 100 lines. Treat the code on GitHub as authoritative (bugs may get fixed there); this post only walks through the details.

# Project address

[GitHub repository](https://github.com/Nambers/YoutubeDownloader)

# Approach

[Building an automated YouTube downloader in Python: the approach](https://blog.csdn.net/qq_40832960/article/details/112470584)

# Workflow

## 1. post

Following the first step of the approach, we first need a `post` request to fetch the encrypted JavaScript. I used the third-party `requests` library for this; for background on crawlers, see [my earlier article](https://blog.csdn.net/qq_40832960/article/details/103854145).

### i. Format the headers for the post

```python
# set the headers, otherwise the website will not return any information
# you may need to change the cookie below
headers = {
    "cache-Control": "no-cache",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
              "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
    "content-type": "application/x-www-form-urlencoded",
    "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
              "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
              "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
              "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
              "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
              "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
    "origin": "https://en.savefrom.net",
    "pragma": "no-cache",
    "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
    "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
    "sec-ch-ua-mobile": "?0",
    "sec-fetch-dest": "iframe",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/87.0.4280.88 Safari/537.36"}
```
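The cookie in the headers above was captured from one browser session, so it can help to keep it out of the header literal. The sketch below is only an illustration: `parse_cookie_header` is a hypothetical helper (not part of the original project) that turns a cookie string copied from the browser's developer tools into a dict.

```python
def parse_cookie_header(raw):
    """Split a raw "name=value; name2=value2" cookie string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing ";"
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies


# a shortened sample of the cookie string used in this post
raw_cookie = "lang=en; country=CN; sfHelperDist=72; x-requested-with=; "
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

The resulting dict can be passed to `requests.post` through its `cookies=` argument; in that case the `"cookie"` entry would be dropped from `headers`.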
The `cookie` value may need changing; ideally use the one from your own browser. What each header means is beyond the scope of this post, and a search engine will tell you.

### ii. Then format the parameters as well

```python
# set the parameters; they can be read from Chrome
kv = {"sf_url": url,
      "sf_submit": "",
      "new": "1",
      "lang": "en",
      "app": "",
      "country": "cn",
      "os": "Windows",
      "browser": "Chrome"}
```

The `sf_url` field is the URL of the YouTube video we want to download; the other parameters stay the same.

### iii. Finally, send the post request with `requests`

```python
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers, data=kv)
r.raise_for_status()
```

Note that it is `data=kv`.

### iv. Wrap it in a function

```python
import requests


def gethtml(url):
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the parameters; they can be read from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers, data=kv)
    r.raise_for_status()
    # return the response body
    return r.text
```

## 2. Call the decryption function

## i. Analysis

The hard part here is running JavaScript code from Python. Solutions found online include `PyV8` and others; this post uses `execjs` (installed from PyPI as `PyExecJS`). In the approach post we saw that the last few lines of the JS are the decryption call, so all we need to do is run the whole script once in `execjs` and then run the decryption call on its own.

## ii. First extract the JS part

```python
# target (YouTube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the response text
reo = gethtml(url)
# Remove the code before and after the JavaScript part (the information is stored encrypted in the JS).
# NOTE: the separator string was lost from the published text (an empty separator raises ValueError);
# see the GitHub repository for the exact value.
reo = reo.split("")[0]
```

A regular expression would also work here, but I'm not yet fluent with regex, so I just used `split`.

## iii. Take the first decryption call as the one we use

If you fetch the results for several different videos, you will find that the decryption function differs each time, but it always sits on the same line.

```python
# split into lines (this helps us find the decryption call in the last few lines)
reA = reo.split("\n")
# get the decryption call: the first statement on the third line from the end
name = reA[-3].split(";")[0] + ";"
```

So `name` is our decryption call (not the best variable name, haha).

## iv. Run it with execjs

```python
import execjs

# use execjs to compile and run the js code
ct = execjs.compile(reo)
# do the decryption: evaluate only the expression to the right of "="
text = ct.eval(name.split("=")[1].replace(";", ""))
```

Taking only the part after `=` and dropping the semicolon means we just call the function without the assignment; running the assignment plus decryption first and then reading the value would work too.

But this fails immediately (if only it were that simple).

### 1. `this`, i.e. the `window` variable, does not exist

If I remember correctly, the error mentions `this` or `$b`. I tried removing every `this`, and also wrapping everything in a `class` (so that `this` would refer to that class), but neither worked. Then I found that the `jsdom` package on `npm` can emulate the `window` variable inside `execjs` (there is probably a better way). So we need to install `npm` and the `jsdom` package, and then rewrite the code above:

```python
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`

Hello world

`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to run the js code; cwd is the output of `npm root -g` (npm's global modules path on your machine)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')
```

Here:

- `cwd` is the output of `npm root -g`, i.e. npm's modules path
- `addition` is what emulates `window`

But then we hit the next error.

### 2. alert does not exist

This error appears because calling `alert` under `execjs` is meaningless: there is no browser to show the popup. Also, `alert` is normally defined on `window`, and we have supplied our own `window`. So we need to define an overriding `alert` function at the start of the code.

```python
# Override alert: the script calls it in one place, and alerting is meaningless under execjs.
# Without the override, the script raises an error.
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
```

## v. Putting the code together

```python
# target (YouTube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the response text
reo = gethtml(url)
# Remove the code before and after the JavaScript part (the information is stored encrypted in the JS).
# NOTE: the separator string was lost from the published text (an empty separator raises ValueError);
# see the GitHub repository for the exact value.
reo = reo.split("")[0]
# Override alert: the script calls it in one place, and alerting is meaningless under execjs.
# Without the override, the script raises an error.
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# split into lines (this helps us find the decryption call in the last few lines)
reA = reo.split("\n")
# get the decryption call: the first statement on the third line from the end
name = reA[-3].split(";")[0] + ";"
# add jsdom because the script needs a window (there may be a solution without jsdom, but I don't know one)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`

Hello world

`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to run the js code; cwd is the output of `npm root -g` (npm's global modules path on your machine)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# do the decryption: evaluate only the expression to the right of "="
text = ct.eval(name.split("=")[1].replace(";", ""))
```

## 3. Analyse the decrypted result

### i. Extract the key JSON

After running the part above, the decrypted result is stored in `text`. As the approach post showed, what really matters to us is the JSON passed to `window.parent.sf.videoResult.show()`, so we use a regular expression to pull out that JSON.

```python
import re

# get the JSON argument of show(...); the capture group holds the text between "show(" and ");;"
result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
```

### ii. Parse the JSON

Python has plenty of libraries that can parse JSON; I used the built-in `json` module here (remember to import it).

```python
import json

# use `json` to load the JSON text
j = json.loads(result)
```

### iii. Get the download URL

Now for the last step. From the approach post and a JSON formatter we can see that `j["url"][num]["url"]` is the download link, where `num` selects the video format we want (different resolutions and types).

```python
# the selected video (in this case, num=1 means the video is
# - 360p,       known from j["url"][num]["quality"]
# - MP4,        known from j["url"][num]["type"]
# - with audio, known from j["url"][num]["audio"])
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -
```
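Hard-coding `num = 1` relies on the entries always arriving in the same order. As a hedged sketch, the hypothetical helper `pick_format` below scans `j["url"]` for the first entry whose `quality` and `type` match. The sample data is made up for illustration. The field names come from the comments above, but the exact values savefrom returns (for example `"360"` versus `"360p"`, or the casing of `type`) are assumptions to check against a real response.

```python
def pick_format(entries, quality, file_type):
    """Return the first entry matching quality and type (case-insensitive), or None."""
    for entry in entries:
        same_quality = str(entry.get("quality", "")).lower() == quality.lower()
        same_type = str(entry.get("type", "")).lower() == file_type.lower()
        if same_quality and same_type:
            return entry
    return None


# made-up stand-in for the parsed JSON `j`; real responses may look different
j = {"url": [
    {"url": "https://example.com/a", "quality": "720", "type": "mp4", "audio": False},
    {"url": "https://example.com/b", "quality": "360", "type": "mp4", "audio": True},
]}

chosen = pick_format(j["url"], "360", "MP4")
downurl = chosen["url"] if chosen else None
print(downurl)
```

Returning `None` instead of raising leaves it to the caller to fall back to another format when the requested one is missing.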
# Full code

```python
# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
#           - windows 10
#           - python 3.6.2
# @Dependence:
#           - jsdom from npm (also works on Windows)
#           - requests, execjs, re, json in python
import requests
import execjs
import re
import json


def gethtml(url):
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the parameters; they can be read from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers, data=kv)
    r.raise_for_status()
    # return the response body
    return r.text


if __name__ == '__main__':
    # target (YouTube address) url
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the response text
    reo = gethtml(url)
    # Remove the code before and after the JavaScript part (the information is stored encrypted in the JS).
    # NOTE: the separator string was lost from the published text (an empty separator raises ValueError);
    # see the GitHub repository for the exact value.
    reo = reo.split("")[0]
    # Override alert: the script calls it in one place, and alerting is meaningless under execjs.
    # Without the override, the script raises an error.
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps us find the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption call: the first statement on the third line from the end
    name = reA[-3].split(";")[0] + ";"
    # add jsdom because the script needs a window (there may be a solution without jsdom, but I don't know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`

Hello world

    `);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to run the js code; cwd is the output of `npm root -g` (npm's global modules path on your machine)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption: evaluate only the expression to the right of "="
    text = ct.eval(name.split("=")[1].replace(";", ""))
    # get the JSON argument of show(...); the capture group holds the text between "show(" and ");;"
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
    # use `json` to load the JSON text
    j = json.loads(result)
    # the selected video (in this case, num=1 means the video is
    # - 360p,       known from j["url"][num]["quality"]
    # - MP4,        known from j["url"][num]["type"]
    # - with audio, known from j["url"][num]["audio"])
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
```

- 102 lines in total (in the original file)
- Development environment

```python
# @Environment:
#           - windows 10
#           - python 3.6.2
```

- Dependencies

```python
# @Dependence:
#           - jsdom from npm (also works on Windows)
#           - requests, execjs, re, json in python
```
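Step 2.ii notes that a regular expression could replace `split` for cutting the JavaScript out of the response. A minimal sketch of that idea follows, assuming the payload sits inside a single `<script>` element. The sample HTML and the helper name `extract_script` are invented for illustration, and the real savefrom response may be delimited differently.

```python
import re

# DOTALL lets "." span newlines; the non-greedy group stops at the first </script>
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def extract_script(html):
    """Return the body of the first <script> element, or None if there is none."""
    match = SCRIPT_RE.search(html)
    return match.group(1) if match else None


# invented sample standing in for the text returned by gethtml()
sample = '<html><script type="text/javascript">\nvar a = 1;\nvar b = 2;\n</script></html>'
js = extract_script(sample)
print(js.strip().split("\n"))
```

Compared with chained `split` calls, the regex fails softly (returning `None`) when the expected tags are missing, which makes a changed page layout easier to detect.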
-end-
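> For crawlers
> Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
> Author: [https://www.cnblogs.com/Eritque-arcus/](https://www.cnblogs.com/Eritque-arcus/) or [https://blog.csdn.net/qq_40832960](https://blog.csdn.net/qq_40832960)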