Downloading files from kaggle.com with wget

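Datasets on kaggle.com can be fairly large, and Kaggle provides no cloud-drive mirror for them. Download speeds from within China are very slow, and because the download requires authentication, a download accelerator such as Xunlei cannot be used either.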

A thread on the Kaggle forum describes how to download with wget [1]:

The approach is to log in to kaggle.com first, note the cookie in the browser, save it to cookies.txt, and then run the following command:

wget -x --load-cookies cookies.txt -P data -nH --cut-dirs=5 http://www.kaggle.com/c/avazu-ctr-prediction/download/test.gz
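One detail the forum recipe leaves implicit is the layout of cookies.txt: wget's --load-cookies reads the Netscape cookie file format, one cookie per line with seven tab-separated fields. The following is a minimal sketch of writing such a file by hand. The cookie name and value are hypothetical placeholders (the real ones have to be copied from the browser), and the helper name cookie_line is made up for this example.

```shell
#!/usr/bin/env bash
# Print one cookie in the Netscape cookies.txt layout: seven tab-separated fields
# (domain, include-subdomains flag, path, secure flag, expiry as Unix time, name, value).
cookie_line() {
    local name=$1 value=$2 expiry=$3
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        '.kaggle.com' 'TRUE' '/' 'TRUE' "$expiry" "$name" "$value"
}

# Assumption: the cookie should stay valid for one day from now.
expiry=$(( $(date +%s) + 86400 ))

# Hypothetical cookie name and value; replace them with the ones from the browser.
{
    printf '# Netscape HTTP Cookie File\n'
    cookie_line 'session_cookie_name' 'session_cookie_value' "$expiry"
} > cookies.txt
```

Many browser extensions can export cookies in this same format, which avoids typing the fields by hand.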

However, the command finished almost immediately and downloaded only 14 KB, so something was clearly wrong:
$ wget -x --load-cookies cookies.txt https://www.kaggle.com/c/avazu-ctr-prediction/download/test.gz
--2015-11-02 23:35:29--  https://www.kaggle.com/c/avazu-ctr-prediction/download/test.gz
Resolving www.kaggle.com (www.kaggle.com)... 168.62.224.124
Connecting to www.kaggle.com (www.kaggle.com)|168.62.224.124|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /account/login?ReturnUrl=%2fc%2favazu-ctr-prediction%2fdownload%2ftest.gz [following]
--2015-11-02 23:35:32--  https://www.kaggle.com/account/login?ReturnUrl=%2fc%2favazu-ctr-prediction%2fdownload%2ftest.gz
Reusing existing connection to www.kaggle.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 14687 (14K) [text/html]
Saving to: ‘www.kaggle.com/c/avazu-ctr-prediction/download/test.gz’

100%[===========================================================================================>] 14,687      --.-K/s   in 0.03s

2015-11-02 23:35:33 (450 KB/s) - ‘www.kaggle.com/c/avazu-ctr-prediction/download/test.gz’ saved [14687/14687]

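The log shows that the request was redirected to "https://www.kaggle.com/account/login?ReturnUrl=%2fc%2favazu-ctr-prediction%2fdownload%2ftest.gz". The 14 KB that was saved is the HTML login page (note the [text/html] content type), not the dataset.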
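A failure like this is easy to miss because the saved file still carries the name test.gz. One way to catch it is to check whether the file really starts with the gzip magic bytes (1f 8b). The sketch below does that; the function name is_gzip is made up for this example, and the path in the usage comment is simply the one that wget -x produced in the log above.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) only if the file starts with the gzip magic bytes 1f 8b.
is_gzip() {
    local magic
    # Read the first two bytes and print them as hex without spaces or newlines.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Usage (path taken from the log above):
# is_gzip www.kaggle.com/c/avazu-ctr-prediction/download/test.gz \
#     || echo "not gzip data, probably the login page"
```

Running gzip -t on the file is a stricter alternative, since it also verifies the integrity of the compressed stream.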

So instead we use wget's --post-data option to submit the username and password:

$ wget 'https://www.kaggle.com/account/login?ReturnUrl=%2fc%2favazu-ctr-prediction%2fdownload%2ftest.gz' --post-data 'username=login_name&password=login_password'
The download then proceeds normally:
--2015-11-02 23:37:18--  https://www.kaggle.com/account/login?ReturnUrl=%2fc%2favazu-ctr-prediction%2fdownload%2ftest.gz
Resolving www.kaggle.com (www.kaggle.com)... 168.62.224.124
Connecting to www.kaggle.com (www.kaggle.com)|168.62.224.124|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /c/avazu-ctr-prediction/download/test.gz [following]
--2015-11-02 23:37:19--  https://www.kaggle.com/c/avazu-ctr-prediction/download/test.gz
Reusing existing connection to www.kaggle.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://kaggle2.blob.core.windows.net/competitions-data/kaggle/4120/test.gz?sv=2012-02-12&se=2015-11-05T07%3A39%3A03Z&sr=b&sp=r&sig=rKgKT2uZE6B4sLTirB1qdR8o262a9BgQPh233olSedg%3D [following]
--2015-11-02 23:37:20--  https://kaggle2.blob.core.windows.net/competitions-data/kaggle/4120/test.gz?sv=2012-02-12&se=2015-11-05T07%3A39%3A03Z&sr=b&sp=r&sig=rKgKT2uZE6B4sLTirB1qdR8o262a9BgQPh233olSedg%3D
Resolving kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)... 23.98.55.152
Connecting to kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)|23.98.55.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123803952 (118M) [application/x-gzip]
Saving to: ‘login?ReturnUrl=%2Fc%2Favazu-ctr-prediction%2Fdownload%2Ftest.gz’

 7% [======>                                                                                     ] 9,773,056   28.2KB/s  eta 36m 24s^C

The download is still slow this way, but it can be left running in the background.
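A rough sketch of running the same request in the background follows. It relies on wget's own -b (background), -o (log file) and -O (output file) options; the last one also avoids the awkward "login?ReturnUrl=..." file name seen in the log. The function name fetch_in_background, the log name wget.log and the WGET override variable are assumptions made for this example, and the login form fields are simply the ones used in the command above.

```shell
#!/usr/bin/env bash
# Login URL copied from the command above; it redirects to the file after a successful POST.
LOGIN_URL='https://www.kaggle.com/account/login?ReturnUrl=%2fc%2favazu-ctr-prediction%2fdownload%2ftest.gz'

fetch_in_background() {
    # -b  detach and keep running in the background right after startup
    # -o  write wget's progress log to wget.log (hypothetical file name)
    # -O  save the result as test.gz rather than 'login?ReturnUrl=...'
    # WGET is a hypothetical override: set WGET=echo to preview the command.
    "${WGET:-wget}" -b -o wget.log -O test.gz \
        --post-data 'username=login_name&password=login_password' \
        "$LOGIN_URL"
}

# Usage:
# WGET=echo fetch_in_background   # print the command without running it
# fetch_in_background             # start the download, then: tail -f wget.log
```

Note that the password is visible in the process list and the shell history while this runs, so it is best used on a machine you control.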

References:

[1]  https://www.kaggle.com/forums/f/15/kaggle-forum/t/6604/downloading-data-via-command-line