Batch-downloading SCI papers with Python. Still scraping Tieba for images? Try batch-downloading SCI papers instead: fetch them in bulk by title or DOI with Sci-Hub, a handy research download tool.
Posted by 阿新 on 2019-01-06
Last night I was downloading SCI papers, 295 of them in total. Downloading each one by hand would be exhausting!
So I wondered whether there was a way to batch-download SCI papers.
From Web of Science, export the titles, DOIs, and other fields as a txt file, then filter out the DOI and title columns and save them to a new file.
Loop over each DOI and title, download the paper, and save it using the title as the file name.
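The filtering step above can be sketched as follows. This is my own minimal sketch, assuming a tab-delimited Web of Science export whose header row uses the standard WoS field tags TI (title) and DI (DOI); the function names and the sample data are placeholders, not from the original post.

```python
import re

def parse_wos_export(text):
    """Return (title, doi) pairs from a tab-delimited Web of Science export.
    Records without a DOI are skipped, since they cannot be fetched by DOI."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    ti, di = header.index("TI"), header.index("DI")  # title and DOI columns
    records = []
    for ln in lines[1:]:
        cols = ln.split("\t")
        if len(cols) > max(ti, di) and cols[di].strip():
            records.append((cols[ti].strip(), cols[di].strip()))
    return records

def safe_filename(title):
    """Replace characters that are illegal in file names and cap the length,
    so the paper title can be used directly as the PDF file name."""
    return re.sub(r'[\\/:*?"<>|]', "_", title)[:120] + ".pdf"
```

The resulting (title, doi) list is what the download loop iterates over.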
The program is based on the following repository:
https://github.com/zaytoun/scihub.py
Setup
pip install -r requirements.txt
Usage
You can interact with scihub.py from the command line:
usage: scihub.py [-h] [-d (DOI|PMID|URL)] [-f path] [-s query] [-sd query]
                 [-l N] [-o path] [-v]

SciHub - To remove all barriers in the way of science.

optional arguments:
  -h, --help            show this help message and exit
  -d (DOI|PMID|URL), --download (DOI|PMID|URL)
                        tries to find and download the paper
  -f path, --file path  pass file with list of identifiers and download each
  -s query, --search query
                        search Google Scholars
  -sd query, --search_download query
                        search Google Scholars and download if possible
  -l N, --limit N       the number of search results to limit to
  -o path, --output path
                        directory to store papers
  -v, --verbose         increase output verbosity
  -p, --proxy           set proxy
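For instance, the single-download (-d) and batch (-f) flags could be invoked like this; the DOI, file names, and output directory here are placeholders, not values from the original post:

```shell
# download one paper by DOI into papers/
python scihub.py -d 10.1000/example.doi -o papers/

# download every identifier listed (one per line) in dois.txt
python scihub.py -f dois.txt -o papers/
```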
You can also import scihub. The following examples demonstrate all the features.
fetch
from scihub import SciHub
sh = SciHub()
# fetch specific article (don't download to disk)
# this will return a dictionary in the form
# {'pdf': PDF_DATA,
#  'url': SOURCE_URL,
#  'name': UNIQUE_GENERATED_NAME
# }
result = sh.fetch('http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1648853')
download
from scihub import SciHub
sh = SciHub()
# exactly the same thing as fetch except downloads the articles to disk
# if no path given, a unique name will be used as the file name
result = sh.download('http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1648853', path='paper.pdf')
search
from scihub import SciHub
sh = SciHub()
# retrieve 5 articles on Google Scholar related to 'bittorrent'
results = sh.search('bittorrent', 5)
# download the papers; will use sci-hub.io if it must
for paper in results['papers']:
sh.download(paper['url'])
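To actually batch-download the 295 papers, the download call above can be wrapped in a loop that names each PDF after its title. This is only a sketch under my own assumptions: run_batch and the (title, doi) record format are hypothetical, and sh stands for a SciHub() instance from the library above.

```python
import os
import re

def run_batch(records, sh, out_dir="."):
    """Download each (title, doi) pair via a SciHub instance,
    saving every PDF under a sanitized version of its title."""
    for title, doi in records:
        # strip characters that are illegal in file names, cap the length
        name = re.sub(r'[\\/:*?"<>|]', "_", title)[:120] + ".pdf"
        sh.download(doi, path=os.path.join(out_dir, name))
```

Fed with the DOI/title list filtered out of the Web of Science export, this replaces 295 manual downloads with one loop.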
But Sci-Hub throws up CAPTCHAs. How can they be handled?
The CAPTCHA causes downloads to fail, so cracking CAPTCHA recognition is the key problem!
I'll give it another try when I have time.