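[Jikexueyuan] Python Study Notes 3: Single-Threaded Crawler (problems installing requests and how I solved them; extracting information with requests)
Jikexueyuan course URL: http://www.jikexueyuan.com/course/821_2.html?ss=1
Task:
Crawl the course library on the official Jikexueyuan website and save it.
Requests: introduction and installation:
HTTP for Humans
A third-party Python library for connecting to web pages: more automated, more complete, and friendlier to use.
A page can be fetched in as few as 4 lines of code.
Install on Linux: sudo pip install requests
If the installation hits a wall, use the following method instead: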
http://www.lfd.uci.edu/~gohlke/pythonlibs/
Find requests on that page.
Download the .whl file.
Rename the .whl extension to .zip and extract it.
This gives two folders: requests and ...info
Copy both folders into Python/lib/
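Problems encountered in practice while installing requests:
Python installation directory on Mac:
http://blog.csdn.net/guo_hongjun1611/article/details/39780089
Installing pip on Mac
How do you verify on a Mac that the pip package manager is installed correctly?
Type pip and press Enter; if it is installed correctly, it prints pip's help text.
pip is a tool for installing Python packages. It can install packages, list installed packages, upgrade packages, and uninstall packages.
pip is a replacement for easy_install and offers the same package lookup,
so any package that can be installed with easy_install can also be installed with pip.
Installing pip
pip can be installed from a source package, with easy_install, or with a script.
Problems I ran into installing requests: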
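Typing pip in a terminal only shows whether a pip command is on the PATH. A complementary check, sketched below, asks a specific interpreter whether it can import pip or requests. The helper name is_installed is my own, and the sketch assumes Python 3.4 or later (for importlib.util.find_spec).

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if this interpreter can import the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # sys.executable shows which Python is answering the question.
    print("Interpreter:", sys.executable)
    for name in ("pip", "requests"):
        status = "available" if is_installed(name) else "missing"
        print(name, "->", status)
```

Running the same script with each installed Python (for example python and python3) shows which of them actually has the package.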
$ sudo pip install requests
Password:
sudo: pip: command not found
Was pip simply not installed? I started searching for ways to install pip:
$ easy_install pip
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-21725.pth'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /Library/Python/2.7/site-packages/

Perhaps your account does not have write access to this directory? If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account. If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.

For information on other options, you may wish to consult the
documentation at:

    https://pythonhosted.org/setuptools/easy_install.html

Please make the appropriate changes for your system and try again.
$ sudo easy_install pip
Searching for pip
Reading https://pypi.python.org/simple/pip/
Download error on https://pypi.python.org/simple/pip/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
Couldn't find index page for 'pip' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
Download error on https://pypi.python.org/simple/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
No local packages or download links found for pip
error: Could not find suitable distribution for Requirement.parse('pip')
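None of the methods above worked. Then I took a closer look at this post:
http://stackoverflow.com/questions/17271319/installing-pip-on-mac-os-x
I found pip3 in /usr/local/bin, and the command below installed requests successfully. pip3 was already there, so there was no need to install pip.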
$ sudo pip3 install requests
Successfully installed requests-2.9.1
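How to search for files on Linux
http://jingyan.baidu.com/article/335530dab6fe0919ca41c365.html
find: walks the directory tree. For example, to find every file named interfaces under the root directory, run "find / -name 'interfaces'".
locate: queries a database (/var/lib/locatedb) that indexes all local files. To search for interfaces files, run "locate interfaces".
whereis: locates the binary (executable) for a command, along with its source and manual pages where available. To look up grep, run "whereis grep".
which: checks whether a command exists and returns its location. To check grep, run "which grep".
type: reports whether a command is built into the shell. For example, "type cd" and "type grep".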
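The which lookup can also be done from Python, which is handy for checking several candidate commands at once. This is a minimal sketch, assuming Python 3.3 or later for shutil.which. The helper name locate_commands and the list of command names are only illustrative.

```python
import shutil


def locate_commands(names):
    """Map each command name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Check the installer commands discussed in these notes.
    for name, path in locate_commands(["pip", "pip3", "easy_install"]).items():
        print(name, "->", path or "not found on PATH")
```

On a machine like the one in these notes, this would be expected to report pip as missing and pip3 as present, though the exact paths depend on the system.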
To see where requests had been installed, I searched for it:
$ find / -name 'requests'
However, import requests in Python still failed,
because the requests package had just been installed into Python 3.5:
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests
whereas my Mac ships with Python 2.7, and PyCharm was using Python 2.6.
How do you make pip install a package into a specific location?
http://stackoverflow.com/questions/2915471/install-a-python-package-into-a-different-directory-using-pip
https://pip.pypa.io/en/latest/reference/pip_install/#cmdoption-t
Which directory should it be installed into?
>>> import sys
>>> print sys.path
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7
There is no site-packages folder under this path, so let's try installing into the directory below instead:
$ cd /Library/Python/2.7/site-packages
$ sudo pip3 install --target=/Library/Python/2.7/site-packages requests
OK, installation finished; the built-in Python 2.7 can now use requests as well.
See also this fairly detailed guide to using pip:
http://www.ttlsa.com/python/how-to-install-and-use-pip-ttlsa/
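The confusion above came from pip3 belonging to a different interpreter than the one running the code. As a hedged sketch, the snippet below asks the current interpreter where its own site-packages directory is, and builds (without running) a pip command tied to that interpreter. It assumes pip is available as a module for that interpreter, which was not the case for the system Python 2.7 in these notes. The helper names site_packages_dir and pip_install_command are my own.

```python
import sys
import sysconfig


def site_packages_dir():
    """Return the directory where this interpreter installs pure-Python packages."""
    return sysconfig.get_paths()["purelib"]


def pip_install_command(package, target=None):
    """Build (but do not run) a pip command bound to the current interpreter."""
    command = [sys.executable, "-m", "pip", "install"]
    if target is not None:
        # --target installs into an explicit directory, as in the notes above.
        command += ["--target", target]
    command.append(package)
    return command


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("site-packages:", site_packages_dir())
    print("Command:", " ".join(pip_install_command("requests")))
```

Invoking pip as "python -m pip" ties the installation to whichever python runs the command, so the package lands where that interpreter will look for it.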
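With requests installed, it's time to use it:
Fetching the source directly
Fetching the source with modified HTTP headers
Running the code below fetches the source of the Python forum front page on Baidu Tieba: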
import requests
html=requests.get('http://tieba.baidu.com/f?ie=utf-8&kw=python')
print (html.text)
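The three lines above assume the request always succeeds. A slightly more defensive variant might look like the sketch below. The function name fetch_html and the 10-second timeout are my own choices, not something from the course.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        print("Request failed:", error)
        return None
    return response.text
```

Calling fetch_html('http://tieba.baidu.com/f?ie=utf-8&kw=python') should return the same text as html.text above when the site is reachable, and None instead of a traceback when it is not.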
Why can't the same requests.get call fetch the following URL, even though the page clearly has source? (In fact, I can fetch this URL now.)
http://jp.tingroom.com/yuedu/yd300p/
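Because the site checks which program is visiting it.
So we add headers, which act like a mask and make the site believe we are visiting through a browser. The dictionary has the form below, with the browser's User-Agent string as the value:
headers = {'User-Agent': '<User-Agent string copied from your browser>'}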
#-*-coding:utf8-*-
import requests
#html=requests.get('http://tieba.baidu.com/f?ie=utf-8&kw=python')
#html=requests.get('http://jp.tingroom.com/yuedu/yd300p/')
head={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
html=requests.get('http://jp.tingroom.com/yuedu/yd300p/',headers = head)
html.encoding = 'utf-8'
print (html.text)
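headers is a dictionary, and User-Agent is its key.
How do you get the User-Agent?
Inspect Element → Network → click any entry → Headers → copy the User-Agent from Request Headers at the bottom.
Continuing
The basic principle of a single-threaded crawler: use requests to fetch the page source, then use regular expressions to extract the content you care about.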
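To see what the "mask" actually changes, the sketch below prepares a request without sending it and prints the User-Agent the site would receive, with and without the custom headers. Nothing goes over the network. The helper name outgoing_user_agent is my own; the URL and the browser string are the ones used in these notes.

```python
import requests

URL = "http://jp.tingroom.com/yuedu/yd300p/"
BROWSER_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/47.0.2526.106 Safari/537.36")


def outgoing_user_agent(url, headers=None):
    """Prepare a GET request without sending it and return its User-Agent."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=headers)
    # prepare_request merges the session's default headers with ours.
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


default_ua = outgoing_user_agent(URL)
masked_ua = outgoing_user_agent(URL, headers={"User-Agent": BROWSER_UA})
print("Without headers:", default_ua)
print("With headers:   ", masked_ua)
```

Without custom headers, requests identifies itself as python-requests followed by its version number, which is easy for a site to recognise as a script. Passing the dictionary replaces that value with the browser string.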
#-*-coding:utf8-*-
import requests
import re
#html=requests.get('http://tieba.baidu.com/f?ie=utf-8&kw=python')
html=requests.get('http://jp.tingroom.com/yuedu/yd300p/')
#head={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
#html=requests.get('http://jp.tingroom.com/yuedu/yd300p/',headers = head)
html.encoding = 'utf-8'
title = re.findall('color:#666666;">(.*?)</span>', html.text, re.S)
for each in title:
    print(each)
chinese = re.findall('color: #039;">(.*?)</a>', html.text, re.S)
for each in chinese:
    print(each)
#print (html.text)
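The two patterns can be tried offline on a small string, which also shows why re.S is passed. The HTML fragment below is hypothetical: it only imitates the inline styles the patterns match on, and the real page markup may differ.

```python
import re

# Hypothetical fragment; the first span deliberately contains a line break.
SAMPLE_HTML = """
<p><span style="color:#666666;">First sentence,
continued on a second line</span></p>
<p><a href="#" style="color: #039;">First translation</a></p>
<p><span style="color:#666666;">Second sentence</span></p>
"""

TITLE_PATTERN = r'color:#666666;">(.*?)</span>'
CHINESE_PATTERN = r'color: #039;">(.*?)</a>'

# With re.S, "." also matches newlines, so multi-line spans are captured.
with_dotall = re.findall(TITLE_PATTERN, SAMPLE_HTML, re.S)
# Without re.S, a match cannot cross a line break, so the first span is missed.
without_dotall = re.findall(TITLE_PATTERN, SAMPLE_HTML)
translations = re.findall(CHINESE_PATTERN, SAMPLE_HTML, re.S)

print(len(with_dotall), "matches with re.S")
print(len(without_dotall), "matches without re.S")
print(translations)
```

The non-greedy (.*?) stops at the first closing tag after each match, so each span or link is captured separately rather than as one long match.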