1. Setting Up a Python 3 Crawler Environment: Anaconda and Scrapy
阿新 • Published: 2019-02-02
Python 3 is only the programming language; building crawlers also requires a number of other pieces, such as an IDE and common libraries. Based on my experience, I recommend the setup steps below. The desktop environment is Windows 10.
- Installing Anaconda
Anaconda is a highly integrated, Python-based data science platform that handles both crawler development and machine learning with ease. It bundles more than 250 data science packages along with its own package manager, conda; a single command installs the vast majority of dependencies, such as scikit-learn, SciPy, and TensorFlow.
Installing the software is just a matter of following the prompts. The only thing to watch for is the install directory: it should ideally contain only English characters and no spaces.
A few of the bundled applications see the most use. After installation, Anaconda has already configured its own system environment and a Python 3 environment for us, so installing dependencies usually just means running conda commands directly in the Anaconda Prompt terminal.
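As a sketch of that one-command workflow (run inside the Anaconda Prompt; the particular packages here are just an example, and availability depends on your configured channels):

```shell
# Install several common data science dependencies in a single conda command.
conda install scikit-learn scipy
```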
For example, the following command lists the currently configured environments and their paths:
>conda env list
# conda environments:
#
base * D:\ProgramFiles\Anaconda
Use the following command to see which python executables are found on the PATH:
>where python
D:\ProgramFiles\Anaconda\python.exe
Check the version of the Python currently in use:
>python --version
Python 3.6.3 :: Anaconda custom (64-bit)
List the packages already installed in the current environment:
>conda list
# packages in environment at D:\ProgramFiles\Anaconda:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0            py36he6757f0_0
alabaster                 0.7.10           py36hcd07829_0
anaconda                  custom           py36h363777c_0
anaconda-client           1.6.14                   py36_0
anaconda-navigator        1.8.3                    py36_0
anaconda-project          0.8.0            py36h8b3bf89_0
asn1crypto                0.22.0           py36h8e79faa_1
astroid                   1.5.3            py36h9d85297_0
astropy                   2.0.2            py36h06391c4_4
babel                     2.5.0            py36h35444c1_0
...
(output truncated; the full list runs to several hundred packages, including beautifulsoup4, lxml, numpy, pandas, requests, scikit-learn, and scipy)
- Installing Scrapy
Scrapy is one of the most widely used crawler frameworks. The official site gives the following install command:
conda install -c conda-forge scrapy
However, after installing this way I ran into the following errors:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/libssh2-1.8.0-vc14_2.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/noarch/hyperlink-17.3.1-py_0.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/pydispatcher-2.0.5-py36_0.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/yaml-0.1.7-vc14_0.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/qt-5.6.2-vc14_1.tar.bz2>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
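Errors like these mean conda could not reach the conda-forge channel. One common workaround, before abandoning the channel entirely, is to configure a geographically closer mirror. The commands below are a sketch: the Tsinghua TUNA mirror URL is one widely used example, and you should substitute whichever mirror is fast from your network.

```shell
# Add a mirror of the main Anaconda channel (example URL; pick a mirror near you).
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
# Show full channel URLs in install output, to confirm where packages come from.
conda config --set show_channel_urls yes
# Retry the installation through the newly added channel.
conda install scrapy
```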
Some packages failed to install, likely because downloads through the channel used by that command were too slow for the connection to succeed, so I switched to the following approach.
First, check whether conda provides a Scrapy package for the current Python version:
>conda search scrapy
Loading channels: done
# Name Version Build Channel
scrapy 0.16.4 py26_0 pkgs/free
scrapy 0.16.4 py27_0 pkgs/free
scrapy 0.24.4 py27_0 pkgs/free
scrapy 1.0.1 py27_0 pkgs/free
scrapy 1.0.3 py27_0 pkgs/free
scrapy 1.1.1 py27_0 pkgs/free
scrapy 1.1.1 py34_0 pkgs/free
scrapy 1.1.1 py35_0 pkgs/free
scrapy 1.1.1 py36_0 pkgs/free
scrapy 1.3.3 py27_0 pkgs/free
scrapy 1.3.3 py35_0 pkgs/free
scrapy 1.3.3 py36_0 pkgs/free
scrapy 1.4.0 py27h4eaa785_1 pkgs/main
scrapy 1.4.0 py35h054a469_1 pkgs/main
scrapy 1.4.0 py36h764da0a_1 pkgs/main
scrapy 1.5.0 py27_0 pkgs/main
scrapy 1.5.0 py35_0 pkgs/main
scrapy 1.5.0 py36_0 pkgs/main
My Python version is 3.6, and the last line of the list is the newest Scrapy build for Python 3.6, so I installed it with:
>conda install scrapy
Solving environment: done
## Package Plan ##
environment location: D:\ProgramFiles\Anaconda
added / updated specs:
- scrapy
The following packages will be downloaded:
package | build
---------------------------|-----------------
attrs-17.4.0 | py36_0 41 KB
pyasn1-0.4.2 | py36h22e697c_0 101 KB
hyperlink-18.0.0 | py36_0 62 KB
openssl-1.0.2o | h8ea7d77_0 5.4 MB
pyasn1-modules-0.2.1 | py36hd1453cb_0 86 KB
pytest-runner-4.2 | py36_0 12 KB
ca-certificates-2018.03.07 | 0 155 KB
scrapy-1.5.0 | py36_0 329 KB
automat-0.6.0 | py36hc6d8c19_0 67 KB
constantly-15.1.0 | py36_0 13 KB
cssselect-1.0.3 | py36_0 28 KB
incremental-17.5.0 | py36he5b1da3_0 25 KB
certifi-2018.4.16 | py36_0 143 KB
pydispatcher-2.0.5 | py36_0 18 KB
------------------------------------------------------------
Total: 6.4 MB
The following NEW packages will be INSTALLED:
attrs: 17.4.0-py36_0
automat: 0.6.0-py36hc6d8c19_0
constantly: 15.1.0-py36_0
cssselect: 1.0.3-py36_0
hyperlink: 18.0.0-py36_0
incremental: 17.5.0-py36he5b1da3_0
parsel: 1.4.0-py36_0
pyasn1: 0.4.2-py36h22e697c_0
pyasn1-modules: 0.2.1-py36hd1453cb_0
pydispatcher: 2.0.5-py36_0
pytest-runner: 4.2-py36_0
queuelib: 1.5.0-py36_0
scrapy: 1.5.0-py36_0
service_identity: 17.0.0-py36_0
twisted: 17.5.0-py36_0
w3lib: 1.19.0-py36_0
zope: 1.0-py36_0
zope.interface: 4.5.0-py36hfa6e2cd_0
The following packages will be UPDATED:
ca-certificates: 2017.08.26-h94faf87_0 --> 2018.03.07-0
certifi: 2017.7.27.1-py36h043bc9e_0 --> 2018.4.16-py36_0
openssl: 1.0.2l-vc14hcac20b0_2 --> 1.0.2o-h8ea7d77_0
Proceed ([y]/n)? y
After choosing y, installation continues:
Downloading and Extracting Packages
attrs 17.4.0################################################################################################### | 100%
pyasn1 0.4.2################################################################################################### | 100%
hyperlink 18.0.0############################################################################################### | 100%
openssl 1.0.2o################################################################################################# | 100%
pyasn1-modules 0.2.1########################################################################################### | 100%
pytest-runner 4.2############################################################################################## | 100%
ca-certificates 2018.03.07##################################################################################### | 100%
scrapy 1.5.0################################################################################################### | 100%
automat 0.6.0################################################################################################## | 100%
constantly 15.1.0############################################################################################## | 100%
cssselect 1.0.3################################################################################################ | 100%
incremental 17.5.0############################################################################################# | 100%
certifi 2018.4.16############################################################################################## | 100%
pydispatcher 2.0.5############################################################################################# | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installation is complete. Finally, run
>conda list
again and check that scrapy now appears in the output.
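Beyond eyeballing the conda list output, you can also verify the install programmatically. The snippet below is a small sketch using only the standard library; check_package is a helper name of my own, not part of conda or Scrapy.

```python
import importlib.util


def check_package(name: str) -> bool:
    """Return True if the named package is importable in this environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # After a successful install, this should report True for scrapy.
    print("scrapy importable:", check_package("scrapy"))
```

This avoids actually importing the package, so it stays fast even for heavy dependencies.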