Part 1: Setting Up a Python 3 Web Crawler Environment with Anaconda and Scrapy

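Python 3 is only the programming language for crawler development. Building a crawler also needs other pieces, such as an IDE and a set of commonly used libraries. Based on my own experience, I recommend the setup steps below. The desktop environment is Windows 10.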

  • Install Anaconda

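Anaconda is a highly integrated, Python-based data science platform that works well for crawler development as well as machine learning. It bundles more than 250 data science packages and ships with its own package manager, conda, so a single command installs most dependencies, for example Scikit-Learn, SciPy, or TensorFlow.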

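Installing it is just a matter of following the installer prompts. The one thing to watch is the installation directory: it should preferably use English characters only, and it must not contain spaces.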
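The path rule above can be checked before running the installer. The sketch below is only an illustration: it reads "English" as "ASCII only", which is my assumption, and the second sample path is a hypothetical counterexample, not one from this setup.

```python
def is_safe_install_path(path):
    """Return True if path is pure ASCII and contains no spaces."""
    try:
        # Assumption: "English" is read here as "ASCII only".
        path.encode("ascii")
    except UnicodeEncodeError:
        return False
    return " " not in path


print(is_safe_install_path(r"D:\ProgramFiles\Anaconda"))   # True
print(is_safe_install_path(r"C:\Program Files\Anaconda"))  # False (contains a space)
```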

After installation, find the three icons shown in the figure below.


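These three applications are the ones used most often. Once installed, Anaconda has already configured its own system environment and a Python 3 environment, so installing a dependency usually just means running a conda command in the Anaconda Prompt terminal.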

For example, the following command lists the configured environments and their paths:

>conda env list
# conda environments:
#
base                  *  D:\ProgramFiles\Anaconda

The following command shows where each python executable on the search path is located:

>where python
D:\ProgramFiles\Anaconda\python.exe

Check the version of the Python currently in use:

>python --version
Python 3.6.3 :: Anaconda custom (64-bit)
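The same information can also be read from inside Python, which is handy when it is unclear which interpreter a script is actually running under. This is a minimal sketch using only the standard library. The function name describe_interpreter is made up for illustration.

```python
import sys


def describe_interpreter():
    """Return (executable path, environment prefix, dotted version string)."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return sys.executable, sys.prefix, version


if __name__ == "__main__":
    executable, prefix, version = describe_interpreter()
    print("python executable:", executable)   # comparable to `where python`
    print("environment root: ", prefix)       # comparable to the path in `conda env list`
    print("python version:   ", version)      # comparable to `python --version`
```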

List the packages already installed in the current environment:

>conda list
# packages in environment at D:\ProgramFiles\Anaconda:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0            py36he6757f0_0
alabaster                 0.7.10           py36hcd07829_0
anaconda                  custom           py36h363777c_0
anaconda-client           1.6.14                   py36_0
anaconda-navigator        1.8.3                    py36_0
anaconda-project          0.8.0            py36h8b3bf89_0
asn1crypto                0.22.0           py36h8e79faa_1
astroid                   1.5.3            py36h9d85297_0
astropy                   2.0.2            py36h06391c4_4
babel                     2.5.0            py36h35444c1_0
backports                 1.0              py36h81696a8_1
backports.shutil_get_terminal_size 1.0.0            py36h79ab834_2
beautifulsoup4            4.6.0            py36hd4cc5e8_1
bitarray                  0.8.1            py36h6af124b_0
bkcharts                  0.2              py36h7e685f7_0
blaze                     0.11.3           py36h8a29ca5_0
bleach                    2.0.0            py36h0a7e3d6_0
bokeh                     0.12.10          py36h0be3b39_0
boto                      2.48.0           py36h1a776d2_1
bottleneck                1.2.1            py36hd119dfa_0
bzip2                     1.0.6            vc14hdec8e7a_1  [vc14]
ca-certificates           2017.08.26           h94faf87_0
cachecontrol              0.12.3           py36hfe50d7b_0
certifi                   2017.7.27.1      py36h043bc9e_0
cffi                      1.10.0           py36hae3d1b5_1
chardet                   3.0.4            py36h420ce6e_1
click                     6.7              py36hec8c647_0
cloudpickle               0.4.0            py36h639d8dc_0
clyent                    1.2.2            py36hb10d595_1
colorama                  0.3.9            py36h029ae33_0
comtypes                  1.1.2            py36heb9b3d1_0
conda                     4.5.1                    py36_0
conda-build               3.0.27           py36h309a530_0
conda-env                 2.6.0                h36134e3_1
conda-verify              2.0.0            py36h065de53_0
console_shortcut          0.1.1                h6bb2dd7_3
contextlib2               0.5.5            py36he5d52c0_0
cryptography              2.0.3            py36h123decb_1
curl                      7.55.1           vc14hdaba4a4_3  [vc14]
cycler                    0.10.0           py36h009560c_0
cython                    0.26.1           py36h18049ac_0
cytoolz                   0.8.2            py36h547e66e_0
dask                      0.15.3           py36h396fcb9_0
dask-core                 0.15.3           py36hd651449_0
datashape                 0.5.4            py36h5770b85_0
decorator                 4.1.2            py36he63a57b_0
distlib                   0.2.5            py36h51371be_0
distributed               1.19.1           py36h8504682_0
docutils                  0.14             py36h6012d8f_0
entrypoints               0.2.3            py36hfd66bb0_2
et_xmlfile                1.0.1            py36h3d2d736_0
fastcache                 1.0.2            py36hffdae1b_0
filelock                  2.0.12           py36hd7ddd41_0
flask                     0.12.2           py36h98b5e8f_0
flask-cors                3.0.3            py36h8a3855d_0
freetype                  2.8              vc14h17c9bdf_0  [vc14]
get_terminal_size         1.0.0                h38e98db_0
gevent                    1.2.2            py36h342a76c_0
glob2                     0.5              py36h11cc1bd_1
greenlet                  0.4.12           py36ha00ad21_0
h5py                      2.7.0            py36hfbe0a52_1
hdf5                      1.10.1           vc14hb361328_0  [vc14]
heapdict                  1.0.0            py36h21fa5f4_0
html5lib                  0.999999999      py36ha09b1f3_0
icc_rt                    2017.0.4             h97af966_0
icu                       58.2             vc14hc45fdbb_0  [vc14]
idna                      2.6              py36h148d497_1
imageio                   2.2.0            py36had6c2d2_0
imagesize                 0.7.1            py36he29f638_0
intel-openmp              2018.0.0             hcd89f80_7
ipykernel                 4.6.1            py36hbb77b34_0
ipython                   6.1.0            py36h236ecc8_1
ipython_genutils          0.2.0            py36h3c5d0ee_0
ipywidgets                7.0.0            py36h2e74ada_0
isort                     4.2.15           py36h6198cc5_0
itsdangerous              0.24             py36hb6c5a24_1
jdcal                     1.3              py36h64a5255_0
jedi                      0.10.2           py36hed927a0_0
jinja2                    2.9.6            py36h10aa3a0_1
jpeg                      9b               vc14h4d7706e_1  [vc14]
jsonschema                2.6.0            py36h7636477_0
jupyter                   1.0.0            py36h422fd7e_2
jupyter_client            5.1.0            py36h9902a9a_0
jupyter_console           5.2.0            py36h6d89b47_1
jupyter_core              4.3.0            py36h511e818_0
jupyterlab                0.27.0           py36h34cc53b_2
jupyterlab_launcher       0.4.0            py36h22c3ccf_0
lazy-object-proxy         1.3.1            py36hd1c21d2_0
libiconv                  1.15             vc14h29686d3_5  [vc14]
libpng                    1.6.32           vc14h5163883_3  [vc14]
libssh2                   1.8.0            vc14hcf584a9_2  [vc14]
libtiff                   4.0.8           vc14h04e2a1e_10  [vc14]
libxml2                   2.9.4            vc14h8fd0f11_5  [vc14]
libxslt                   1.1.29           vc14hf85b8d4_5  [vc14]
llvmlite                  0.20.0                   py36_0
locket                    0.2.0            py36hfed976d_1
lockfile                  0.12.2           py36h0468280_0
lxml                      4.1.0            py36h0dcd83c_0
lzo                       2.10             vc14h0a64fa6_1  [vc14]
markupsafe                1.0              py36h0e26971_1
matplotlib                2.1.0            py36h11b4b9c_0
mccabe                    0.6.1            py36hb41005a_1
menuinst                  1.4.10           py36h42196fb_0
mistune                   0.7.4            py36h4874169_0
mkl                       2018.0.0             h36b65af_4
mkl-service               1.1.2            py36h57e144c_4
mpmath                    0.19             py36he326802_2
msgpack-python            0.4.8            py36h58b1e9d_0
multipledispatch          0.4.9            py36he44c36e_0
navigator-updater         0.1.0            py36h8a7b86b_0
nbconvert                 5.3.1            py36h8dc0fde_0
nbformat                  4.4.0            py36h3a5bc1b_0
networkx                  2.0              py36hff991e3_0
nltk                      3.2.4            py36hd0e0a39_0
nose                      1.3.7            py36h1c3779e_2
notebook                  5.0.0            py36hd9fbf6f_2
numba                     0.35.0             np113py36_10
numexpr                   2.6.2            py36h7ca04dc_1
numpy                     1.13.3           py36ha320f96_0
numpydoc                  0.7.0            py36ha25429e_0
odo                       0.5.1            py36h7560279_0
olefile                   0.44             py36h0a7bdd2_0
openpyxl                  2.4.8            py36hf3b77f6_1
openssl                   1.0.2l           vc14hcac20b0_2  [vc14]
packaging                 16.8             py36ha0986f6_1
pandas                    0.20.3           py36hce827b7_2
pandoc                    1.19.2.1             hb2460c7_1
pandocfilters             1.4.2            py36h3ef6317_1
partd                     0.3.8            py36hc8e763b_0
path.py                   10.3.1           py36h3dd8b46_0
pathlib2                  2.3.0            py36h7bfb78b_0
patsy                     0.4.1            py36h42cefec_0
pep8                      1.7.0            py36h0f3d67a_0
pickleshare               0.7.4            py36h9de030f_0
pillow                    4.2.1            py36hdb25ab2_0
pip                       9.0.1            py36hadba87b_3
pkginfo                   1.4.1            py36hb0f9cfa_1
ply                       3.10             py36h1211beb_0
progress                  1.3              py36hbeca8d3_0
prompt_toolkit            1.0.15           py36h60b8f86_0
psutil                    5.4.0            py36h4e662fb_0
py                        1.4.34           py36ha4aca3a_1
pycodestyle               2.3.1            py36h7cc55cd_0
pycosat                   0.6.3            py36h413d8a4_0
pycparser                 2.18             py36hd053e01_1
pycrypto                  2.6.1            py36he68e6e2_1
pycurl                    7.43.0           py36h086bf4c_3
pyflakes                  1.6.0            py36h0b975d6_0
pygments                  2.2.0            py36hb010967_0
pylint                    1.7.4            py36ha4e6ded_0
pyodbc                    4.0.17           py36h0006bc2_0
pyopenssl                 17.2.0           py36h15ca2fc_0
pyparsing                 2.2.0            py36h785a196_1
pyqt                      5.6.0            py36hb5ed885_5
pysocks                   1.6.7            py36h698d350_1
pytables                  3.4.2            py36h71138e3_2
pytest                    3.2.1            py36h753b05e_1
python                    3.6.3                h9e2ca53_1
python-dateutil           2.6.1            py36h509ddcb_1
pytz                      2017.2           py36h05d413f_1
pywavelets                0.5.2            py36hc649158_0
pywin32                   221              py36h9c10281_0
pyyaml                    3.12             py36h1d1928f_1
pyzmq                     16.0.2           py36h38c27d9_2
qt                        5.6.2           vc14h6f8c307_12  [vc14]
qtawesome                 0.4.4            py36h5aa48f6_0
qtconsole                 4.3.1            py36h99a29a9_0
qtpy                      1.3.1            py36hb8717c5_0
requests                  2.18.4           py36h4371aae_1
rope                      0.10.5           py36hcaf5641_0
ruamel_yaml               0.11.14          py36h9b16331_2
scikit-image              0.13.0           py36h6dffa3f_1
scikit-learn              0.19.1           py36h53aea1b_0
scipy                     0.19.1           py36h7565378_3
seaborn                   0.8.0            py36h62cb67c_0
setuptools                36.5.0           py36h65f9e6e_0
simplegeneric             0.8.1            py36heab741f_0
singledispatch            3.4.0.3          py36h17d0c80_0
sip                       4.18.1           py36h9c25514_2
six                       1.11.0           py36h4db2310_1
snowballstemmer           1.2.1            py36h763602f_0
sortedcollections         0.5.3            py36hbefa0ab_0
sortedcontainers          1.5.7            py36ha90ac20_0
sphinx                    1.6.3            py36h9bb690b_0
sphinxcontrib             1.0              py36hbbac3d2_1
sphinxcontrib-websupport  1.0.1            py36hb5e5916_1
spyder                    3.2.4            py36h8845eaa_0
sqlalchemy                1.1.13           py36h5948d12_0
sqlite                    3.20.1           vc14h7ce8c62_1  [vc14]
statsmodels               0.8.0            py36h6189b4c_0
sympy                     1.1.1            py36h96708e0_0
tblib                     1.3.2            py36h30f5020_0
testpath                  0.3.1            py36h2698cfe_0
tk                        8.6.7            vc14hb68737d_1  [vc14]
toolz                     0.8.2            py36he152a52_0
tornado                   4.5.2            py36h57f6048_0
traitlets                 4.3.2            py36h096827d_0
typing                    3.6.2            py36hb035bda_0
unicodecsv                0.14.1           py36h6450c06_0
urllib3                   1.22             py36h276f60a_0
vc                        14                   h2379b0c_2
vs2015_runtime            14.0.25123           hd4c4e62_2
wcwidth                   0.1.7            py36h3d5aa90_0
webencodings              0.5.1            py36h67c50ae_1
werkzeug                  0.12.2           py36h866a736_0
wheel                     0.29.0           py36h6ce6cde_1
widgetsnbextension        3.0.2            py36h364476f_1
win_inet_pton             1.0.1            py36he67d7fd_1
win_unicode_console       0.5              py36hcdbd4b5_0
wincertstore              0.2              py36h7fe50ca_0
wrapt                     1.10.11          py36he5f5981_0
xlrd                      1.1.0            py36h1cb58dc_1
xlsxwriter                1.0.2            py36hf723b7d_0
xlwings                   0.11.4           py36hd3cf94d_0
xlwt                      1.3.0            py36h1a4751e_0
yaml                      0.1.7            vc14hb31d195_1  [vc14]
zict                      0.1.3            py36h2d8e73e_0
zlib                      1.2.11           vc14h1cdd9ab_1  [vc14]
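A listing this long is easier to query than to read. The sketch below parses text in the column layout shown above into a name-to-version dictionary. It assumes the output has been captured as a string, for example by redirecting it to a file. The function name parse_conda_list is hypothetical, and SAMPLE holds only a few rows copied from the listing.

```python
def parse_conda_list(text):
    """Map package name -> version from `conda list`-style text output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


SAMPLE = """\
# Name                    Version                   Build  Channel
beautifulsoup4            4.6.0            py36hd4cc5e8_1
bzip2                     1.0.6            vc14hdec8e7a_1  [vc14]
lxml                      4.1.0            py36h0dcd83c_0
requests                  2.18.4           py36h4371aae_1
"""

installed = parse_conda_list(SAMPLE)
print(installed.get("lxml"))      # 4.1.0
print("scrapy" in installed)      # False
```

In the full listing above, scrapy is likewise absent, which is why the next step installs it.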
  • Install Scrapy

Scrapy is one of the most commonly used crawler frameworks. The official site gives the following installation command:

conda install -c conda-forge scrapy

However, when I installed it that way, I ran into the following errors:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/libssh2-1.8.0-vc14_2.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/noarch/hyperlink-17.3.1-py_0.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/pydispatcher-2.0.5-py36_0.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/yaml-0.1.7-vc14_0.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/qt-5.6.2-vc14_1.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
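The error text itself suggests that a simple retry may be enough. A retry loop around the install command could look like the sketch below. The helper run_with_retries and its parameters are hypothetical, and the commented-out call is an assumption about how conda would be invoked from Python on a given machine (on Windows it may need the full path to the conda executable).

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay_seconds=5.0, runner=subprocess.run):
    """Run command until it exits with code 0 or attempts run out.

    Returns True on success, False if every attempt failed.
    `runner` is injectable so the loop can be exercised without calling conda.
    """
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            # Wait a little before the next try; intermittent errors may clear up.
            time.sleep(delay_seconds)
    return False


# Possible usage (not executed here):
# run_with_retries(["conda", "install", "-y", "-c", "conda-forge", "scrapy"])
```

Retrying only helps when the failure is intermittent. If the channel is consistently unreachable, every attempt fails the same way.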

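Several packages failed to download. The likely cause is that the conda-forge channel used by the command above was too slow to reach from my network, so the connections failed. I therefore switched to the following approach: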

First, check whether conda offers a scrapy package built for the current Python version:

>conda search scrapy
Loading channels: done
# Name                  Version           Build  Channel
scrapy                   0.16.4          py26_0  pkgs/free
scrapy                   0.16.4          py27_0  pkgs/free
scrapy                   0.24.4          py27_0  pkgs/free
scrapy                    1.0.1          py27_0  pkgs/free
scrapy                    1.0.3          py27_0  pkgs/free
scrapy                    1.1.1          py27_0  pkgs/free
scrapy                    1.1.1          py34_0  pkgs/free
scrapy                    1.1.1          py35_0  pkgs/free
scrapy                    1.1.1          py36_0  pkgs/free
scrapy                    1.3.3          py27_0  pkgs/free
scrapy                    1.3.3          py35_0  pkgs/free
scrapy                    1.3.3          py36_0  pkgs/free
scrapy                    1.4.0  py27h4eaa785_1  pkgs/main
scrapy                    1.4.0  py35h054a469_1  pkgs/main
scrapy                    1.4.0  py36h764da0a_1  pkgs/main
scrapy                    1.5.0          py27_0  pkgs/main
scrapy                    1.5.0          py35_0  pkgs/main
scrapy                    1.5.0          py36_0  pkgs/main
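Picking the right row from such a listing can also be done in code. The sketch below scans text in this column layout for builds whose Build string starts with a given Python tag and returns the one with the highest version. It assumes purely numeric dotted versions, as in the rows above. The function name newest_build_for is hypothetical, and SAMPLE contains only a few rows copied from the listing.

```python
def newest_build_for(search_output, python_tag):
    """Return (version, build, channel) of the highest version whose build
    string starts with python_tag, or None if there is no match."""
    candidates = []
    for line in search_output.splitlines():
        fields = line.split()
        # Skip status lines, the '#' header, and anything without four columns.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        _name, version, build, channel = fields[:4]
        if build.startswith(python_tag):
            # Assumption: versions are purely numeric, e.g. "1.5.0".
            key = tuple(int(part) for part in version.split("."))
            candidates.append((key, version, build, channel))
    if not candidates:
        return None
    _key, version, build, channel = max(candidates)
    return version, build, channel


SAMPLE = """\
Loading channels: done
# Name                  Version           Build  Channel
scrapy                    1.1.1          py34_0  pkgs/free
scrapy                    1.3.3          py36_0  pkgs/free
scrapy                    1.4.0  py36h764da0a_1  pkgs/main
scrapy                    1.5.0          py27_0  pkgs/main
scrapy                    1.5.0          py36_0  pkgs/main
"""

print(newest_build_for(SAMPLE, "py36"))  # ('1.5.0', 'py36_0', 'pkgs/main')
```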

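My Python version is 3.6, and the last row of the list is the newest scrapy build for Python 3.6. It comes from the default pkgs/main channel, so I installed it without the -c conda-forge option: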

>conda install scrapy
Solving environment: done

## Package Plan ##

  environment location: D:\ProgramFiles\Anaconda

  added / updated specs:
    - scrapy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    attrs-17.4.0               |           py36_0          41 KB
    pyasn1-0.4.2               |   py36h22e697c_0         101 KB
    hyperlink-18.0.0           |           py36_0          62 KB
    openssl-1.0.2o             |       h8ea7d77_0         5.4 MB
    pyasn1-modules-0.2.1       |   py36hd1453cb_0          86 KB
    pytest-runner-4.2          |           py36_0          12 KB
    ca-certificates-2018.03.07 |                0         155 KB
    scrapy-1.5.0               |           py36_0         329 KB
    automat-0.6.0              |   py36hc6d8c19_0          67 KB
    constantly-15.1.0          |           py36_0          13 KB
    cssselect-1.0.3            |           py36_0          28 KB
    incremental-17.5.0         |   py36he5b1da3_0          25 KB
    certifi-2018.4.16          |           py36_0         143 KB
    pydispatcher-2.0.5         |           py36_0          18 KB
    ------------------------------------------------------------
                                           Total:         6.4 MB

The following NEW packages will be INSTALLED:

    attrs:            17.4.0-py36_0
    automat:          0.6.0-py36hc6d8c19_0
    constantly:       15.1.0-py36_0
    cssselect:        1.0.3-py36_0
    hyperlink:        18.0.0-py36_0
    incremental:      17.5.0-py36he5b1da3_0
    parsel:           1.4.0-py36_0
    pyasn1:           0.4.2-py36h22e697c_0
    pyasn1-modules:   0.2.1-py36hd1453cb_0
    pydispatcher:     2.0.5-py36_0
    pytest-runner:    4.2-py36_0
    queuelib:         1.5.0-py36_0
    scrapy:           1.5.0-py36_0
    service_identity: 17.0.0-py36_0
    twisted:          17.5.0-py36_0
    w3lib:            1.19.0-py36_0
    zope:             1.0-py36_0
    zope.interface:   4.5.0-py36hfa6e2cd_0

The following packages will be UPDATED:

    ca-certificates:  2017.08.26-h94faf87_0      --> 2018.03.07-0
    certifi:          2017.7.27.1-py36h043bc9e_0 --> 2018.4.16-py36_0
    openssl:          1.0.2l-vc14hcac20b0_2      --> 1.0.2o-h8ea7d77_0

Proceed ([y]/n)? y

After entering y, the installation continues:

Downloading and Extracting Packages
attrs 17.4.0################################################################################################### | 100%
pyasn1 0.4.2################################################################################################### | 100%
hyperlink 18.0.0############################################################################################### | 100%
openssl 1.0.2o################################################################################################# | 100%
pyasn1-modules 0.2.1########################################################################################### | 100%
pytest-runner 4.2############################################################################################## | 100%
ca-certificates 2018.03.07##################################################################################### | 100%
scrapy 1.5.0################################################################################################### | 100%
automat 0.6.0################################################################################################## | 100%
constantly 15.1.0############################################################################################## | 100%
cssselect 1.0.3################################################################################################ | 100%
incremental 17.5.0############################################################################################# | 100%
certifi 2018.4.16############################################################################################## | 100%
pydispatcher 2.0.5############################################################################################# | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

The installation is complete. Finally, run

>conda list

to check whether scrapy was installed successfully.
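Another way to confirm the result is to ask Python itself whether the module can be imported. The sketch below uses only standard-library calls that are available on Python 3.6. The helper name installed_version is hypothetical, and falling back to "unknown" when a module has no __version__ attribute is my own choice.

```python
import importlib
import importlib.util


def installed_version(module_name):
    """Return the module's __version__ string, "unknown" if it has none,
    or None if the module cannot be found at all."""
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("scrapy is not importable from this interpreter")
    else:
        print("scrapy version:", version)
```

Run it from the Anaconda Prompt so that it uses the same interpreter that conda installed the package into.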