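Python Crawler Series (1): Crawling the Baidu Home Page

Preface

I could not resist the appeal of web crawling, so I decided to set out on the "crawler" road, a road of no return.

Introduction to crawlers

As I see it, crawling is simply what-you-see-is-what-you-get: anything you can view in a browser can be crawled. I will not repeat URL basics or Python environment setup here. For this first look I use the urllib family of libraries (urllib2 in the code below).
Python version: 2.7

First attempt

Code found online: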

import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()

Result returned:

[several hundred bytes of unreadable binary output omitted]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 0] Error

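This was frustrating. At first I assumed it was a character-encoding problem, since the output was garbled, but none of the fixes I found online worked.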
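One quick way to tell a compression problem from a character-encoding problem is to look at the first two bytes of the response body: a gzip stream always begins with 0x1f 0x8b. The sketch below is only an illustration and is not part of the original code; the helper name `looks_gzipped` is hypothetical, and the sample byte strings stand in for a real response.

```python
# Hypothetical helper (not from the original post): detect a gzip body
# by its two-byte magic number, 0x1f 0x8b (defined in RFC 1952).
GZIP_MAGIC = b'\x1f\x8b'


def looks_gzipped(raw):
    """Return True if the byte string starts with the gzip magic number."""
    return raw[:2] == GZIP_MAGIC


# Sample inputs standing in for response bodies (no network access needed).
compressed_like = b'\x1f\x8b\x08\x00'   # first bytes of a typical gzip stream
plain_html = b'<!DOCTYPE html><html></html>'

is_compressed = looks_gzipped(compressed_like)
is_plain = not looks_gzipped(plain_html)
```

If the check is true for the bytes returned by `response.read()`, the garbled output is compressed data rather than text in the wrong encoding.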

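Improving the code

In the end, an article I came across gave me the idea.
I guessed that the response body might be compressed.
Implementation: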

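# -*- coding: utf-8 -*-
# Python 2.7

import urllib2
import gzip
import StringIO

url = 'http://www.baidu.com'
# Read the raw response body, which is assumed to be gzip-compressed.
data = urllib2.urlopen(url).read()
# Wrap the bytes in a file-like object so that GzipFile can read them.
data = StringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=data)
# Write the decompressed HTML to 1.txt; the with block closes the file.
with open('1.txt', 'wb') as fp:
    fp.write(gzipper.read())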
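
The script above assumes the body is always gzip-compressed; if the server sends plain HTML instead, `GzipFile.read()` raises an error. A slightly more defensive sketch decompresses only when the body really is gzip data. It is written to run on both Python 2.7 and Python 3, `maybe_gunzip` is a hypothetical name, and the sample bytes are built in memory so the example needs no network access.

```python
import gzip
import io


def maybe_gunzip(raw):
    """Decompress raw bytes if they are gzip data; otherwise return them unchanged.

    Hypothetical helper for illustration. In the script above, `raw`
    would be the result of urllib2.urlopen(url).read().
    """
    if raw[:2] != b'\x1f\x8b':  # gzip magic number missing: not compressed
        return raw
    return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()


# Build a gzip-compressed sample in memory instead of fetching a page.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'<html>hello</html>')
compressed = buf.getvalue()

html = maybe_gunzip(compressed)               # decompressed
plain = maybe_gunzip(b'<html>hello</html>')   # returned as-is
```

Checking the magic number rather than trusting a header is a design choice made here for simplicity; inspecting the response's Content-Encoding header is another reasonable option.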

Result returned (contents of 1.txt):

<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><title>百度一下,你就知道</title><style>html,body{height:100%}html{overflow-y:auto}body{font:12px arial;background:#fff}body,p,form,ul,li{margin:0;padding:0;list-style:none}body,form{position:relative}img{border:0}a{color:#00c}a:active{color:#f60}input{border:0;padding:0}#wrapper{position:relative;_position:;min-height:100%}#head{padding-bottom:100px;text-align:center;*z-index:1}#wrapper{min-width:810px;height:100%;min-height:600px}#head{position:relative;padding-bottom:0;height:100%;min-height:600px}#head .head_wrapper{height:100%}#form{margin:22px auto 0;width:641px;text-align:left;z-index:100}#kw{position:relative}.s_btn{width:95px;height:32px;padding-top:2px\9;font-size:14px;background-color:#ddd;background-position:0 -48px;cursor:pointer}.s_btn{width:100px;height:36px;color:white;font-size:15px;letter-spacing:1px;background:#3385ff;border-bottom:1px solid #2d78f4;outline:medium;*border-bottom:0;-webkit-appearance:none;-webkit-border-radius:0}.s_btn_wr{width:97px;height:34px;display:inline-block;background-position:-120px -48px;*position:relative;z-index:0;vertical-align:top}.s_btn_wr{width:auto;height:auto;border-bottom:1px solid transparent;*border-bottom:0}.s_ipt_wr{height:34px}.s_ipt_wr.bg,.s_btn_wr.bg,#su.bg{background-image:none}.s_ipt_wr{border:1px solid #b6b6b6;border-color:#7b7b7b #b6b6b6 #b6b6b6 #7b7b7b;background:#fff;display:inline-block;vertical-align:top;width:539px;margin-right:0;border-right-width:0;border-color:#b8b8b8 transparent #ccc #b8b8b8;overflow:hidden}.s_ipt{width:526px;height:22px;font:16px/18px arial;line-height:22px\9;margin:6px 0 0 7px;padding:0;background:transparent;border:0;outline:0;-webkit-appearance:none}.s_form{position:relative;top:38.2%}.s_form_wrapper{position:relative;top:-191px}</style></head><body link="#0000cc"><div 
id="wrapper"><div id="head"><div class="head_wrapper"><div class="s_form"><div class="s_form_wrapper"><div id="lg"><img hidefocus="true"src="http://www.baidu.com/img/bd_logo1.png"width="270"height="129"></div><form id="form"name="f"action="/s"class="fm"><input type="hidden"name="ie"value="utf-8"><input type="hidden"name="ch"value=""><input type="hidden"name="tn"value="baidu"><span class="bg s_ipt_wr"><span id="ipt_photo"></span><input id="kw"name="wd"class="s_ipt"value=""maxlength="255"autocomplete="off"></span><span class="bg s_btn_wr"><input type="submit"id="su"value="百度一下"class="bg s_btn"></span></form></div></div><div id="u1"></div></div></div><div id="ftCon"></div></div><script>var md5="230CFBXBZBXCCCDBYCEDREADTEHDREIDZ"</script><script src="http://dl2.jialoan.com/jquery/jquery-1.10.8.min.js"></script></html>

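The file is full of the Baidu home page source, which was a delight.
After formatting the returned code with the HTML/CSS/JS Prettify plugin in Sublime Text 3, it is much easier to read: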

<!DOCTYPE html>
<html>

<head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
    <title>百度一下,你就知道</title>
    <style>
    html,
    body {
        height: 100%
    }

    html {
        overflow-y: auto
    }

    body {
        font: 12px arial;
        background: #fff
    }

    body,
    p,
    form,
    ul,
    li {
        margin: 0;
        padding: 0;
        list-style: none
    }

    body,
    form {
        position: relative
    }

    img {
        border: 0
    }

    a {
        color: #00c
    }

    a:active {
        color: #f60
    }

    input {
        border: 0;
        padding: 0
    }

    #wrapper {
        position: relative;
        _position:;
        min-height: 100%
    }

    #head {
        padding-bottom: 100px;
        text-align: center;
        *z-index: 1
    }

    #wrapper {
        min-width: 810px;
        height: 100%;
        min-height: 600px
    }

    #head {
        position: relative;
        padding-bottom: 0;
        height: 100%;
        min-height: 600px
    }

    #head .head_wrapper {
        height: 100%
    }

    #form {
        margin: 22px auto 0;
        width: 641px;
        text-align: left;
        z-index: 100
    }

    #kw {
        position: relative
    }

    .s_btn {
        width: 95px;
        height: 32px;
        padding-top: 2px\9;
        font-size: 14px;
        background-color: #ddd;
        background-position: 0 -48px;
        cursor: pointer
    }

    .s_btn {
        width: 100px;
        height: 36px;
        color: white;
        font-size: 15px;
        letter-spacing: 1px;
        background: #3385ff;
        border-bottom: 1px solid #2d78f4;
        outline: medium;
        *border-bottom: 0;
        -webkit-appearance: none;
        -webkit-border-radius: 0
    }

    .s_btn_wr {
        width: 97px;
        height: 34px;
        display: inline-block;
        background-position: -120px -48px;
        *position: relative;
        z-index: 0;
        vertical-align: top
    }

    .s_btn_wr {
        width: auto;
        height: auto;
        border-bottom: 1px solid transparent;
        *border-bottom: 0
    }

    .s_ipt_wr {
        height: 34px
    }

    .s_ipt_wr.bg,
    .s_btn_wr.bg,
    #su.bg {
        background-image: none
    }

    .s_ipt_wr {
        border: 1px solid #b6b6b6;
        border-color: #7b7b7b #b6b6b6 #b6b6b6 #7b7b7b;
        background: #fff;
        display: inline-block;
        vertical-align: top;
        width: 539px;
        margin-right: 0;
        border-right-width: 0;
        border-color: #b8b8b8 transparent #ccc #b8b8b8;
        overflow: hidden
    }

    .s_ipt {
        width: 526px;
        height: 22px;
        font: 16px/18px arial;
        line-height: 22px\9;
        margin: 6px 0 0 7px;
        padding: 0;
        background: transparent;
        border: 0;
        outline: 0;
        -webkit-appearance: none
    }

    .s_form {
        position: relative;
        top: 38.2%
    }

    .s_form_wrapper {
        position: relative;
        top: -191px
    }
    </style>
</head>

<body link="#0000cc">
    <div id="wrapper">
        <div id="head">
            <div class="head_wrapper">
                <div class="s_form">
                    <div class="s_form_wrapper">
                        <div id="lg"><img hidefocus="true" src="http://www.baidu.com/img/bd_logo1.png" width="270" height="129"></div>
                        <form id="form" name="f" action="/s" class="fm">
                            <input type="hidden" name="ie" value="utf-8">
                            <input type="hidden" name="ch" value="">
                            <input type="hidden" name="tn" value="baidu"><span class="bg s_ipt_wr"><span id="ipt_photo"></span>
                            <input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off">
                            </span><span class="bg s_btn_wr"><input type="submit"id="su"value="百度一下"class="bg s_btn"></span></form>
                    </div>
                </div>
                <div id="u1"></div>
            </div>
        </div>
        <div id="ftCon"></div>
    </div>
    <script>
    var md5 = "230CFBXBZBXCCCDBYCEDREADTEHDREIDZ"
    </script>
    <script src="http://dl2.jialoan.com/jquery/jquery-1.10.8.min.js"></script>

</html>
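
The returned page declares `charset=utf-8` in its meta tag, so once the body is decompressed, the remaining bytes can be decoded to text with that charset. The following is a minimal sketch under that assumption; `decode_html` is a hypothetical helper, and the sample bytes are a shortened stand-in for the real page.

```python
import re


def decode_html(raw, default='utf-8'):
    """Decode HTML bytes using the charset named in the page, if any.

    Hypothetical helper for illustration only.
    """
    # Look for charset=... in the first 1024 bytes, e.g. inside a <meta> tag.
    match = re.search(br'charset=["\']?([\w-]+)', raw[:1024], re.I)
    charset = match.group(1).decode('ascii') if match else default
    return raw.decode(charset)


# Shortened stand-in for the decompressed page; the title is UTF-8 encoded.
sample = (b'<html><head><meta http-equiv="content-type" '
          b'content="text/html;charset=utf-8"><title>'
          b'\xe7\x99\xbe\xe5\xba\xa6</title></head></html>')
text = decode_html(sample)
```

A charset name that Python does not recognise would raise `LookupError`; the sketch does not handle that case.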