
One Approach to Scraping Images from Jandan (jandan.net)

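Every learning process needs a practice project, and anyone learning web scraping soon wants something to scrape. Most tutorials online cover scraping images from websites, and image sites usually have their own anti-scraping measures. Yesterday I came across a post about scraping images from Jandan (jandan.net), so I gave it a try.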

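The anti-scraping technique is analyzed in detail in the blog post "Python爬蟲(15):煎蛋網加密處理方式" (Python Crawler (15): How Jandan Handles Encryption), which also explains the overall approach; it is worth reading carefully. My approach is the same, but my implementation does not port the site's js function to Python. I still think porting is the best method, but I am not familiar with encryption algorithms and could not follow the code, so that is something I need to study next. Instead I use a shortcut: take the js decoding function out as-is, build an html file around it, put the scraped image hash values in, and let it decode them back into addresses. Once you have the addresses, downloading them is easy; I used Xunlei. (To collect the image hashes: download every page that contains images and extract the value of each <span class="img-hash">***</span> element.)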

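To build the html file, I took the two relevant lines from jandan_load_img() and copied jdXFKzuIDxRVqKYQfswJ5elNfow1x0JrJH() verbatim (in the listing below the decoding function is named jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM). I then opened the browser's developer tools and ran it, watching for errors; whenever a function was missing, I looked for it in the site's js and added it. Every function except hex_md5() can be found in the site's js. A Baidu search showed that hex_md5() lives in md5.js, and I include the whole md5.js file below. (I had meant to copy just hex_md5(), but md5.js has many parameters and I could not tell how much else it would drag in, so I simply reference md5.js directly.)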

First, a screenshot:
[Image: html.png]

The Python code for scraping the image hash values is as follows:

All image hashes are saved to img_hash.txt.

# -*- coding:utf-8 -*-
from lxml import etree
import requests, time

urls = ['http://jandan.net/ooxx/page-{}#comments'.format(i) for i in range(1, 41)]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko)'
                         ' Chrome/22.0.1207.1 Safari/537.1'
}

i = 1  # page counter, used only for progress output
img_hash = []
print('Downloading:', end='')
for url in urls:
    html = requests.get(url, headers=headers).text
    root = etree.HTML(html)
    # Each image on the page is represented by a <span class="img-hash"> element
    span_img_hashs = root.xpath('//span[@class="img-hash"]')
    for span_img_hash in span_img_hashs:
        img_hash.append(span_img_hash.text)
    print(i, '\t', end='')
    i += 1
    time.sleep(3)  # pause between pages

print('Download completed!')
# The whole list is written as its Python string representation
with open('img_hash.txt', 'a') as f:
    f.write(str(img_hash))
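
Because the script writes str(img_hash), img_hash.txt holds the Python representation of a list rather than one hash per line. The following is a minimal sketch of reading it back, assuming the file contains exactly one such list. The file is opened in 'a' mode, so running the scraper more than once appends a second list, and this sketch would then fail. The helper name load_hashes is hypothetical and not part of the original script.

```python
import ast


def load_hashes(path='img_hash.txt'):
    """Read back the list that the scraper wrote with f.write(str(img_hash))."""
    with open(path) as f:
        text = f.read()
    # literal_eval only parses Python literals; it does not execute code.
    hashes = ast.literal_eval(text)
    # A span with no text is stored as None by the scraper; drop such entries.
    return [h for h in hashes if h]
```

The returned list can then be pasted into the html file or processed further in Python.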

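The html file is as follows:

  • get_url() is a function I added; it calls jandan_load_img() with each hash value as the argument
  • Open img_hash.txt and copy the hash values into the hashlist variable in get_url()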
<!DOCTYPE html>
<html>
<head>
    <title></title>
    <script type="text/ecmascript" src="md5.js"></script>
    <script type="text/javascript">
        function jandan_load_img(e) {
            var c = jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM(e, "aPz8sQnzRxiHfhgesalhIBhfKZczglYq");
            var a = c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3");
            return a
        }
        var jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM = function(o, y, g) {
            var d = o;
            var l = "DECODE";
            var y = y ? y : "";
            var g = g ? g : 0;
            var h = 4;
            y = md5(y);
            var x = md5(y.substr(0, 16));
            var v = md5(y.substr(16, 16));
            if (h) {
                if (l == "DECODE") {
                    var b = md5(microtime());
                    var e = b.length - h;
                    var u = b.substr(e, h)
                }
            } else {
                var u = ""
            }
            var t = x + md5(x + u);
            var n;
            if (l == "DECODE") {
                g = g ? g + time() : 0;
                tmpstr = g.toString();
                if (tmpstr.length >= 10) {
                    o = tmpstr.substr(0, 10) + md5(o + v).substr(0, 16) + o
                } else {
                    var f = 10 - tmpstr.length;
                    for (var q = 0; q < f; q++) {
                        tmpstr = "0" + tmpstr
                    }
                    o = tmpstr + md5(o + v).substr(0, 16) + o
                }
                n = o
            }
            var k = new Array(256);
            for (var q = 0; q < 256; q++) {
                k[q] = q
            }
            var r = new Array();
            for (var q = 0; q < 256; q++) {
                r[q] = t.charCodeAt(q % t.length)
            }
            for (var p = q = 0; q < 256; q++) {
                p = (p + k[q] + r[q]) % 256;
                tmp = k[q];
                k[q] = k[p];
                k[p] = tmp
            }
            var m = "";
            n = n.split("");
            for (var w = p = q = 0; q < n.length; q++) {
                w = (w + 1) % 256;
                p = (p + k[w]) % 256;
                tmp = k[w];
                k[w] = k[p];
                k[p] = tmp;
                m += chr(ord(n[q]) ^ (k[(k[w] + k[p]) % 256]))
            }
            if (l == "DECODE") {
                m = base64_encode(m);
                var c = new RegExp("=","g");
                m = m.replace(c, "");
                m = u + m;
                m = base64_decode(d)
            }
            return m
        };
        function md5(a) {
            return hex_md5(a)
        }
        function base64_encode(a) {
            return window.btoa(a)
        }
        function base64_decode(a) {
            return window.atob(a)
        }
        function microtime(b) {
            var a = new Date().getTime();
            var c = parseInt(a / 1000);
            return b ? (a / 1000) : (a - (c * 1000)) / 1000 + " " + c
        }
        function chr(a) {
            return String.fromCharCode(a)
        }
        function ord(a) {
            return a.charCodeAt()
        }
        function get_url() {
            var hashlist = ['Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am93MDQyNGJqMzBpYTB0M3dnMi5qcGc=', 'Ly93dzMuc2luYWltZy5jbi9tdzEwMjQvMDA3M29iNlBneTFmdWpvNWdodGNiZzMwNnkwYW11MHkuZ2lm', 'Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am5teGpqdWdqMzExMTFqazRxcC5qcGc='];
            // var urllist = new Array()
            var content = '';
            for (var hash in hashlist){
                var url = 'http:' + jandan_load_img(hashlist[hash]);
                // urllist[hash] = url;
                content += '<a href="'+url+'">'+url+'</a>';
                content += '<br>'
            }
            document.getElementById("content").innerHTML = content;
        }
    </script>
</head>
<body>
    <button onclick="get_url()">click here</button>
    <div id="content"></div>
</body>
</html>
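
Reading the listing as pasted above, the last statement in the final DECODE branch is m = base64_decode(d), where d is the untouched input, so everything computed before it is discarded. For this particular version of the script, decoding therefore appears to reduce to plain base64 followed by the regex replacement in jandan_load_img(). Below is a minimal Python sketch under that assumption. The name decode_hash is hypothetical, and a different version of the site's script may not behave this way.

```python
import base64
import re

# Same pattern as in jandan_load_img(): swap the size segment for "large".
SIZE_RE = re.compile(r'(//\w+\.sinaimg\.cn/)(\w+)(/.+\.(gif|jpg|jpeg))')


def decode_hash(img_hash):
    """Decode one img-hash value, assuming it is plain base64 of the address."""
    # Restore any stripped '=' padding so b64decode accepts the input.
    padded = img_hash + '=' * (-len(img_hash) % 4)
    # window.atob() yields one character per byte; latin-1 mirrors that.
    address = base64.b64decode(padded).decode('latin-1')
    # Replace only the first match, like String.replace() without the g flag.
    return SIZE_RE.sub(r'\g<1>large\g<3>', address, count=1)


if __name__ == '__main__':
    # First hash from the hashlist in the html file above.
    sample = ('Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am93'
              'MDQyNGJqMzBpYTB0M3dnMi5qcGc=')
    print('http:' + decode_hash(sample))
```

If the assumption holds, this gives the same addresses as the html page without needing a browser or md5.js.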

md5.js:

/*
 * A JavaScript implementation of the RSA Data Security, Inc. MD5 Message
 * Digest Algorithm, as defined in RFC 1321.
 * Version 2.1 Copyright (C) Paul Johnston 1999 - 2002.
 * Other contributors: Greg Holt, Andrew Kepert, Ydnar, Lostinet
 * Distributed under the BSD License
 * See http://pajhome.org.uk/crypt/md5 for more info.
 */
/*
 * Configurable variables. You may need to tweak these to be compatible with
 * the server-side, but the defaults work in most cases.
 */
var hexcase = 0; /* hex output format. 0 - lowercase; 1 - uppercase  */
var b64pad = ""; /* base-64 pad character. "=" for strict RFC compliance */
var chrsz = 8; /* bits per input character. 8 - ASCII; 16 - Unicode  */
/*
 * These are the functions you'll usually want to call
 * They take string arguments and return either hex or base-64 encoded strings
 */
function hex_md5(s){ return binl2hex(core_md5(str2binl(s), s.length * chrsz));}
function b64_md5(s){ return binl2b64(core_md5(str2binl(s), s.length * chrsz));}
function str_md5(s){ return binl2str(core_md5(str2binl(s), s.length * chrsz));}
function hex_hmac_md5(key, data) { return binl2hex(core_hmac_md5(key, data)); }
function b64_hmac_md5(key, data) { return binl2b64(core_hmac_md5(key, data)); }
function str_hmac_md5(key, data) { return binl2str(core_hmac_md5(key, data)); }
/*
 * Perform a simple self-test to see if the VM is working
 */
function md5_vm_test()
{
 return hex_md5("abc") == "900150983cd24fb0d6963f7d28e17f72";
}
/*
 * Calculate the MD5 of an array of little-endian words, and a bit length
 */
function core_md5(x, len)
{
 /* append padding */
 x[len >> 5] |= 0x80 << ((len) % 32);
 x[(((len + 64) >>> 9) << 4) + 14] = len;
 var a = 1732584193;
 var b = -271733879;
 var c = -1732584194;
 var d = 271733878;
 for(var i = 0; i < x.length; i += 16)
 {
 var olda = a;
 var oldb = b;
 var oldc = c;
 var oldd = d;
 a = md5_ff(a, b, c, d, x[i+ 0], 7 , -680876936);
 d = md5_ff(d, a, b, c, x[i+ 1], 12, -389564586);
 c = md5_ff(c, d, a, b, x[i+ 2], 17, 606105819);
 b = md5_ff(b, c, d, a, x[i+ 3], 22, -1044525330);
 a = md5_ff(a, b, c, d, x[i+ 4], 7 , -176418897);
 d = md5_ff(d, a, b, c, x[i+ 5], 12, 1200080426);
 c = md5_ff(c, d, a, b, x[i+ 6], 17, -1473231341);
 b = md5_ff(b, c, d, a, x[i+ 7], 22, -45705983);
 a = md5_ff(a, b, c, d, x[i+ 8], 7 , 1770035416);
 d = md5_ff(d, a, b, c, x[i+ 9], 12, -1958414417);
 c = md5_ff(c, d, a, b, x[i+10], 17, -42063);
 b = md5_ff(b, c, d, a, x[i+11], 22, -1990404162);
 a = md5_ff(a, b, c, d, x[i+12], 7 , 1804603682);
 d = md5_ff(d, a, b, c, x[i+13], 12, -40341101);
 c = md5_ff(c, d, a, b, x[i+14], 17, -1502002290);
 b = md5_ff(b, c, d, a, x[i+15], 22, 1236535329);
 a = md5_gg(a, b, c, d, x[i+ 1], 5 , -165796510);
 d = md5_gg(d, a, b, c, x[i+ 6], 9 , -1069501632);
 c = md5_gg(c, d, a, b, x[i+11], 14, 643717713);
 b = md5_gg(b, c, d, a, x[i+ 0], 20, -373897302);
 a = md5_gg(a, b, c, d, x[i+ 5], 5 , -701558691);
 d = md5_gg(d, a, b, c, x[i+10], 9 , 38016083);
 c = md5_gg(c, d, a, b, x[i+15], 14, -660478335);
 b = md5_gg(b, c, d, a, x[i+ 4], 20, -405537848);
 a = md5_gg(a, b, c, d, x[i+ 9], 5 , 568446438);
 d = md5_gg(d, a, b, c, x[i+14], 9 , -1019803690