One approach to scraping images from Jandan (煎蛋網)
阿新 • Published: 2019-02-04
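Every learning process needs practice projects, and when you learn web scraping you inevitably want to scrape something. Most tutorials online are about scraping images from websites, and image sites generally have their own anti-scraping measures. Yesterday I came across a post about scraping images from Jandan (jandan.net), so I gave it a try.
The anti-scraping mechanism is analysed in detail in the blog post "Python爬蟲(15):煎蛋網加密處理方式", which also explains the approach; it is worth a careful read. My idea here is the same, but I do not port the site's JS function to Python. (I agree that porting is the best solution, but I am not familiar with encryption algorithms and could not follow the code, so I will have to study them next.) Instead I use a shortcut: take the JS decryption function out as-is, build an HTML file around it, feed in the scraped image hashes, and let it decode them back into URLs. Once you have the URLs, downloading is easy however you like; I used Xunlei (迅雷). (To collect the hashes: download every page that contains images and extract each <span class="img-hash">***</span>.)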
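To build the HTML file, I took the two relevant lines from jandan_load_img() and copied jdXFKzuIDxRVqKYQfswJ5elNfow1x0JrJH() verbatim (in the listing below this function is named jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM). I then opened the browser's developer tools, ran the page, watched for errors, and whenever a function was missing I looked it up in the original site's JS and added it. Every function except hex_md5() can be found in the site's own JS. A quick search showed that hex_md5() comes from md5.js, so I also include the md5.js file below. (I had planned to copy just hex_md5(), but md5.js has many settings and helpers, and I could not tell how much else a copy would drag in, so I simply reference md5.js as a whole.)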
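For readers who do want to move toward a Python port, here is a minimal sketch of what hex_md5() corresponds to on the Python side. It assumes ASCII-only input, which is what the decoding code passes in; the Python function name simply mirrors the JS one and is not part of any library. The expected digest for "abc" is the same value that md5.js checks in its own md5_vm_test() self-test.

```python
import hashlib


def hex_md5(text):
    """Return the lowercase hex MD5 digest of an ASCII string.

    Assumption: input is ASCII only, so one character maps to one byte,
    matching md5.js with its default chrsz = 8 setting.
    """
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Same check that md5.js performs in md5_vm_test().
digest = hex_md5("abc")
print(digest)
```

With this in place, the md5() wrapper in the HTML file would need no separate md5.js on the Python side; hashlib ships with the standard library.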
The Python code that scrapes the image hashes is below.
All image hashes are saved to img_hash.txt.
# -*- coding:utf-8 -*-
# Collect the img-hash values from pages 1-40 of jandan.net/ooxx.
import time

import requests
from lxml import etree

urls = ['http://jandan.net/ooxx/page-{}#comments'.format(i) for i in range(1, 41)]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko)'
                         ' Chrome/22.0.1207.1 Safari/537.1'}
img_hash = []
print('Downloading:', end='')
for page, url in enumerate(urls, start=1):
    html = requests.get(url, headers=headers).text
    root = etree.HTML(html)
    # Each image carries its encoded address in <span class="img-hash">.
    for span_img_hash in root.xpath('//span[@class="img-hash"]'):
        img_hash.append(span_img_hash.text)
    print(page, '\t', end='')
    time.sleep(3)  # pause between pages to go easy on the site
print('Download completed!')
# Write the list in literal form so it can be pasted into hashlist in the HTML file.
# Mode 'w' overwrites the file, so repeated runs do not concatenate several lists.
with open('img_hash.txt', 'w') as f:
    f.write(str(img_hash))
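Because the script above writes str(img_hash), the file holds a Python list literal rather than one hash per line. The following is a small sketch of reading it back and rendering it as a JavaScript array literal for pasting. It assumes the file contains exactly one list from a single run; load_hashes and to_js_array are hypothetical helper names introduced here for illustration.

```python
import ast
import json
from pathlib import Path


def load_hashes(path="img_hash.txt"):
    """Parse the list literal written by the scraper and return the hashes.

    Assumption: the file holds a single Python list, e.g. "['abc', 'def']".
    """
    hashes = ast.literal_eval(Path(path).read_text())
    # A span without text would have been stored as None; keep strings only.
    return [h for h in hashes if isinstance(h, str)]


def to_js_array(hashes):
    """Render the hashes as a JavaScript array literal (JSON syntax)."""
    return json.dumps(hashes)
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for reading the file back.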
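The HTML file is as follows:
- get_url() is a function I added; it calls jandan_load_img() with each hash value as the argument
- Open img_hash.txt and copy the hash values into the hashlist variable in get_url()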
<!DOCTYPE html>
<html>
<head>
<title></title>
<script type="text/ecmascript" src="md5.js"></script>
<script type="text/javascript">
function jandan_load_img(e) {
var c = jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM(e, "aPz8sQnzRxiHfhgesalhIBhfKZczglYq");
var a = c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3");
return a
}
var jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM = function(o, y, g) {
var d = o;
var l = "DECODE";
var y = y ? y : "";
var g = g ? g : 0;
var h = 4;
y = md5(y);
var x = md5(y.substr(0, 16));
var v = md5(y.substr(16, 16));
if (h) {
if (l == "DECODE") {
var b = md5(microtime());
var e = b.length - h;
var u = b.substr(e, h)
}
} else {
var u = ""
}
var t = x + md5(x + u);
var n;
if (l == "DECODE") {
g = g ? g + time() : 0;
tmpstr = g.toString();
if (tmpstr.length >= 10) {
o = tmpstr.substr(0, 10) + md5(o + v).substr(0, 16) + o
} else {
var f = 10 - tmpstr.length;
for (var q = 0; q < f; q++) {
tmpstr = "0" + tmpstr
}
o = tmpstr + md5(o + v).substr(0, 16) + o
}
n = o
}
var k = new Array(256);
for (var q = 0; q < 256; q++) {
k[q] = q
}
var r = new Array();
for (var q = 0; q < 256; q++) {
r[q] = t.charCodeAt(q % t.length)
}
for (var p = q = 0; q < 256; q++) {
p = (p + k[q] + r[q]) % 256;
tmp = k[q];
k[q] = k[p];
k[p] = tmp
}
var m = "";
n = n.split("");
for (var w = p = q = 0; q < n.length; q++) {
w = (w + 1) % 256;
p = (p + k[w]) % 256;
tmp = k[w];
k[w] = k[p];
k[p] = tmp;
m += chr(ord(n[q]) ^ (k[(k[w] + k[p]) % 256]))
}
if (l == "DECODE") {
m = base64_encode(m);
var c = new RegExp("=","g");
m = m.replace(c, "");
m = u + m;
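// This assignment overwrites the m built above, so the value returned is
// simply the Base64 decoding of the original input d.
m = base64_decode(d)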
}
return m
};
function md5(a) {
return hex_md5(a)
}
function base64_encode(a) {
return window.btoa(a)
}
function base64_decode(a) {
return window.atob(a)
}
function microtime(b) {
var a = new Date().getTime();
var c = parseInt(a / 1000);
return b ? (a / 1000) : (a - (c * 1000)) / 1000 + " " + c
}
function chr(a) {
return String.fromCharCode(a)
}
function ord(a) {
return a.charCodeAt()
}
function get_url() {
var hashlist = ['Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am93MDQyNGJqMzBpYTB0M3dnMi5qcGc=', 'Ly93dzMuc2luYWltZy5jbi9tdzEwMjQvMDA3M29iNlBneTFmdWpvNWdodGNiZzMwNnkwYW11MHkuZ2lm', 'Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am5teGpqdWdqMzExMTFqazRxcC5qcGc='];
// var urllist = new Array()
var content = '';
// Iterate by index; the original for-in loop leaked an implicit global variable.
for (var i = 0; i < hashlist.length; i++) {
var url = 'http:' + jandan_load_img(hashlist[i]);
// urllist[i] = url;
content += '<a href="'+url+'">'+url+'</a>';
content += '<br>'
}
document.getElementById("content").innerHTML = content;
}
</script>
</head>
<body>
<button onclick="get_url()">click here</button>
<div id="content"></div>
</body>
</html>
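Reading the listing above as shown, the last step inside the decode function is m = base64_decode(d), where d is the untouched input, and jandan_load_img() then swaps the size segment of the sinaimg.cn path for "large". Under that reading, the same result can be sketched directly in Python without a browser. This is only a sketch for the version of the site's JS reproduced here, which may change at any time; decode_img_hash and to_url are hypothetical names.

```python
import base64
import re

# Same pattern as in jandan_load_img(): host, size segment, rest of the path.
SIZE_RE = re.compile(r"(//\w+\.sinaimg\.cn/)(\w+)(/.+\.(gif|jpg|jpeg))")


def decode_img_hash(img_hash):
    """Base64-decode an img-hash and request the 'large' size variant."""
    # atob() accepts unpadded input; b64decode needs the padding restored.
    padded = img_hash + "=" * (-len(img_hash) % 4)
    raw = base64.b64decode(padded).decode("utf-8")
    # JS replace() without the g flag changes only the first match.
    return SIZE_RE.sub(r"\1large\3", raw, count=1)


def to_url(img_hash, scheme="http:"):
    """Prefix the protocol-relative address, as get_url() does."""
    return scheme + decode_img_hash(img_hash)


# First sample hash from the hashlist in the HTML file.
sample = "Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am93MDQyNGJqMzBpYTB0M3dnMi5qcGc="
print(to_url(sample))
```

For the first sample hash this yields an address under http://wx4.sinaimg.cn/large/ ending in .jpg, which can then be handed to any downloader.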
md5.js:
/*
* A JavaScript implementation of the RSA Data Security, Inc. MD5 Message
* Digest Algorithm, as defined in RFC 1321.
* Version 2.1 Copyright (C) Paul Johnston 1999 - 2002.
* Other contributors: Greg Holt, Andrew Kepert, Ydnar, Lostinet
* Distributed under the BSD License
* See http://pajhome.org.uk/crypt/md5 for more info.
*/
/*
* Configurable variables. You may need to tweak these to be compatible with
* the server-side, but the defaults work in most cases.
*/
var hexcase = 0; /* hex output format. 0 - lowercase; 1 - uppercase */
var b64pad = ""; /* base-64 pad character. "=" for strict RFC compliance */
var chrsz = 8; /* bits per input character. 8 - ASCII; 16 - Unicode */
/*
* These are the functions you'll usually want to call
* They take string arguments and return either hex or base-64 encoded strings
*/
function hex_md5(s){ return binl2hex(core_md5(str2binl(s), s.length * chrsz));}
function b64_md5(s){ return binl2b64(core_md5(str2binl(s), s.length * chrsz));}
function str_md5(s){ return binl2str(core_md5(str2binl(s), s.length * chrsz));}
function hex_hmac_md5(key, data) { return binl2hex(core_hmac_md5(key, data)); }
function b64_hmac_md5(key, data) { return binl2b64(core_hmac_md5(key, data)); }
function str_hmac_md5(key, data) { return binl2str(core_hmac_md5(key, data)); }
/*
* Perform a simple self-test to see if the VM is working
*/
function md5_vm_test()
{
return hex_md5("abc") == "900150983cd24fb0d6963f7d28e17f72";
}
/*
* Calculate the MD5 of an array of little-endian words, and a bit length
*/
function core_md5(x, len)
{
/* append padding */
x[len >> 5] |= 0x80 << ((len) % 32);
x[(((len + 64) >>> 9) << 4) + 14] = len;
var a = 1732584193;
var b = -271733879;
var c = -1732584194;
var d = 271733878;
for(var i = 0; i < x.length; i += 16)
{
var olda = a;
var oldb = b;
var oldc = c;
var oldd = d;
a = md5_ff(a, b, c, d, x[i+ 0], 7 , -680876936);
d = md5_ff(d, a, b, c, x[i+ 1], 12, -389564586);
c = md5_ff(c, d, a, b, x[i+ 2], 17, 606105819);
b = md5_ff(b, c, d, a, x[i+ 3], 22, -1044525330);
a = md5_ff(a, b, c, d, x[i+ 4], 7 , -176418897);
d = md5_ff(d, a, b, c, x[i+ 5], 12, 1200080426);
c = md5_ff(c, d, a, b, x[i+ 6], 17, -1473231341);
b = md5_ff(b, c, d, a, x[i+ 7], 22, -45705983);
a = md5_ff(a, b, c, d, x[i+ 8], 7 , 1770035416);
d = md5_ff(d, a, b, c, x[i+ 9], 12, -1958414417);
c = md5_ff(c, d, a, b, x[i+10], 17, -42063);
b = md5_ff(b, c, d, a, x[i+11], 22, -1990404162);
a = md5_ff(a, b, c, d, x[i+12], 7 , 1804603682);
d = md5_ff(d, a, b, c, x[i+13], 12, -40341101);
c = md5_ff(c, d, a, b, x[i+14], 17, -1502002290);
b = md5_ff(b, c, d, a, x[i+15], 22, 1236535329);
a = md5_gg(a, b, c, d, x[i+ 1], 5 , -165796510);
d = md5_gg(d, a, b, c, x[i+ 6], 9 , -1069501632);
c = md5_gg(c, d, a, b, x[i+11], 14, 643717713);
b = md5_gg(b, c, d, a, x[i+ 0], 20, -373897302);
a = md5_gg(a, b, c, d, x[i+ 5], 5 , -701558691);
d = md5_gg(d, a, b, c, x[i+10], 9 , 38016083);
c = md5_gg(c, d, a, b, x[i+15], 14, -660478335);
b = md5_gg(b, c, d, a, x[i+ 4], 20, -405537848);
a = md5_gg(a, b, c, d, x[i+ 9], 5 , 568446438);
d = md5_gg(d, a, b, c, x[i+14], 9 , -1019803690