Scraping images from Qiushibaike (working as of 2016/10/23)

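The only thing to get right is telling apart the folder that holds user avatars from the folder that holds the post pictures.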

Avatar

<div class="article block untagged mb15" id='qiushi_tag_117810314'>

<div class="author clearfix">
<a href="/users/22028925/" target="_blank" rel="nofollow">
<img src="http://pic.qiushibaike.com/system/avtnew/2202/22028925/medium/2016100101212195.JPEG" alt="紅顏一笑醉心絃~"/>
</a>
<a href="/users/22028925/" target="_blank" title="紅顏一笑醉心絃~">
<h2>紅顏一笑醉心絃~</h2>
</a>
<div class="articleGender manIcon">99</div>
</div>

The actual picture
<div class="thumb">

<a href="/article/117810314" target="_blank">
<img src="http://pic.qiushibaike.com/system/pictures/11781/117810314/medium/app117810314.jpg" alt="隔著螢幕都聽到它沉重的喘氣聲" />
</a>

</div>

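Avatar URLs live under avtnew and post pictures live under pictures, so a regular expression on the src attribute is enough to separate them (mine is fairly crude).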
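To make the avtnew/pictures distinction concrete, here is a minimal sketch that applies the same kind of pattern to the two src values from the HTML excerpts above. The helper name `is_post_picture` is made up for illustration; it is not part of the original script.

```python
import re

# Same idea as the script's pattern: only URLs under /system/pictures count.
PICTURE_RE = re.compile(r"http://pic\.qiushibaike\.com/system/pictures.*\.jpg")

def is_post_picture(src):
    """Hypothetical helper: True for a post picture, False for an avatar."""
    return PICTURE_RE.search(src) is not None

# The two src values copied from the HTML excerpts.
avatar = "http://pic.qiushibaike.com/system/avtnew/2202/22028925/medium/2016100101212195.JPEG"
picture = "http://pic.qiushibaike.com/system/pictures/11781/117810314/medium/app117810314.jpg"

print(is_post_picture(avatar))   # False
print(is_post_picture(picture))  # True
```

Passing a compiled pattern as the src filter to BeautifulSoup performs the same check on every img tag in the page.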
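from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
import re
import os

# Send a browser-like User-Agent with the page request.
H = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "http://www.qiushibaike.com/imgrank/page/5/?s=4922922"
req = Request(url=url, headers=H)
html = urlopen(req)
src = BeautifulSoup(html, "html.parser")
# Keep only <img> tags whose src is under /system/pictures; avatars under /system/avtnew are skipped.
a = src.find_all("img", {"src": re.compile(r"http://pic\.qiushibaike\.com/system/pictures.*\.jpg")})

# Create the output folder (os.path.join keeps the path portable across operating systems)
save_dir = os.path.join(os.getcwd(), "pic")
os.makedirs(save_dir, exist_ok=True)

# Save the pictures as 1.jpg, 2.jpg, ...
for x, i in enumerate(a, start=1):
    path = i["src"]
    urlretrieve(path, os.path.join(save_dir, '%s.jpg' % x))  # download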
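
The loop above numbers the files in download order, so running it on another page overwrites 1.jpg, 2.jpg and so on. One possible alternative, sketched here as an assumption and not as part of the original script, is to reuse the file name already present in the picture URL. The helper name `filename_from_url` is hypothetical.

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Hypothetical helper: return the last path component of a URL."""
    return os.path.basename(urlparse(url).path)

# The picture URL from the HTML excerpt.
src = "http://pic.qiushibaike.com/system/pictures/11781/117810314/medium/app117810314.jpg"

print(filename_from_url(src))  # app117810314.jpg

# The download target would then be built like this:
target = os.path.join("pic", filename_from_url(src))
```

Because the site-assigned name contains the article id, pictures from different pages would not collide with each other.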