
Python 2.7 image-download crawler

Some notes from writing an image crawler:

1. First go to the page that holds the images you want to download and check which URL the page actually requests (I use the Google Chrome browser).


2. Click the image you want to download and inspect exactly where it sits in the page source (this makes it easier to locate the img link).

3. Once you have found it, you can start writing the code.

4. The main difficulty is pinpointing the exact location of the img src="" attribute, which takes a regular-expression (or BeautifulSoup) search.
Anyone not yet comfortable with regular expressions or BeautifulSoup can refer to these two videos:
BeautifulSoup: https://www.youtube.com/watch?v=KLq0W1wUVmw&index=3&list=PLXO45tsB95cIuXEgV-mvYWRd_hVC43Akk
Regular expressions: https://www.youtube.com/watch?v=l1MAW1z641E
5. Once the search succeeds, download the images to a local folder.
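As a sketch of step 4, here is how a regular expression can pull the img src values out of an HTML snippet. The markup and URLs below are hypothetical, modeled on the `<a class="img_btn"><img src="...">` structure the page uses:

```python
import re

# Hypothetical markup, modeled on the page's <a class="img_btn"> photo links
html = '''
<a class="img_btn"><img src="http://image.ngchina.com.cn/2018/1001.jpg"></a>
<a class="img_btn"><img src="http://image.ngchina.com.cn/2018/1002.jpg"></a>
'''

# Match src="..." inside each <img> tag and capture just the URL
srcs = re.findall(r'<img[^>]+src="([^"]+)"', html)
print(srcs)
```

BeautifulSoup does the same job with less fragility on messy HTML, which is why the scripts below use it, but a regex like this is handy for quick checks in the browser console or a REPL.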

Below is the code I wrote myself.

The unimproved version:

#coding=utf-8
import requests
import os
from bs4 import BeautifulSoup

url = "http://www.ngchina.com.cn/magazine/2018/10/1337.html"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

# Every photo on the page sits inside an <a class="img_btn"> element
all_img = soup.find_all('a', {'class': 'img_btn'})

root = "C:/img222/"
if not os.path.isdir(root):          # makedirs raises OSError if the folder exists
    os.makedirs(root, mode=0o777)

for a_tag in all_img:
    imgs = a_tag.find_all('img')

    for img in imgs:
        img_url = img['src']

        r = requests.get(img_url, stream=True)   # stream=True: don't load the whole image into memory
        path = root + img_url.split('/')[-1]     # file name = last segment of the URL
        try:
            with open(path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=100):
                    f.write(chunk)
            print path
        except IOError as e:                     # a bare except would hide real bugs
            print "ERROR:", e
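The download loop above does two things per image: derive a local file name from the URL with `split('/')[-1]`, and stream the body to disk in fixed-size chunks via `iter_content`. A minimal sketch of both, simulated without the network (the URL and byte payload are made up for illustration):

```python
import io

# Hypothetical image URL; the file name is just the last path segment
img_url = "http://image.ngchina.com.cn/2018/10/photo.jpg"
filename = img_url.split('/')[-1]

body = b"\xff\xd8" + b"\x00" * 298  # stand-in for 300 bytes of image data
chunk_size = 100                    # write 100 bytes at a time, like the loop above

# BytesIO stands in for the open file; the chunked loop keeps memory flat
# no matter how large the image is
buf = io.BytesIO()
for start in range(0, len(body), chunk_size):
    buf.write(body[start:start + chunk_size])

print(filename, len(buf.getvalue()))
```

This is exactly why `stream=True` matters: without it, requests reads the entire response into memory before you ever touch it.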

The improved version:

#coding=utf-8
import requests
import os
from bs4 import BeautifulSoup

def get_url(url):
    # A browser-like User-Agent plus the page itself as Referer keeps the
    # image server from rejecting the request as hotlinking
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
        "referer": "http://www.ngchina.com.cn/magazine/2018/10/1337.html"
    }
    res = requests.get(url, headers=headers)
    return res


def main():
    url = "http://www.ngchina.com.cn/magazine/2018/10/1337.html"
    res = get_url(url)
    html = res.text
    soup = BeautifulSoup(html, 'lxml')

    all_imgs = soup.find_all('a', {'class': 'img_btn'})

    root = "C:/img222/"
    if not os.path.isdir(root):      # create the download folder once, up front
        os.makedirs(root)

    for a_tag in all_imgs:
        imgs = a_tag.find_all('img')
        for img in imgs:
            img_url = img['src']
            r = requests.get(img_url, stream=True)
            path = root + img_url.split('/')[-1]
            try:
                with open(path, "wb") as f:
                    for chunk in r.iter_content(chunk_size=128):
                        f.write(chunk)
                print path
            except IOError as e:     # report the failed file instead of crashing
                print "ERROR:", e

if __name__ == "__main__":
    main()
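One detail worth getting right: on Python 2.7, `os.makedirs` raises `OSError` when the folder already exists, so re-running the script can crash at startup. A reusable guard that swallows only that specific error, sketched with a temporary directory instead of the real C:/img222/ path:

```python
import errno
import os
import tempfile

def ensure_dir(path):
    # Create the folder, tolerating "already exists" but nothing else
    # (on Python 3.2+ you could simply use os.makedirs(path, exist_ok=True))
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise

# Hypothetical demo path; tempfile keeps the example self-contained
demo = os.path.join(tempfile.gettempdir(), "img222_demo")
ensure_dir(demo)  # creates the folder on the first call
ensure_dir(demo)  # second call is a harmless no-op instead of a crash
print(os.path.isdir(demo))
```

Checking `errno` is safer than checking `os.path.isdir` before creating, because the latter leaves a race window if the folder appears between the check and the call.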