
Selenium + Chrome browser driver: scraping Baidu Images


On the Baidu Images page, new content is loaded whenever the page is scrolled to the bottom.

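We use Selenium with the Chrome browser driver to execute JavaScript so that the browser keeps loading more of the page, then extract the image URLs from the page source and download the images.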
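
The scroll-and-load idea described above can be sketched on its own. This is a minimal sketch, not the code used in this post: the function name `scroll_until_stable` and its parameters `pause`, `max_rounds` and `sleep` are hypothetical, and it assumes `driver` is any object with a Selenium-style `execute_script()` method. It stops once the page height no longer grows, on the assumption that this means no more content is being loaded.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50, sleep=time.sleep):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing a Selenium-style execute_script() method.
    Returns the number of scrolls that caused new content to appear.
    """
    loaded = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Same JavaScript call as in the post: jump to the bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        sleep(pause)  # give the page time to append new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: assume nothing more was loaded
        last_height = new_height
        loaded += 1
    return loaded
```

With a real browser this would be called as `scroll_until_stable(browser)`. The `max_rounds` cap is there because a page that keeps growing would otherwise never let the loop end.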

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import requests
from lxml import etree
import time
import random
import os

'''
Scrape Baidu Images. Scrolling the page down to the bottom loads new page data.
'''

# Build the request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "winWH=%5E6_1197x581; BDIMGISLOGIN=0; BDqhfp=%E5%9B%BE%E7%89%87%26%260-10-1undefined%26%260%26%261; BIDUPSID=24942ACBA645FE0108AF48B5C2509013; BAIDUID=C05587CE8C62CAB17300AA09BC6820BD:FG=1; PSTM=1528274179; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_PSSID=1440_25810_26459_21103_18559_20928; BDUSS=VNneDRnWTQ3fnVQOWJpTG95Z1RZVnllVzlRSURpWnBMWHlwbGZha2lGZWl3VlpiQUFBQUFBJCQAAAAAAAAAAAEAAAB9W1Rr1MbFzNGnzt7Wub6zAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKI0L1uiNC9bW; PSINO=3; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; cflag=15%3A3; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; firstShowTip=1; indexPageSugList=%5B%22%E5%9B%BE%E7%89%87%22%5D; cleanHistoryStatus=0",
    "Referer": "http://image.baidu.com/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
}

# Create the browser object.
# (executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object.)
browser = webdriver.Chrome(executable_path=r'E:\PycharmProjects\pachong\chromedriver.exe')
# Set the wait timeout (seconds)
wait = WebDriverWait(browser, 20)
# Send the request
browser.get('https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%9B%BE%E7%89%87&oq=%E5%9B%BE%E7%89%87&rsp=-1')

# Set the image download directory
path = './baidupic/'
if not os.path.exists(path):
    os.makedirs(path)

while True:
    # Wait until the last image div on the page has loaded.
    # (Each batch of newly loaded data is added as another imgpaged div.)
    wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@id="imgid"]/div[last()]')))
    # Get the page source
    html = browser.page_source
    html = etree.HTML(html)
    # Get the image URLs
    # img_urls = html.xpath('//div[@id="imgid"]/div[last()]//li/@data-objurl')  # large images
    img_urls = html.xpath('//div[@id="imgid"]/div[last()]//img/@data-imgurl')  # thumbnails
    # print(img_url)
    for img_url in img_urls:
        # Use the last URL segment as the file name (stored under its original name)
        fname = img_url.split('/')[-1]
        try:
            response = requests.get(img_url, headers=headers)
            data = response.content
            with open(os.path.join(path, fname), mode='wb') as f:
                f.write(data)
        except Exception:
            print(img_url, 'download failed')

        # Throttle requests. Downloading is single-threaded and already takes
        # some time per image, so this is commented out for now.
        # time.sleep(2 + random.random() * 1)

    # Scroll to the bottom of the page to load new data (executes JavaScript)
    browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    # The page needs time to load
    time.sleep(5 + random.random() * 1)

    # break

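The script names each file after the last segment of its URL. Two different URLs that end in the same segment would then overwrite each other, and a query string would end up in the file name. One possible way around this, as a sketch only: the helper `filename_for` and its `default_ext` parameter are hypothetical and not part of the original script. It appends a short hash of the full URL to the name.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def filename_for(img_url, default_ext=".jpg"):
    """Build a file name from the URL's last path segment plus a short URL hash."""
    url_path = urlsplit(img_url).path              # drops the query string and fragment
    base = posixpath.basename(url_path) or "image"  # fall back if the path ends in "/"
    stem, ext = posixpath.splitext(base)
    digest = hashlib.sha256(img_url.encode("utf-8")).hexdigest()[:8]
    # Assumption: images without an extension are saved as default_ext.
    return f"{stem}_{digest}{ext or default_ext}"


a = filename_for("http://example.com/a/pic.jpg?x=1")
b = filename_for("http://example.org/b/pic.jpg")
print(a != b)  # different URLs give different names even with the same last segment
```

In the download loop, `fname = filename_for(img_url)` would take the place of the `split('/')` line.
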
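The contents of the request headers were copied from the browser's developer tools. The Host header was removed on purpose: some of Baidu's large images are hosted on other sites, and if Host is set, some of those large images may fail to download.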
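
Dropping Host from a set of copied headers can be sketched as follows. The helper name `without_host` and the sample header values, including the Host value, are assumptions made for illustration. With Host left out, the HTTP client fills it in from each request URL, so images served from other domains get the right value.

```python
def without_host(headers):
    """Return a copy of `headers` with any Host entry removed (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != "host"}


# Hypothetical headers as they might be copied from the browser.
copied = {
    "Host": "image.baidu.com",
    "Referer": "http://image.baidu.com/",
    "Accept-Language": "zh-CN,zh;q=0.9",
}
clean_headers = without_host(copied)
print(sorted(clean_headers))
```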

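The page source shows that each image comes in a large version and a thumbnail, and the two are stored under different URLs.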
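
To illustrate the two kinds of URL, here is a small sketch that collects both from a made-up HTML fragment using only the standard library. The attribute names `data-objurl` (on `li`) and `data-imgurl` (on `img`) come from the XPath expressions in the script. The class `ImageUrlCollector`, the `SAMPLE` markup and the example.com URLs are hypothetical. Unlike the XPath version, this sketch does not restrict the search to the last div under `imgid`.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect large-image and thumbnail URLs from li/img attributes."""

    def __init__(self):
        super().__init__()
        self.large = []  # values of li[data-objurl]
        self.small = []  # values of img[data-imgurl]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("data-objurl"):
            self.large.append(attrs["data-objurl"])
        elif tag == "img" and attrs.get("data-imgurl"):
            self.small.append(attrs["data-imgurl"])


# Made-up markup that only mimics the attributes used in the script.
SAMPLE = """
<div id="imgid"><div>
  <li data-objurl="http://example.com/full/1.jpg"><img data-imgurl="http://example.com/thumb/1.jpg"></li>
  <li data-objurl="http://example.com/full/2.jpg"><img data-imgurl="http://example.com/thumb/2.jpg"></li>
</div></div>
"""

collector = ImageUrlCollector()
collector.feed(SAMPLE)
print(collector.large)
print(collector.small)
```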
