
Python web scraping + page click events + Selenium browser automation: scraping content from Xuangubao (選股寶)

(I) Installing Python (skip this step if it is already installed)

        1. Downloading Python

                Official Python website: https://www.python.org/

                

              Download the build for your operating system. I am on Windows, so click Windows and pick the installer that matches your system architecture (32-bit or 64-bit).

                

                            Add Python to the environment variables:

                                

                            If the installer did not add them, you can add them manually on your computer.

                                    

(II) Installing the third-party libraries

          Install selenium and pyquery: at the cmd prompt, run pip install selenium (pyquery is installed the same way, with pip install pyquery).

                

          If that reports an error, change into the Scripts directory under your Python installation and run the command there.

              

        If importing the packages in Python's IDLE raises no error, the installation succeeded.
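
        A quick way to check is to run a couple of imports (a minimal sketch; an ImportError means the corresponding package did not install correctly):

import selenium
from pyquery import PyQuery as pq

print(selenium.__version__)      # prints the installed Selenium version
print(pq("<p>ok</p>").text())    # prints "ok" if pyquery is working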

              

(III) Installing the browser and the matching browser driver

      This tutorial uses the Chrome browser; download and install it from Google's official site. After installing, check the Chrome version by clicking "About Google Chrome" (it can usually be found in the browser menu).

                

         My version: 67.0

                                    

        Download the matching ChromeDriver. (If your exact version is not listed, you can extrapolate: roughly every three Chrome versions map to one driver release, e.g. Chrome v65-67 → ChromeDriver v2.38, Chrome v68-70 → v2.39.)

        Download address: http://chromedriver.storage.googleapis.com/index.html

      

     Unzip the downloaded chromedriver.exe and put it in the Python36 directory (or in the Scripts directory under Python36) so that it is on your PATH.

                      Run the following code in IDLE; it should automatically open Chrome (the environment setup is then complete).
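
                      The original screenshot is not reproduced here; a minimal equivalent (assuming chromedriver.exe is on your PATH as described above) looks like this:

from selenium import webdriver

# Launch Chrome through chromedriver. If the driver is not on PATH,
# Selenium 3 also accepts webdriver.Chrome(executable_path="path/to/chromedriver.exe").
browser = webdriver.Chrome()
browser.get("https://www.python.org/")   # any page works for this smoke test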

                     

(IV) Example code walkthrough

          Requirement: scrape from the stock site https://www.xuangubao.cn/ each news item's tag ("利好"/bullish or "利空"/bearish) together with its related stocks, and implement clicking "load more" to fetch additional items.

         

      (1) Open the browser and fetch the news:
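
      Condensed from the full source at the end of this post, the opening step is (the print is only for inspection):

from selenium import webdriver
from pyquery import PyQuery as pq

url = "https://www.xuangubao.cn/"
browser = webdriver.Chrome()     # open the browser
browser.get(url)                 # navigate to the site
html = browser.page_source       # grab the rendered HTML
data = str(pq(html))             # parse with pyquery, then keep it as plain text
print(data)                      # inspect what was actually received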

                

     (2) Page analysis (Selenium provides many locator methods; only a few are used here)

            In the code above, data already holds the entire content of the current page (print it to inspect it); all that remains is to extract the pieces we want.

            In the browser, right-click and choose "Inspect" (or "Inspect element") to analyze the page. (Because the HTML captured in data may differ slightly from what the inspector shows, it is best to print data and derive the regular expressions from it directly.)
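
            One convenient way to do that (a small helper with a hypothetical output filename) is to dump data to a file and build the regular expressions against it:

# Save the rendered HTML so the regexes can be developed against it offline
with open("xuangubao_page.html", "w", encoding="utf-8") as f:
    f.write(data)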

           

           We use a regular expression with re.findall to capture everything that lies between a known opening snippet and a known closing snippet (a pattern of the form start(.*?)end).
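
           For example, the news blocks can be captured with the same pattern the full source uses, i.e. everything between the opening <div class="news-item-container"> and the closing marker:

import re

# re.S lets "." match newlines, so each captured block may span several lines
re_rule = r'<div class="news-item-container">(.*?)<div data-v-00b2e9bc=""/>'
datalist = re.findall(re_rule, data, re.S)
print(len(datalist), "news blocks found")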

                

   Implementing the click: (after clicking, the regexes for "利好" and the related stocks differ from those used on the first page; all later clicks match the same patterns as the first click)

      The available locator methods are (only a few are used here, so they are not introduced in detail; a short sketch of the click itself follows this list):

                 find_element_by_id — use it when you know the element's id attribute; returns the first element whose id value matches.
                 find_element_by_name — use it when you know the element's name attribute; returns the first element whose name value matches.
                 find_element_by_xpath
                 find_element_by_link_text
                 find_element_by_partial_link_text
                 find_element_by_tag_name
                 find_element_by_class_name
                 find_element_by_css_selector
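
      The click itself, taken from the full source below, locates the "load more" button by its class name, clicks it, waits briefly, and re-reads the page:

import time

# Click the "load more" button at the bottom of the news list,
# then give the newly loaded items a moment to render
browser.find_element_by_class_name("home-news-footer").click()
time.sleep(1)
html = browser.page_source   # the page now contains the additional items
data = str(pq(html))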

    The full source code:

#coding=utf-8
from selenium import webdriver
import time
import re
from pyquery import PyQuery as pq


def openurl(url, num):
    browser = webdriver.Chrome()   # open the browser
    browser.get(url)               # navigate to the site
    html = browser.page_source     # get the page's HTML source
    data = str(pq(html))           # str() turns the pyquery object back into readable text

    dic = {}
    re_rule = r'<div class="news-item-container">(.*?)<div data-v-00b2e9bc=""/>'
    datalist = re.findall(re_rule, data, re.S)
    for i in range(0, len(datalist)):
        # "利好" (bullish) tag as it appears on the first page
        rule1 = r'<img src="/img/icon-lihao.png" data-v-6c26747a=""/>(.*?)<!----></span>'
        bullish = re.findall(rule1, datalist[i], re.S)
        if len(bullish) == 0:
            # fall back to the "利空" (bearish) tag
            rule1 = r'<img src="/img/icon-likong.png" data-v-6c26747a=""/>(.*?)</span>'
            bullish = re.findall(rule1, datalist[i], re.S)

        # related stock names
        rule2 = r'<span class="stock-group-item-name" data-v-f97d9694="">(.*?)</span>'
        stock_name = re.findall(rule2, datalist[i], re.S)

        if len(stock_name) > 0 and len(bullish) > 0:
            for j in range(0, len(stock_name)):
                dic[stock_name[j]] = bullish[0]
                print("Scraping item", len(dic) + 1, ", please wait.....")

    done = len(datalist)           # number of news blocks already processed
    if len(dic) < num:
        while True:
            # click "load more" and re-read the page
            browser.find_element_by_class_name("home-news-footer").click()
            time.sleep(1)
            html = browser.page_source
            data = str(pq(html))
            datalist = re.findall(re_rule, data, re.S)
            for i in range(done, len(datalist)):
                # after clicking, the attribute order changes, so the patterns differ
                rule3 = r'<img data-v-6c26747a="" src="/img/icon-lihao.png"/>(.*?)<!----></span>'
                bullish = re.findall(rule3, datalist[i], re.S)
                if len(bullish) == 0:
                    rule5 = r'<img data-v-6c26747a="" src="/img/icon-likong.png"/>(.*?)</span>'
                    bullish = re.findall(rule5, datalist[i], re.S)
                rule4 = r'<span data-v-f97d9694="" class="stock-group-item-name">(.*?)</span>'
                stock_name = re.findall(rule4, datalist[i], re.S)

                if len(stock_name) > 0 and len(bullish) > 0:
                    for j in range(0, len(stock_name)):
                        dic[stock_name[j]] = bullish[0]

            done = len(datalist)
            if len(dic) > num:
                browser.quit()
                print("Scraping finished!!")
                break

            print("Scraping item", len(dic) + 1, ", please wait.....")
    else:
        browser.quit()
        print("Scraping finished!!")

    return dic


url = 'https://www.xuangubao.cn/'
result = openurl(url, 3)
print(result)
#f = open("F:\\text.txt", "a")
#for key, values in result.items():
#    f.write(key + "\t")
#    print(key, values)
#f.close()

Original article: https://blog.csdn.net/weixin_42551465/article/details/80817552