Scraping Weibo social content (two methods: selenium, and calling the ajax interface with requests)
阿新 • Published: 2018-12-30
Overall goal: scrape Weibo post content. Given one entry account, crawl its follow list step by step, then go to each concrete profile link and scrape that individual user's Weibo posts.
# I currently have two approaches: one is simulating the login and scrolling with selenium, the other is analysing the ajax interface to fetch the post data directly. The part that walks the whole follow list and then scrapes each followed account's posts in turn is not fully released here yet; it will be updated later.
Part 1: Scraping with requests
1. Analysis
For a specific user's Weibo profile page, you first need to switch to the "all Weibo" view and then analyse what happens as you scroll down: the later content is loaded via ajax, and inside the response the data field carries the posts as HTML code. Also be aware of Weibo's anti-scraping measures: too many requests will get the account blocked and the endpoint will return 404 errors, so it is advisable to prepare several different cookies for later use.
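To make the cookie point concrete, here is a minimal sketch of a request wrapper that rotates through a small cookie pool and backs off when the endpoint stops returning the expected json. The names cookie_pool and fetch_segment are my own illustration, not part of the original script, and the cookie strings are placeholders.

import itertools
import json
import time
import requests

# hypothetical pool of cookie strings exported from several logged-in accounts
cookie_pool = itertools.cycle([
    "SUB=...; SUBP=...",   # account 1 (placeholder)
    "SUB=...; SUBP=...",   # account 2 (placeholder)
])

def fetch_segment(url):
    """Request one ajax segment, switching cookies and backing off on failure."""
    for _ in range(3):  # at most 3 attempts
        headers = {
            "Cookie": next(cookie_pool),
            "User-Agent": "Mozilla/5.0",
            "X-Requested-With": "XMLHttpRequest",
        }
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            try:
                return json.loads(resp.text)["data"]  # the post HTML lives in "data"
            except (ValueError, KeyError):
                pass  # not the expected json (e.g. blocked), treat as a soft failure
        time.sleep(5)  # back off before retrying with the next cookie
    return None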
Another issue is that a normal Weibo page loads in three segments: the first part appears immediately, and the remaining two parts require two scroll-triggered ajax loads, which you can inspect with F12. A small trick is that the first segment can also be fetched through the same json endpoint by changing the request parameters: just delete everything from the pre_page parameter onwards (including pre_page) and you get the first segment. Moving between pages is likewise done by changing the link parameters: switching pages means changing the page and pre_page numbers, while switching between the json segments of the same page is done with the pagebar value (0 or 1, as in the code below).
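As a concrete illustration of these parameter rules, here is a small helper of my own (build_segment_urls is not in the original code; the domain, id, pl_name and script_uri values are simply copied from the sample profile used in the script below) that builds the three segment URLs for one page.

BASE = ("https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100306"
        "&profile_ftype=1&is_all=1&pl_name=Pl_Official_MyProfileFeed__22"
        "&id=1003061246130430&script_uri=/weidaxun")

def build_segment_urls(page):
    """Return the three ajax URLs that together make up one profile page."""
    first = BASE + "&page={}".format(page)  # no pre_page/pagebar -> first segment
    second = BASE + "&pagebar=0&feed_type=0&page={0}&pre_page={0}".format(page)
    third = BASE + "&pagebar=1&feed_type=0&page={0}&pre_page={0}".format(page)
    return [first, second, third]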
One more trick: the last json segment of the first page carries the pager, from which the total number of Weibo pages can be read; this makes it easy to iterate over all pages later.
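A sketch of that trick, assuming the request1 helper from the full script below and the BASE constant from the previous sketch; following the original script, the page count is simply taken as the number of link elements in the W_pages pager.

# the last segment of page 1 (pagebar=1&page=1&pre_page=1) carries the pager at the bottom
last_segment = request1(BASE + "&pagebar=1&feed_type=0&page=1&pre_page=1")
pager_links = last_segment.xpath("//div[@class='W_pages']//li/a")
total_pages = len(pager_links)  # one <a> per entry in the pager
print("total pages:", total_pages)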
# Result and code
import time
import re
from lxml import etree
import requests
import json


def request1(url):
    headers = {
        # "Cookie": "SINAGLOBAL=7238757845138.87.1528291392417; UOR=,,spr_web_360_hao360_weibo_t001; login_sid_t=bd5a4abe734c091249cdce71379c0348; cross_origin_proto=SSL; Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; TC-V5-G0=866fef700b11606a930f0b3297300d95; _s_tentry=-; Apache=685802145012.8082.1542780237180; ULV=1542780237187:19:3:1:685802145012.8082.1542780237180:1541462062210; TC-Page-G0=cdcf495cbaea129529aa606e7629fea7; WBtopGlobal_register_version=18608f873d5d88f2; SSOLoginState=1542781061; wvr=6; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W50QEC8VdzuOHjXwxjEGser5JpX5K2hUgL.Fo-feo.ceKe4S0M2dJLoIpjLxKqLBoqL1-qLxKqLB.eLB-2LxKqL1KMLB.2t; ALF=1574324177; SCF=ApgoQqG5luyu67rkHic6LidzChLHTIe5EQZgRnsuPrfkK57iJqk723zd_GSb5ZMq2jbGlYvGXkZ6LbJj5PpY6zI.; SUB=_2A2528WQBDeRhGeNL6VsX8S3FzDuIHXVVh9LJrDV8PUNbmtAKLVnXkW9NSQ30mXwLfrcwH1SRYaTHBUXB4ipbEQrL; SUHB=02MvCTyTmQYvsK; un=18514476337; YF-V5-G0=a5a6106293f9aeef5e34a2e71f04fae4; wb_view_log_5529613977=1920*10801",
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded",
        "Cookie": "SINAGLOBAL=7238757845138.87.1528291392417; un=18514476337; UOR=,,login.sina.com.cn; SCF=ApgoQqG5luyu67rkHic6LidzChLHTIe5EQZgRnsuPrfkww0JTtREftUveuuJafUL3dSgYNHTvqTAmG9myhm1k58.; SUHB=09P9Vm5BSmNlLF; _s_tentry=login.sina.com.cn; Apache=3613626040096.6445.1542931074283; ULV=1542931074334:22:6:4:3613626040096.6445.1542931074283:1542888731560; SUBP=0033WrSXqPxfM72wWs9jqgMF55529P9D9W50QEC8VdzuOHjXwxjEGser5JpVF02RSK2XShMce0eN; SUB=_2AkMsq84tdcPxrAZUmvETzWjra4pH-jyffqfbAn7uJhMyAxh77mgtqSVutBF-XJVxcy_VTV1-kjKoQyDwqoPCwTmq; login_sid_t=174e473124cb2d9808b4e8cd5a9739e1; cross_origin_proto=SSL",
        "Host": "weibo.com",
        "Referer": "https://weibo.com/dajiakuishow",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "X-Requested-With": "XMLHttpRequest",
    }
    html = requests.get(url, headers=headers)
    # html.encoding = "utf-8"
    # print(html.text)
    json1 = json.loads(html.text)['data']  # the post HTML is wrapped in the json "data" field
    # print(json1)
    return etree.HTML(json1)
    # return json.loads(html.text)['data']


# starting request: the last segment of page 1, which carries the pager
a = request1("https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100306&profile_ftype=1&is_all=1&pagebar=1&pl_name=Pl_Official_MyProfileFeed__22&id=1003061246130430&script_uri=/weidaxun&feed_type=0&page=1&pre_page=1&domain_op=100306&__rnd=1542871737489")
# aa = etree.HTML(a)
new_html = a.xpath("//div[@class='W_pages']//li/a")  # pager links -> total number of pages
print(new_html)
print(len(new_html))

for i in range(len(new_html)):
    print("Content of page %s:" % (i + 1))
    # first segment: no pre_page / pagebar parameters
    html1 = request1("https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100306&is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page={}&pl_name=Pl_Official_MyProfileFeed__22&id=1003061246130430&script_uri=/weidaxun".format(i + 1))
    print(html1.xpath('//div[@class="WB_text W_f14"]/text()'))
    # second segment: pagebar=0
    html2 = request1('https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100306&profile_ftype=1&is_all=1&pagebar=0&pl_name=Pl_Official_MyProfileFeed__22&id=1003061246130430&script_uri=/weidaxun&feed_type=0&page={}&pre_page={}'.format(i + 1, i + 1))
    print(html2.xpath('//div[@class="WB_text W_f14"]/text()'))
    # third segment: pagebar=1
    html3 = request1('https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100306&profile_ftype=1&is_all=1&pagebar=1&pl_name=Pl_Official_MyProfileFeed__22&id=1003061246130430&script_uri=/weidaxun&feed_type=0&page={}&pre_page={}'.format(i + 1, i + 1))
    print(html3.xpath('//div[@class="WB_text W_f14"]/text()'))
Part 2: Selenium simulated login and scrolling
The main point to get right is how to detect that the page has been scrolled all the way to the bottom. This approach also retrieves the complete Weibo content, it is just much less efficient.
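Besides waiting for the 下一頁 ("next page") link to appear, as the script below does, a common alternative check (not used in the original code, shown only as a sketch) is to compare document.body.scrollHeight before and after each scroll and stop once it no longer grows.

import time

def scroll_to_bottom(driver, pause=2):
    """Scroll until the page height stops growing, i.e. no more ajax segments arrive."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the ajax request time to finish loading
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged -> we are at the bottom
        last_height = new_height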
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
import time
import re
from lxml import etree
import requests
import json

driver = webdriver.Chrome()
driver.set_window_size(1920, 800)
driver.get("http://www.weibo.com")
time.sleep(5)

# fill in and submit the login form
elem_usr = driver.find_element_by_xpath('//*[@id="loginname"]')
print(elem_usr)
elem_usr.send_keys("18514476337")
elem_pwd = driver.find_element_by_xpath('//*[@id="pl_login_form"]/div/div[3]/div[2]/div/input')
elem_pwd.send_keys("******")
elem_sub = driver.find_element_by_xpath('//*[@id="pl_login_form"]/div/div[3]/div[6]/a/span')
elem_sub.click()
time.sleep(3)


def wb_list(url):
    driver.get(url)
    time.sleep(1)
    # driver.execute_script('window.scrollBy(0,1000)')
    t = True
    while t:
        # keep scrolling until the 下一頁 ("next page") link is found,
        # which means all segments of the current page have loaded
        driver.execute_script('window.scrollBy(0,3000)')
        try:
            time.sleep(2)
            # driver.find_element(by=By.LINK_TEXT, value='下一頁').text
            driver.find_element_by_link_text('下一頁')
            # driver.scrollTo(0, document.body.scrollHeight)
            # print(driver.page_source)
            time.sleep(3)
            t = False
        except:
            pass
    return driver.page_source
    # c = driver.page_source
    # cc = re.findall('', c, re.S)
    # driver.find_element_by_link_text('下一頁').click()


aa = wb_list("https://weibo.com/weidaxun?profile_ftype=1&is_all=1#_0")
print(aa)
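wb_list only returns the raw page source; to actually pull out the post text you can feed it to lxml with the same XPath used in Part 1. A small sketch, reusing the aa variable from the script above:

from lxml import etree

tree = etree.HTML(aa)  # aa is the page source returned by wb_list
posts = tree.xpath('//div[@class="WB_text W_f14"]/text()')
for post in posts:
    text = post.strip()
    if text:
        print(text)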