
Scraping All of a User's Taobao Purchase Orders


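Risk control and personal credit scoring rely on data mining, and the first step is to scrape the consumption records. There are many other items to collect as well, including shipping addresses, favorited items, the quick-refund limit, the Zhima Credit score, and the bound phone number. The data has to be scraped before it can be analyzed.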

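Requesting Taobao's login endpoint directly is not feasible, because the encryption rules for the POST parameters are unknown (big companies do security well). Instead, use selenium to drive a browser through the login, take the cookies from the driver, and then have requests carry those cookies to fetch the orders. Scraping everything with selenium would undoubtedly be slow, so selenium only handles the login.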
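The cookie handoff described above can be sketched in isolation. This is a minimal illustration, assuming the cookie list has the shape that selenium's driver.get_cookies() returns (a list of dicts with "name" and "value" keys); the helper names and the sample values are made up for the example.

```python
def cookies_to_dict(selenium_cookies):
    """Flatten selenium's list of cookie dicts into a {name: value} mapping.

    A plain dict like this can be passed to requests via the `cookies=` argument.
    """
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def cookies_to_header(cookie_dict):
    """Serialize the mapping into a single Cookie header string."""
    return "; ".join("{}={}".format(name, value) for name, value in cookie_dict.items())


# Made-up sample in the shape of driver.get_cookies(); not real Taobao cookies.
sample = [
    {"name": "thw", "value": "cn", "domain": ".taobao.com", "path": "/"},
    {"name": "t", "value": "abc123", "domain": ".taobao.com", "path": "/"},
]
cookie_dict = cookies_to_dict(sample)
cookie_header = cookies_to_header(cookie_dict)
```

The dict form suits requests.post(..., cookies=cookie_dict), while the string form is what would go into a "cookie" header when no browser login is involved.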

Here is the code.

# coding=utf-8
import json
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')


class Taobao(object):
    def __init__(self, name, password):
        self.name = name
        self.password = password
        self.login_url = 'https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F'
        self.order_url = 'https://buyertrade.taobao.com/trade/itemlist/asyncBought.htm?action=itemlist/BoughtQueryAction&event_submit_do_query=1&_input_charset=utf8'
        self.num = 0   # number of orders seen so far
        self.cost = 0  # running total of actualFee, in yuan

    def login(self):
        # To use the PhantomJS browser instead, use this:
        # from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
        # dcap = dict(DesiredCapabilities.PHANTOMJS)
        # dcap["phantomjs.page.settings.userAgent"] = USER_AGENT  # (random.choice(agents))
        # dcap["phantomjs.page.settings.loadImages"] = True
        # driver = webdriver.PhantomJS(executable_path='C:\\Python27\\phantomjs.exe', desired_capabilities=dcap)
        driver = webdriver.Chrome()
        driver.get(self.login_url)
        # Switch from QR-code login to the username/password form.
        driver.find_element(By.ID, 'J_Quick2Static').click()
        WebDriverWait(driver, 30, 0.5).until(EC.presence_of_element_located((By.ID, 'TPL_username_1')))
        driver.find_element(By.ID, 'TPL_username_1').send_keys(self.name)
        driver.save_screenshot('1.jpg')  # screenshots are advisable with the headless PhantomJS browser
        driver.find_element(By.ID, 'TPL_password_1').send_keys(self.password)
        driver.save_screenshot('2.jpg')
        driver.find_element(By.ID, 'J_SubmitStatic').click()
        time.sleep(10)  # give the login redirect time to finish
        driver.save_screenshot('3.jpg')
        # Copy the browser's cookies into a plain dict for requests.
        self.cookies = {}
        for dictx in driver.get_cookies():
            self.cookies[dictx['name']] = dictx['value']
        driver.quit()

    def get_orders(self, p, flag):
        # flag == 0: first call (log in, fetch page 1, then walk the remaining pages)
        # flag == 1: follow-up call that fetches a single page
        if flag == 0:
            self.login()
            print(self.cookies)
        datax = {'pageNum': p + 1, 'pageSize': 15, 'prePageNo': p}
        header = {
            # origin and referer are required; without them no order data is returned
            'origin': 'https://buyertrade.taobao.com',
            'referer': 'https://buyertrade.taobao.com/trade/itemlist/list_bought_items.htm',
            'user-agent': USER_AGENT,
            'x-requested-with': 'XMLHttpRequest',
            # Without a browser login, the cookie can be sent as a string in the headers:
            # 'cookie': 'miid=387872062667523128; thw=cn;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.....',
        }
        resp = requests.post(self.order_url, data=datax, cookies=self.cookies, headers=header)
        # resp = requests.post(self.order_url, data=datax, headers=header)
        # print(resp.content.decode('gbk'))
        orders_dictx = json.loads(resp.content.decode('gbk'))
        pages = orders_dictx['page']['totalPage']
        for order in orders_dictx['mainOrders']:
            self.num += 1
            self.cost += float(order['payInfo']['actualFee'])
            print(self.num, order['subOrders'][0]['itemInfo']['title'],
                  'price:', order['payInfo']['actualFee'],
                  'yuan, status:', order['statusInfo']['text'], self.cost)
        if flag == 0:
            # Page 1 was fetched above; each p requests pageNum = p + 1,
            # so range(1, pages) covers pages 2..totalPage.
            for p in range(1, pages):
                self.get_orders(p, 1)


if __name__ == "__main__":
    tb = Taobao('[email protected]', '123xxxxxxxx')
    tb.get_orders(0, 0)
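The paging logic in get_orders can be checked separately from the network calls. The sketch below assumes only what the script itself sends (the form fields pageNum, pageSize, and prePageNo, with prePageNo one less than pageNum); the function names are hypothetical, and whether the server requires every field is not confirmed here.

```python
def page_form(page_num, page_size=15):
    """Form fields for requesting the 1-based page `page_num`."""
    return {
        "pageNum": page_num,
        "pageSize": page_size,
        "prePageNo": page_num - 1,  # the page visited just before this one
    }


def remaining_pages(total_pages):
    """Pages still to fetch after page 1 has already been requested."""
    return range(2, total_pages + 1)


first = page_form(1)
later = [page_form(n) for n in remaining_pages(3)]
```

Keeping the page arithmetic in one place makes it easier to see that every page from 1 to totalPage is requested exactly once.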

The orders scraped after running the script:

[Figure: screenshot of the scraped orders]

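Many items need to be scraped, and purchased items are only one of them. The account name and password should be passed in through an interface that triggers the crawler. Each kind of data is then saved for use in data mining.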
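One possible way to save the different kinds of scraped data is a single JSON Lines file in which every row is tagged with its kind. This is only a sketch under assumptions: the post does not say how the data is stored, and the function names, the "kind" tag, the file name, and the sample records are all invented for illustration.

```python
import io
import json
import os
import tempfile


def append_records(path, kind, records):
    """Append records to a JSON Lines file, tagging each row with its kind."""
    with io.open(path, "a", encoding="utf-8") as fh:
        for record in records:
            row = {"kind": kind, "data": record}
            # ensure_ascii=False keeps Chinese item titles readable in the file
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_records(path, kind):
    """Read back only the records of one kind."""
    matches = []
    with io.open(path, encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            if row["kind"] == kind:
                matches.append(row["data"])
    return matches


# Demo with made-up records written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "profile.jsonl")
append_records(demo_path, "orders", [{"title": "sample item", "actualFee": "12.50"}])
append_records(demo_path, "addresses", [{"city": "sample city"}])
orders = load_records(demo_path, "orders")
```

An append-only file like this lets each scraper (orders, addresses, favorites) write independently, and the mining step can filter by kind later.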

According to the tally, I have made 205 purchases on Taobao and spent 28,613.53 yuan.
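A tally like this comes from counting the entries in mainOrders and summing payInfo.actualFee across all pages. The sketch below reuses those field names from the script, but the payloads are made up, and it swaps in Decimal for the script's float accumulation so that currency amounts add up exactly.

```python
from decimal import Decimal


def summarize(pages):
    """Return (order_count, total_spent) over a list of decoded response payloads."""
    count = 0
    total = Decimal("0")
    for payload in pages:
        for order in payload.get("mainOrders", []):
            count += 1
            # Go through str() so that float inputs do not carry binary rounding error.
            total += Decimal(str(order["payInfo"]["actualFee"]))
    return count, total


# Made-up payloads with only the fields the tally needs.
sample_pages = [
    {"mainOrders": [{"payInfo": {"actualFee": "12.50"}},
                    {"payInfo": {"actualFee": "99.90"}}]},
    {"mainOrders": [{"payInfo": {"actualFee": "0.10"}}]},
]
count, total = summarize(sample_pages)
print(count, total)
```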
