「Docker in Practice」Python with Docker - Scraping Douyin Web Share Data (19)
Douyin scraping in practice: why scrape this data at all? Take an internet fresh-food e-commerce company as an example. Its owner wants to buy advertising on high-traffic platforms to raise product exposure, and Douyin, with its huge traffic, looks like a promising channel, so he wants to test whether the spend would actually pay off. To judge that, the company analyzes Douyin data and user profiles to see how well Douyin's audience matches its own customers, which requires each account's follower count, like count, following count, and nickname. Knowing what users like, the company can work its products into videos and promote them more effectively. PR agencies also use this kind of data to spot up-and-coming influencers and package them for marketing. Source code: https://github.com/limingios/dockerpython.git (douyin)
The Douyin share page
- Introduction
>https://www.douyin.com/share/user/&lt;user ID&gt;. The user IDs come from the txt file in the source repository; opening such a link brings up the corresponding web share page, and the basic profile information is scraped from that page.
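For a quick sanity check, the share URLs can be built straight from the ID file; a minimal sketch, assuming the file is the douyin_hot_id.txt shipped with the repository (the same file the MongoDB script reads later):

```python
# Minimal sketch: build share-page URLs from the IDs in the repository's txt file.
with open('douyin_hot_id.txt', 'r') as f:
    share_ids = [line.strip() for line in f if line.strip()]

for share_id in share_ids[:3]:
    # each ID maps to one web share page
    print('https://www.douyin.com/share/user/' + share_id)
```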
- Install the Chrome XPath Helper extension
>The crx file is included in the source repository.
In Chrome, open chrome://extensions/.
Drag xpath-helper.crx directly onto the chrome://extensions/ page.
After it installs, press Ctrl+Shift+X to toggle XPath Helper; it is normally used together with Chrome's F12 developer tools.
Scraping the Douyin share page data with Python
Analyze the share page https://www.douyin.com/share/user/76055758243
1. Douyin has an anti-scraping measure: the digits on the page (the Douyin ID and the counters) are rendered with a custom icon font, so each digit appears in the HTML as special glyph strings that have to be replaced with the real digit. The mapping used is:
```python
# Each 'name' list holds the icon-font glyph strings that render as this digit
# (the actual glyph characters do not display here).
regex_list = [
    {'name': [' ', ' ', ' '], 'value': 0},
    {'name': [' ', ' ', ' '], 'value': 1},
    {'name': [' ', ' ', ' '], 'value': 2},
    {'name': [' ', ' ', ' '], 'value': 3},
    {'name': [' ', ' ', ' '], 'value': 4},
    {'name': [' ', ' ', ' '], 'value': 5},
    {'name': [' ', ' ', ' '], 'value': 6},
    {'name': [' ', ' ', ' '], 'value': 7},
    {'name': [' ', ' ', ' '], 'value': 8},
    {'name': [' ', ' ', ' '], 'value': 9},
]
```
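A minimal sketch of how this substitution works is shown below; the glyph code points in GLYPH_MAP are purely hypothetical stand-ins for the real icon-font characters in the mapping table above:

```python
import re

# Hypothetical glyph code points, for illustration only; the real icon-font
# characters are the ones listed in the mapping table above.
GLYPH_MAP = {'\ue602': '1', '\ue605': '4'}


def decode_digits(html_text, glyph_map):
    # Replace every icon-font glyph with the digit it renders as, so IDs and
    # counters become plain numbers that can be parsed normally.
    for glyph, digit in glyph_map.items():
        html_text = re.sub(glyph, digit, html_text)
    return html_text


print(decode_digits('\ue602\ue605', GLYPH_MAP))  # -> '14'
```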
2. Get the XPath expression for each node we need:
```
# Nickname
//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()
# Douyin ID
//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()
# Occupation / verification info
//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()
# Description
//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()
# Location
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()
# Zodiac sign
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()
# Following count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()
# Follower count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()
# Like count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()
```
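Each expression can also be verified from Python before writing the full crawler; a minimal sketch using requests and lxml against the sample share page above (only the nickname XPath is shown):

```python
import requests
from lxml import etree

# Download the sample share page and evaluate the nickname XPath verified with XPath Helper.
url = 'https://www.douyin.com/share/user/76055758243'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'}
page = requests.get(url, headers=headers).text

tree = etree.HTML(page)
nickname = tree.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")
print(nickname)  # a one-element list when the node exists, an empty list otherwise
```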
- Full code
```python
import re
import requests
import time
from lxml import etree


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'抖音ID:')
    # Map each icon-font glyph back to the digit it represents
    # (each 'name' list holds the glyph strings for that digit).
    regex_list = [
        {'name': [' ', ' ', ' '], 'value': 0},
        {'name': [' ', ' ', ' '], 'value': 1},
        {'name': [' ', ' ', ' '], 'value': 2},
        {'name': [' ', ' ', ' '], 'value': 3},
        {'name': [' ', ' ', ' '], 'value': 4},
        {'name': [' ', ' ', ' '], 'value': 5},
        {'name': [' ', ' ', ' '], 'value': 6},
        {'name': [' ', ' ', ' '], 'value': 7},
        {'name': [' ', ' ', ' '], 'value': 8},
        {'name': [' ', ' ', ' '], 'value': 9},
    ]
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)

    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except:
        pass  # not every account has verification info
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    # Follower count: the digits come from the icon font, the 'w' (万, 10,000) unit sits in a separate span
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        # the scraped digits include one decimal place, so divide by 10 and re-attach the unit
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url
    print(douyin_info)


def handle_douyin_web_share(share_id):
    share_web_url = 'https://www.douyin.com/share/user/' + share_id
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, share_id)


if __name__ == '__main__':
    while True:
        share_id = "76055758243"
        if share_id is None:
            print('Current task: %s' % share_id)
            break
        else:
            print('Current task: %s' % share_id)
            handle_douyin_web_share(share_id)
            time.sleep(2)
```
mongodb
Create a virtual machine with Vagrant and run MongoDB in it; for the details, see
「Docker in Practice」Docker crawler techniques with Python - app scraping with Python scripts (13)
>https://hub.docker.com/r/bitnami/mongodb
>Default port: 27017

```bash
su -          # password: vagrant
docker pull bitnami/mongodb:latest
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest
# turn off the firewall so port 27017 is reachable from the host
systemctl stop firewalld
```
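Before running the crawler it is worth confirming that the container is reachable from Python; a minimal sketch, assuming the VM's address is 192.168.66.100 (the host used by handle_mongo.py below):

```python
import pymongo

# Connect to the MongoDB container inside the Vagrant VM and issue a ping.
client = pymongo.MongoClient(host='192.168.66.100', port=27017, serverSelectionTimeoutMS=3000)
print(client.admin.command('ping'))   # {'ok': 1.0} when the server is reachable
print(client.list_database_names())   # the 'douyin' database shows up after the first insert
```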
- Working with MongoDB
>Read the txt file to get the user IDs.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019/1/30 19:35
# @Author: Aries
# @Site:
# @File: handle_mongo.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    # load the share IDs from the txt file into the task_id collection
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
        for i in f_read:
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            task_id_collections.insert(task_info)


def handle_get_task():
    # fetch one task and remove it from the queue in a single operation
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


# handle_init_task()
```
- Modify the Python program to pull tasks from MongoDB
```python
import re
import requests
import time
from lxml import etree
from handle_mongo import handle_get_task
from handle_mongo import handle_insert_douyin


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'抖音ID:')
    # Map each icon-font glyph back to the digit it represents
    # (each 'name' list holds the glyph strings for that digit).
    regex_list = [
        {'name': [' ', ' ', ' '], 'value': 0},
        {'name': [' ', ' ', ' '], 'value': 1},
        {'name': [' ', ' ', ' '], 'value': 2},
        {'name': [' ', ' ', ' '], 'value': 3},
        {'name': [' ', ' ', ' '], 'value': 4},
        {'name': [' ', ' ', ' '], 'value': 5},
        {'name': [' ', ' ', ' '], 'value': 6},
        {'name': [' ', ' ', ' '], 'value': 7},
        {'name': [' ', ' ', ' '], 'value': 8},
        {'name': [' ', ' ', ' '], 'value': 9},
    ]
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)

    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except:
        pass  # not every account has verification info
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url
    print(douyin_info)
    handle_insert_douyin(douyin_info)


def handle_douyin_web_share(task):
    share_web_url = 'https://www.douyin.com/share/user/' + task["share_id"]
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, task["share_id"])


if __name__ == '__main__':
    while True:
        task = handle_get_task()
        if task is None:
            break  # task queue is empty; re-run handle_init_task() to refill it
        handle_douyin_web_share(task)
        time.sleep(2)
```
- The MongoDB helper script
>handle_init_task loads the IDs from the txt file into MongoDB.
>handle_get_task fetches one record and deletes it in the same operation; since the txt file still exists, deleting records from the queue is harmless (the queue can always be re-initialized).
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019/1/30 19:35
# @Author: Aries
# @Site:
# @File: handle_mongo.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    # load the share IDs from the txt file into the task_id collection
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
        for i in f_read:
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            task_id_collections.insert(task_info)


def handle_insert_douyin(douyin_info):
    # store one scraped profile in the douyin_info collection
    task_id_collections = Collection(db, 'douyin_info')
    task_id_collections.insert(douyin_info)


def handle_get_task():
    # fetch one task and remove it from the queue in a single operation
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


# run once to seed the task queue from the txt file
handle_init_task()
```
PS: The roughly 1,000 IDs in the text file are far too few for real crawling. In practice the app side and the PC side work together: the PC side seeds the initial data, then each userID's follower list is fetched and fed back into the crawl in a continuous loop, which is how a very large data set is built up.
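The follower-list expansion is not implemented in this article, but the loop would look roughly like the sketch below; fetch_follower_ids is a hypothetical placeholder for the app-side API that returns a user's follower IDs, and the queue reuses the task_id collection from handle_mongo.py:

```python
from pymongo.collection import Collection

from handle_mongo import db, handle_get_task


def fetch_follower_ids(share_id):
    # Hypothetical placeholder: in practice this would call the app-side
    # follower-list API for the given user and return a list of share IDs.
    return []


def handle_expand_tasks():
    task_id_collections = Collection(db, 'task_id')
    while True:
        task = handle_get_task()
        if task is None:
            break  # the queue is exhausted
        # ... scrape the profile for task['share_id'] as shown above ...
        for follower_id in fetch_follower_ids(task['share_id']):
            # push every follower back onto the queue so the crawl keeps growing
            task_id_collections.insert({'share_id': follower_id})
```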
>>Original article; reposting is welcome. When reposting, please credit: reposted from IT人故事會. Thank you!