
20170820_Python: fetching a site's comment messages in real time


The main tools are requests and bs4. The biggest headache was that the target site uses GB2312 encoding; Python 3 handles encodings far better than Python 2, but it was still a pain.
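For reference, a minimal sketch of the decoding dance with requests (the URL is the target site from this post; error handling omitted):

import requests

res = requests.get('http://www.3456.tv/')
# requests guesses the encoding from the HTTP headers, which is often
# wrong for old GB2312 sites; set it explicitly before touching res.text
res.encoding = 'gb2312'
html = res.text    # str, decoded with the encoding set above
raw = res.content  # raw bytes, for decoding manually instead
print(raw.decode('gb2312', errors='replace')[:200])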

My first version simulated the login with a copied cookie, but that is clumsy in practice: you have to log in to the target site in a browser first, then copy the cookie over into the code... and laziness is the prime mover!
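For context, the abandoned approach looked roughly like this; the cookie value is a placeholder you would paste from the browser's DevTools, and it expires, which is exactly why this was dropped:

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    # placeholder copied from a logged-in browser session; it must be
    # re-copied by hand every time the session expires
    'Cookie': 'ASP.NET_SessionId=xxxxxxxxxxxxxxxxxxxxxxxx',
}
res = requests.get('http://www.3456.tv/user/list_proxyall.html', headers=HEADERS)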

I planned to capture the POST data and target URL with Firefox's HttpFox, only to find that Firefox had auto-updated to 55.x while the add-on only supports 35.x. Switching to Chrome, I discovered that this site submits the POST by opening a new page, and pressing F12 in the new page is already too late to see the request. Some Baidu searching turned up a setting that makes DevTools open automatically for new tabs, as shown:

[screenshot: Chrome setting to auto-open DevTools for new tabs]

That showed exactly which fields the site POSTs, so I started replaying the POST with requests. But every attempt failed to log in, and the fetched pages came back as mojibake; str(info, encoding='utf-8') helped somewhat, but it turned out the login had never succeeded at all: the response kept prompting for the username and password.
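In hindsight, requests itself can expose this kind of mismatch: compare the encoding it picked from the headers with what the raw bytes statistically look like. A quick diagnostic sketch:

import requests

res = requests.get('http://www.3456.tv/Default.aspx')
print(res.encoding)           # what requests chose from the HTTP headers
print(res.apparent_encoding)  # what the bytes actually look like (GB2312/GBK here)
text = res.content.decode(res.apparent_encoding, errors='replace')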

Then, a flash of inspiration!!!

My guess: the data I was POSTing was UTF-8 while the target site parses the form as GB2312, so it couldn't make sense of it at all! Sure enough, after running the username (which is Chinese!!!) through username.encode("gb2312"), the login went straight through. I then switched to a session to keep the cookie and stay logged in, and once a minute the script checks whether the newest message id equals the saved id to decide whether to scrape.
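The heart of the fix is nothing more than the byte representation of the form field. A minimal illustration with a hypothetical Chinese username:

username = '经销商'               # hypothetical Chinese username
print(username.encode('utf-8'))   # the bytes that were being sent - gibberish to the server
print(username.encode('gb2312'))  # the bytes the GB2312 backend actually expects

The full script: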

#-*-coding:utf-8-*-  # encoding declaration
import requests,re,time,os
from bs4 import BeautifulSoup
from time import strftime

LOGIN_URL = 'http://www.3456.tv/Default.aspx'  # URL the login request is sent to
username = '用户名'  # placeholder; the real username is Chinese
password = 'password'
DATA = {
    'web_top_two2$txtName': username.encode('gb2312'),  # GB2312 bytes, not UTF-8 - this is the fix
    'web_top_two2$txtPass': password,
    '__VIEWSTATE': '/wEPDwULLTEyNzc4MjM2OTBkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRh3ZWJfdG9wX3R3bzIkaW1nQnRuTG9naW6/pqbjQqV358GfYjdoiOK+Ek4VWA==',
    '__EVENTVALIDATION': '/wEWBAL3y5PLCgLHgt+5BgL3r9v/CgLX77PND5R1XxTeGn4lXvBDrb6OdRyc4Xlk',
    'web_top_two2$imgBtnLogin.x': '22',
    'web_top_two2$imgBtnLogin.y': '8',
}  # login form fields captured from the browser
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}  # browser User-Agent for the simulated login
S = requests.Session()  # the session keeps the login cookie for all later requests
login = S.post(LOGIN_URL, data=DATA, headers=HEADERS)  # perform the simulated login

def getData(num):  # unused helper (never called below)
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    content = res.content
    return content

def getLast():
    # return the id of the newest message on the list page
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    content = res.content
    soup = BeautifulSoup(content, 'html.parser')
    tb = soup.find_all('tr', style='text-align:center;')
    for tag in tb:  # only the first row is needed: it holds the newest id
        see = tag.find('a', attrs={'class': 'see'})
        seestr = see['onclick']
        seenum = re.sub(r"\D", "", seestr)  # strip everything but the digits
        break
    return seenum

def isNew():
    newlastid = getLast()
    with open('lastid.txt') as txt:
        last = txt.read()
    if int(newlastid) != int(last):
        print('Time: ' + strftime("%H-%M") + ', new messages found, fetching!')
        getNewuser()
    else:
        print('Time: ' + strftime("%H-%M") + ', no new messages yet')

def getNewuser():
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    content = res.content
    soup = BeautifulSoup(content, 'html.parser')
    tb = soup.find_all('tr', style='text-align:center;')

    with open('lastid.txt') as txt:
        last = txt.read()
    userinfo = ''
    for tag in tb:
        see = tag.find('a', attrs={'class': 'see'})
        seestr = see['onclick']
        seenum = re.sub(r"\D", "", seestr)

        if int(seenum) == int(last):  # reached the newest id already saved
            break
        userinfo += (str(seeInfo(int(seenum)), encoding="utf-8") + '\n')  # decode the detail bytes as UTF-8

    userfilename = strftime("%H-%M") + '.txt'
    with open(userfilename, 'w') as f:
        f.write(str(userinfo))
    os.system(userfilename)  # open the file with its associated program (Windows)

    with open('lastid.txt', 'w') as f2:
        f2.write(str(getLast()))
    print('Fetch finished, time: ' + strftime("%H-%M") + ', running again in 60 seconds')

def seeInfo(id):
    url = 'http://www.3456.tv/user/protel.html'
    info = {'id': id}
    res = S.get(url, params=info)  # id goes in the query string (data= would put it in the request body, which a GET is unlikely to read)
    content = res.content
    return content

setsleep = 60  # polling interval in seconds

print('Is this the first run today?')
firststr = input('Type yes or no and press Enter: ')
if firststr == 'yes':
    print('Fetching...')
    lastid = getLast()
    with open('lastid.txt', 'w') as f:
        f.write(str(lastid))
    print('Time: ' + strftime("%H:%M") + ', newest message id is ' + lastid)
    print('running again in ' + str(setsleep) + ' seconds')
else:
    print('running again in ' + str(setsleep) + ' seconds')
while 1:
    isNew()
    time.sleep(int(setsleep))
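On the first run of the day answer yes, which seeds lastid.txt with the newest message id; on later runs answer no and the polling loop simply resumes from whatever id the file already holds.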
