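Scraping Data from a Website That Requires Login
阿新 • Published: 2019-02-13
This post scrapes one student's course schedule from a university website.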
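First, inspect the site's login request in the browser's developer tools to find its method and form data.
The code below sends the form as a POST request, because urllib switches from GET to POST whenever a request carries a data body.
The form has two fields, u and p, which carry the login credentials.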
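To check which method urllib will use before anything is sent, the request can be built and inspected offline. The sketch below is only an illustration: build_login_request is a hypothetical helper name and the credentials are placeholders; only the URL and the field names u and p come from this post.

```python
import urllib.parse
import urllib.request

LOGIN_URL = "https://gsdb.bjtu.edu.cn/client/login/"


def build_login_request(user, password, url=LOGIN_URL):
    """Build, but do not send, the login request (hypothetical helper)."""
    # urlencode turns the dict into "u=...&p=..." and escapes special characters.
    body = urllib.parse.urlencode({"u": user, "p": password}).encode("utf-8")
    # A Request that carries a data body is sent as POST; without data it is GET.
    return urllib.request.Request(url, data=body)


request = build_login_request("XXXXXX", "XXXXXX")  # placeholder credentials
print(request.get_method())  # the method urllib will use
print(request.data)          # the encoded form body
```

Nothing here touches the network, so it is a safe way to confirm the encoded body matches what the browser's developer tools show.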
import http.cookiejar
import urllib.parse
import urllib.request

url = "https://gsdb.bjtu.edu.cn/client/login/"
agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36'
# The cookie jar keeps the session cookies the server sets at login.
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
headers = {'User-Agent': agent}
# Replace the placeholders with a real account and password.
postdata = urllib.parse.urlencode({'u': 'XXXXXX', 'p': 'XXXXXX'})
postdata = postdata.encode('UTF-8')
# Passing a data body makes urllib send this request as POST.
request = urllib.request.Request(url, postdata, headers)
result = opener.open(request)
print(result.read().decode('UTF-8'))
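Login succeeded.
Once logged in, the same opener can fetch the user's other pages, because it sends the session cookies with every request.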
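How the cookie jar attaches the session to later requests can be seen without any network access. This is a minimal sketch under stated assumptions: the cookie name sessionid, its value, and the example.com and example.org hosts are all made up, and the cookie is built by hand where a real server would set it through a Set-Cookie header.

```python
import http.cookiejar
import urllib.request


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (illustrative stand-in for Set-Cookie)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_session_cookie("sessionid", "abc123", "example.com"))

# The jar only adds a Cookie header to requests whose host matches the cookie.
same_site = urllib.request.Request("http://example.com/schedule/")
other_site = urllib.request.Request("http://example.org/schedule/")
jar.add_cookie_header(same_site)
jar.add_cookie_header(other_site)

print(same_site.get_header("Cookie"))
print(other_site.get_header("Cookie"))
```

HTTPCookieProcessor performs the same add_cookie_header step automatically for every request opened through the opener, which is why the schedule page can be fetched after login.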
result = opener.open('https://gsdb.bjtu.edu.cn/course_selection/select/schedule/')
pagecode = result.read().decode('utf-8')
print(pagecode)
Scraping the course schedule
import re

# Each <tr> is one table row; each <td> inside it is one cell.
pattern = re.compile('<tr>(.*?)</tr>', re.S)
pat = re.compile('<td>(.*?)</td>', re.S)
items = re.findall(pattern, pagecode)
for item in items:
    its = re.findall(pat, item)
    for it in its:
        print(it)
It runs successfully!
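The row-and-cell extraction can be tried on a small sample before pointing it at the live page. The sketch below is not the site's real markup: SAMPLE_HTML and the parse_table helper are made up for illustration, and the patterns are loosened slightly so that tags with attributes, such as a tr with a class, still match.

```python
import re

# \b keeps "<tr" from matching longer tag names; [^>]* allows attributes.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.S)
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.S)


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = [cell.strip() for cell in CELL_RE.findall(row_html)]
        if cells:  # rows with no <td> cells (e.g. header rows) are skipped
            rows.append(cells)
    return rows


# Made-up sample markup, not taken from the real schedule page.
SAMPLE_HTML = """
<table>
  <tr><th>Course</th><th>Time</th></tr>
  <tr class="odd"><td>Algorithms</td><td> Mon 08:00 </td></tr>
  <tr><td>Databases</td><td>Wed 14:00</td></tr>
</table>
"""

for row in parse_table(SAMPLE_HTML):
    print(row)
```

Regular expressions are fragile on nested or malformed HTML, so for anything beyond a flat table a real HTML parser is usually the safer choice.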
Running in separate steps
cookie.py logs in and saves the site's cookies to cookie.txt.
import http.cookiejar
import urllib.parse
import urllib.request

filename = 'cookie.txt'
# MozillaCookieJar can write its cookies to a file; a plain CookieJar cannot.
cookie = http.cookiejar.MozillaCookieJar(filename)
url = "https://gsdb.bjtu.edu.cn/client/login/"
agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36'
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
headers = {'User-Agent': agent}
# Replace the placeholders with a real account and password.
postdata = urllib.parse.urlencode({'u': 'XXXXXX', 'p': 'XXXXXX'})
postdata = postdata.encode('UTF-8')
request = urllib.request.Request(url, postdata, headers)
result = opener.open(request)
print(result.read().decode('utf-8'))
# Also write session cookies (normally discarded) and expired cookies to the file.
cookie.save(ignore_discard=True, ignore_expires=True)
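Why the save call passes ignore_discard=True can be checked with a temporary file. This is a sketch under assumptions: the session cookie below is hand-built with a made-up name and host, and count_after_reload is a hypothetical helper; it saves one session cookie with and without the flag and counts what a fresh jar reads back.

```python
import http.cookiejar
import os
import tempfile


def session_cookie(name, value, domain):
    """Hand-built session cookie: no expiry, marked as discard-on-exit."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def count_after_reload(path, ignore_discard):
    """Save one session cookie to path, then count cookies a fresh jar loads."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(session_cookie("sessionid", "abc123", "example.com"))
    jar.save(ignore_discard=ignore_discard)
    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(path, ignore_discard=True)
    return len(loaded)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")
    kept = count_after_reload(path, ignore_discard=True)
    dropped = count_after_reload(path, ignore_discard=False)

print(kept, dropped)
```

Login sessions are typically carried by exactly this kind of cookie, so saving without the flag would likely leave a cookie.txt that cannot restore the login.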
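spider.py imports cookie.py, which runs the login and writes cookie.txt, then loads the cookies from cookie.txt so that its requests are treated as logged in.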
import http.cookiejar
import re
import urllib.request

import cookie  # importing cookie.py runs the login and refreshes cookie.txt

# Use a separate name so the jar does not shadow the cookie module.
jar = http.cookiejar.MozillaCookieJar()
jar.load('cookie.txt', ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
result = opener.open('https://gsdb.bjtu.edu.cn/course_selection/select/schedule/')
pagecode = result.read().decode('utf-8')
# Each <tr> is one table row; each <td> inside it is one cell.
pattern = re.compile('<tr>(.*?)</tr>', re.S)
pat = re.compile('<td>(.*?)</td>', re.S)
items = re.findall(pattern, pagecode)
for item in items:
    its = re.findall(pat, item)
    for it in its:
        print(it)
It runs successfully!