Summer Break Side Project, Part 1: A Django-Based Grade Lookup System for the Yangtze University Academic Affairs Office

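This post touches on: Python web crawling, MySQL databases, HTML/CSS/JS basics, selenium and phantomjs basics, the MVC design pattern, the Django framework (Python's web development framework), the Apache server, and basic Linux operation (using CentOS 7 as the example). It is therefore best suited to readers who already have that background.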

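Disclaimer: this post is purely for technical exchange, and sensitive information has been filtered out, so please bear with me. (If the Yangtze University Academic Affairs Office website runs into problems for any reason, I am not responsible.)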

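Approach: the Academic Affairs Office exposes no data API (to protect student information), so the only option is to write a crawler that simulates logging in and then scrapes the data. To keep an outage of the Academic Affairs site from breaking the crawler, scraped data can be cached so that later requests are served straight from our own database. All that remains is to refresh the data on a schedule so it stays in sync with the Academic Affairs Office.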
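The cache-first idea above can be sketched roughly as follows. This is only an illustration: `GradeCache`, the `fetch` callable, and the one-day `max_age` are hypothetical names and values that do not come from the original project, and a plain dict stands in for the MariaDB tables.

```python
import time


class GradeCache(object):
    """Serve grades from a local store, refreshing from the source when stale."""

    def __init__(self, fetch, max_age=24 * 3600, clock=time.time):
        self._fetch = fetch        # callable: student_id -> list of grade rows
        self._max_age = max_age    # seconds before a cached entry is refreshed
        self._clock = clock        # injectable clock, handy for testing
        self._store = {}           # student_id -> (fetched_at, rows)

    def get(self, student_id):
        now = self._clock()
        cached = self._store.get(student_id)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]                 # fresh enough: no request to the site
        try:
            rows = self._fetch(student_id)
        except Exception:
            if cached is not None:
                return cached[1]             # site is down: fall back to stale data
            raise
        self._store[student_id] = (now, rows)
        return rows
```

The point of the `except` branch is the one made in the text: if the Academic Affairs site is unreachable, a previously cached copy is still returned instead of an error.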

Tech stack: CentOS 7 + Apache 2.4 + MariaDB 5.5 + Python 2.7.5 + mod_wsgi 3.4 + Django 1.11

------------------------------------------------------------------------

I. The Python crawler:

1. First, take a look at the login page.

[Screenshot]

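Here we use Firefox to capture and analyze the traffic. The login is submitted via POST and carries 7 parameters, and there is a captcha. There are two ways to deal with it: one is to use deep learning, which is all the rage right now, to recognize the image; the other is to download the image and let the user type it in. The first is fairly costly, so I may try it when I'm less busy. I recall Python's Pillow library (the successor to PIL) can handle the image-processing side, and I plan to try TensorFlow over the summer. The second is too crude to be worth discussing.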

2. There is also a fancier way that sidesteps the captcha entirely. I won't go into detail here; let's just simulate the login:

#coding:utf8
from bs4 import BeautifulSoup
import urllib
import urllib2
import requests
import sys

reload(sys)
sys.setdefaultencoding("gbk")

loginURL = "login URL of the Academic Affairs site"
cjcxURL = "http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx"

# Fetch the login page and read the hidden ASP.NET form fields
html = urllib2.urlopen(loginURL)
soup = BeautifulSoup(html, "lxml")
__VIEWSTATE = soup.find(id="__VIEWSTATE")["value"]
__EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")["value"]

data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__EVENTVALIDATION": __EVENTVALIDATION,
    "txtUid": "account",
    # requests URL-encodes form values itself, so pass the raw GBK bytes of
    # the button text; they go out on the wire as %B5%C7%C2%BC
    "btLogin": u"登录".encode("gbk"),
    "txtPwd": "password",
    "selKind": "1"
}
header = {
#    "Host":"jwc2.yangtzeu.edu.cn:8080",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0;… Gecko/20100101 Firefox/54.0",
    "Accept":"text/html,application/xhtml+x…lication/xml;q=0.9,*/*;q=0.8",
    "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding":"gzip, deflate",
    "Content-Type":"application/x-www-form-urlencoded",
#    "Content-Length":"644",
    "Referer":"http://jwc2.yangtzeu.edu.cn:8080/login.aspx",
#    "Cookie":"ASP.NET_SessionId=3zjuqi0cnk5514l241csejgx",
#    "Connection":"keep-alive",
#    "Upgrade-Insecure-Requests":"1",
}

UserSession = requests.session()
# Pass data and headers by keyword: the third positional argument of post() is json
Request = UserSession.post(loginURL, data=data, headers=header)
Response = UserSession.get(cjcxURL, cookies=Request.cookies, headers=header)
soup = BeautifulSoup(Response.content, "lxml")
print soup
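The hidden-field step in the script above can also be tried offline. The following is a minimal sketch, written against the Python 3 standard library (rather than the Python 2.7 and BeautifulSoup used in this post) purely so that it runs without extra packages. `HiddenFieldParser`, `hidden_fields`, `encode_form`, and the sample HTML are hypothetical; the use of GBK is inferred from the pre-encoded button value `%B5%C7%C2%BC` in the original form data.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def encode_form(form):
    # Percent-encode the form using GBK instead of the default UTF-8.
    return urlencode(form, encoding="gbk")


# Made-up stand-in for the login page, only to exercise the parser.
sample = (
    '<form id="Form1">'
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz" />'
    '<input type="text" name="txtUid" />'
    '</form>'
)
print(hidden_fields(sample))
print(encode_form({"btLogin": "登录"}))
```

Encoding the raw text once, with the right charset, avoids percent-encoding an already encoded value a second time.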

Next, this is what we see:

[Screenshot]

Now POST again (this code continues from the block above):

__VIEWSTATE2 = soup.find(id="__VIEWSTATE")["value"]
__EVENTVALIDATION2 = soup.find(id="__EVENTVALIDATION")["value"]

AllcjData = {
            "__EVENTTARGET":"btAllcj",
            "__EVENTARGUMENT":"",
            "__VIEWSTATE":__VIEWSTATE2,
            "__EVENTVALIDATION":__EVENTVALIDATION2,
            "selYear":"2017",
            "selTerm":"1",
#            "Button2":"%B1%D8%D0%DE%BF%CE%B3%C9%BC%A8"
        }
AllcjHeader = {
#       "Host":"jwc2.yangtzeu.edu.cn:8080",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0;… Gecko/20100101 Firefox/54.0",
        "Accept":"text/html,application/xhtml+x…lication/xml;q=0.9,*/*;q=0.8",
        "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding":"gzip, deflate",
        "Content-Type":"application/x-www-form-urlencoded",
#        "Content-Length":"644",
        "Referer":"http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx",
#        "Cookie":,
        "Connection":"keep-alive",
        "Upgrade-Insecure-Requests":"1",
        }
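# Pass data and headers by keyword: the third positional argument of post() is json
Request1 = UserSession.post(cjcxURL, data=AllcjData, headers=AllcjHeader)
Response1 = UserSession.get(cjcxURL, cookies=Request.cookies, headers=AllcjHeader)
soup = BeautifulSoup(Response1.content, "lxml")
print soup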

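It doesn't work. The page returned by this GET is still the original page. I can think of two reasons the POST might fail: first, ASP.NET's __VIEWSTATE and __EVENTVALIDATION variables may be causing the POST to be rejected; second, the form has several buttons and uses JS to decide between them, which defeats the crawler. For dynamically loaded pages, a plain crawler still falls short.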
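One further possibility, which has not been verified against the live site: an ASP.NET postback normally returns its result page in the response to the POST itself, so a follow-up GET of the same URL would be expected to show the initial page again. The sketch below builds the form that a call to `__doPostBack('btAllcj', '')` would submit and reads the body of the POST response directly. It uses the Python 3 standard library so that it runs on its own; `build_postback`, `make_opener`, and `post_form` are hypothetical helpers, and the field names are the ones from the form data above.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode


def build_postback(hidden, target, argument="", extra=None):
    """Form data equivalent to the page calling __doPostBack(target, argument)."""
    form = dict(hidden)                  # __VIEWSTATE, __EVENTVALIDATION, ...
    form["__EVENTTARGET"] = target
    form["__EVENTARGUMENT"] = argument
    form.update(extra or {})             # e.g. selYear / selTerm
    return form


def make_opener():
    """An opener that keeps cookies, so the login session carries over."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def post_form(opener, url, form, referer):
    """POST the form and return the body of that response (no second GET)."""
    body = urlencode(form, encoding="gbk").encode("ascii")
    request = urllib.request.Request(url, data=body, headers={"Referer": referer})
    with opener.open(request) as response:
        return response.read()


# Placeholder hidden-field values; real ones come from the page being posted back.
form = build_postback(
    {"__VIEWSTATE": "abc123", "__EVENTVALIDATION": "xyz"},
    target="btAllcj",
    extra={"selYear": "2017", "selTerm": "1"},
)
print(sorted(form))
```

In use, the same opener would first perform the login POST and then be passed to `post_form` together with the grade page URL, so that the session cookie is shared between the two requests.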

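3. Now for something fancier: selenium (a web automation testing tool that can simulate mouse clicks) plus phantomjs (a headless browser with no UI, faster than both Chrome and Firefox).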

Installing selenium: pip install selenium

Installing phantomjs:

(1) Download: http://phantomjs.org/download.html (I downloaded the 64-bit Linux build)

(2) Extract: tar -jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /usr/share/

(3) Install dependencies: yum install fontconfig freetype libfreetype.so.6 libfontconfig.so.1

(4) Set the environment variable: export PATH=$PATH:/usr/share/phantomjs-2.1.1-linux-x86_64/bin

(5) Type phantomjs in the shell; if you land in its interactive prompt, the installation succeeded.
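The PATH check from the last step can also be done from Python before selenium tries to launch the browser. This is a small sketch using the Python 3 standard library (`shutil.which` is not available in Python 2.7); `phantomjs_version` is a hypothetical helper name.

```python
import shutil
import subprocess


def phantomjs_version(binary="phantomjs"):
    """Return the version string of the binary on PATH, or None if it is missing."""
    path = shutil.which(binary)
    if path is None:
        return None
    output = subprocess.check_output([path, "--version"])
    return output.decode("ascii", "replace").strip()
```

Calling `phantomjs_version()` returns `None` when the PATH export has not taken effect in the current shell, which is easier to diagnose than a selenium startup error.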

Please ignore my commented-out lines:

#coding:utf8
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import urllib
import urllib2
import sys 


reload(sys)
sys.setdefaultencoding("utf8")

driver = webdriver.PhantomJS()
driver.get("login URL of the Academic Affairs site")
driver.find_element_by_name("txtUid").send_keys("account")
driver.find_element_by_name("txtPwd").send_keys("password")
driver.find_element_by_id("btLogin").click()
cookie=driver.get_cookies()
driver.get("http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx")
#print driver.page_source
#driver.find_element_by_xpath("//input[@name='btAllcj'][@type='button']")
#js = "document.getElementById('btAllcj').onclick=function(){__doPostBack('btAllcj','')}"
#js = "var ob; ob=document.getElementById('btAllcj');ob.focus();ob.click();)"
#driver.execute_script("document.getElementById('btAllcj').click();")
#time.sleep(2)                            # pause briefly between actions
#driver.find_element_by_link_text("全部成績").click() # find the link and click it
#time.sleep(2)
#js1 = "document.Form1.__EVENTTARGET.value='btAllcj';"
#js2 = "document.Form1.__EVENTARGUMENT.value='';"
#driver.execute_script(js1)
#driver.execute_script(js2)
#driver.find_element_by_name('__EVENTTARGET').send_keys('btAllcj')
#driver.find_element_by_name('__EVENTARGUMENT').send_keys('')
#js = "var input = document.createElement('input');input.setAttribute('type', 'hidden');input.setAttribute('name', '__EVENTTARGET');input.setAttribute('value', '');document.getElementById('Form1').appendChild(input);var input = document.createElement('input');input.setAttribute('type', 'hidden');input.setAttribute('name', '__EVENTARGUMENT');input.setAttribute('value', '');document.getElementById('Form1').appendChild(input);var theForm = document.forms['Form1'];if (!theForm) {    theForm = document.Form1;}function __doPostBack(eventTarget, eventArgument) {    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {        theForm.__EVENTTARGET.value = eventTarget;        theForm.__EVENTARGUMENT.value = eventArgument;        theForm.submit();    }   }__doPostBack('btAllcj', '')"
#js = "var script = document.createElement('script');script.type = 'text/javascript';script.text='if (!theForm) {    theForm = document.Form1;}function __doPostBack(eventTarget, eventArgument) {    if     (!theForm.onsubmit || (theForm.onsubmit() != false)) {        theForm.__EVENTTARGET.value = eventTarget;        theForm.__EVENTARGUMENT.value = eventArgument;        theForm.submit();  }}';document.body.appendChild(script);"
#driver.execute_script(js)
driver.find_element_by_name("Button2").click()
html=driver.page_source
soup = BeautifulSoup(html,"lxml")
print soup
tables = soup.findAll("table")
for tab in tables:
  for tr in tab.findAll("tr"):
    print "--------------------"
    for td in tr.findAll("td")[0:3]:
      print td.getText()

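For now I can only get the grades for required courses, because "all grades" is triggered by JS generated by ASP rather than by a direct submit. I'm still looking for a solution. Next, on to the database design.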
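The table-walking loop in the script above prints cells one by one; before they can go into a database they need to become records. The following is a rough sketch of that step using only the Python 3 standard library. `TableRows` and `grade_records` are hypothetical names, and the column names `course`, `credit`, and `grade` are guesses, since the script only takes the first three cells of each row without naming them.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # cells of the row being read, or None outside a row
        self._cell = None    # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:                # skip rows with no <td>, e.g. header rows
                self.rows.append(self._row)
            self._row = None


def grade_records(html, columns=("course", "credit", "grade")):
    """Turn each table row into a dict keyed by the assumed column names."""
    parser = TableRows()
    parser.feed(html)
    return [dict(zip(columns, row[:len(columns)])) for row in parser.rows]
```

Passing `driver.page_source` to `grade_records` would give a list of dicts, one per table row, which is a convenient shape for inserting into the tables designed in the next section.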

II. MariaDB student database design. This reuses material from the lab sessions of our SQL Server database principles course.

My database creation statements:

create database jwc character set utf8;

use jwc;

create table Student(
    Sno char(9) primary key,
    Sname varchar(20) unique,
    Sdept char(20),
    Spwd char(20)
);
create table Course(
    Cno   char(2) primary key,
    Cname varchar(30) unique,
    Credit  numeric(2,1)
);
create table SC( 
    Sno char(9) not null,
    Cno char(2) not null,
    Grade int check(Grade>=0 and Grade<=100),
    primary key(Sno,Cno),
    foreign key(Sno) references Student(Sno),
    foreign key(Cno) references Course(Cno)
);
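To show how scraped grades might be written into these tables, here is a minimal sketch. It uses an in-memory `sqlite3` database as a stand-in for MariaDB so that it runs anywhere; `save_grades` and the sample rows are hypothetical. With a MariaDB driver the placeholder style and the upsert syntax would differ (for example `insert ... on duplicate key update` instead of SQLite's `insert or replace`).

```python
import sqlite3

SCHEMA = """
create table Student(
    Sno char(9) primary key,
    Sname varchar(20) unique,
    Sdept char(20),
    Spwd char(20)
);
create table Course(
    Cno char(2) primary key,
    Cname varchar(30) unique,
    Credit numeric(2,1)
);
create table SC(
    Sno char(9) not null,
    Cno char(2) not null,
    Grade int check(Grade>=0 and Grade<=100),
    primary key(Sno,Cno),
    foreign key(Sno) references Student(Sno),
    foreign key(Cno) references Course(Cno)
);
"""


def save_grades(conn, sno, grades):
    """Insert or refresh (Cno, Grade) pairs for one student."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "insert or replace into SC(Sno, Cno, Grade) values (?, ?, ?)",
            [(sno, cno, grade) for cno, grade in grades],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("insert into Student values ('201500001', 'Alice', 'CS', 'secret')")
conn.execute("insert into Course values ('01', 'Databases', 3.5)")
save_grades(conn, "201500001", [("01", 88)])
save_grades(conn, "201500001", [("01", 92)])   # a later sync overwrites the old grade
print(conn.execute("select Sno, Cno, Grade from SC").fetchall())
```

Because (Sno, Cno) is the primary key of SC, re-running the sync replaces a student's grade for a course rather than adding a duplicate row.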

III. Setting up the Python web environment (Linux + Apache + MariaDB + Python):

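Because the HTTP server chosen this time is Apache, mod_wsgi (an Apache module implementing WSGI, Python's Web Server Gateway Interface) has to be installed so that Apache and the Python program can talk to each other. With nginx you would install and configure uwsgi instead. It plays a role similar to Java's servlets and PHP's php-fpm.

Install: yum install mod_wsgi

Configure: vim /etc/httpd/conf/httpd.conf

This configuration cost me a lot of thought and time, and much of what is online is wrong. Here is what I consider a standard configuration for Python web development with Django. You're welcome to it.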

#config python web
LoadModule wsgi_module modules/mod_wsgi.so  
<VirtualHost *:8080>
    ServerAdmin [email protected]-Yan
    ServerName www.yuol.online
    ServerAlias yuol.online

    Alias /media/ /var/www/html/jwc/media/
    Alias /static/ /var/www/html/jwc/static/
    <Directory /var/www/html/jwc/static/>    
        Require all granted
    </Directory>
    
    WSGIScriptAlias / /var/www/html/jwc/jwc/wsgi.py 
#    DocumentRoot "/var/www/html/jwc/jwc"
    ErrorLog "logs/www.yuol.online-error_log"
    CustomLog "logs/www.yuol.online-access_log" common
    
    <Directory "/var/www/html/jwc/jwc">
        <Files wsgi.py>
            AllowOverride All 
            Options Indexes FollowSymLinks Includes ExecCGI
            Require all granted
        </Files>    
    </Directory>
</VirtualHost>
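The file named in WSGIScriptAlias has to expose a module-level callable called `application`, which is what mod_wsgi looks for by default; in a Django project the generated wsgi.py provides it. To check the Apache side on its own before involving Django, a bare WSGI callable like the sketch below can be dropped in temporarily. The response text is made up for illustration.

```python
def application(environ, start_response):
    """Smallest useful WSGI app: echo the requested path as plain text."""
    path = environ.get("PATH_INFO", "/")
    body = ("mod_wsgi is working, path=%s\n" % path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

If this responds on port 8080 but the Django site does not, the problem is more likely in the project (settings, paths, permissions) than in the VirtualHost block.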
