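Python pseudocode: crawling 完美誌願 (wmzy.com) for nationwide historical liberal-arts and science admission score lines (working code, continuously updated)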
阿新 • Published: 2018-06-13
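Tags: Python, crawler, gaokao, project

Lately a lot of friends have said they want a hands-on project to practice on, so I spent a little time putting together a crawler project. (The code may pick up problems when copied; fixing the indentation should be enough to resolve them.)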
For more source code, Q&A, or discussion, you can join the group: 725479218
# -*- coding:utf-8 -*-
import hashlib

from furl import furl

# The four modules below are the author's own helpers; their source is not shown in this post.
from function.data_tool import clean_data
from crawlers.downloader import Downloader
from function.parse_tool import xpath_parse
from function.database_tool import auto_sqlsever

down = Downloader(proxy='http://104.224.138.224:8888/proxy')

# Province name -> province code expected by the API.
a = {'吉林': '22', '河北': '13', '陜西': '61', '山西': '14', '青海': '63', '湖南': '43', '廣東': '44', '安徽': '34',
     '四川': '51', '江西': '36', '浙江': '33', '貴州': '52', '新疆': '65', '內蒙古': '15', '西藏': '54', '江蘇': '32',
     '廣西': '45', '湖北': '42', '海南': '46', '河南': '41', '山東': '37', '福建': '35', '雲南': '53', '上海': '31',
     '北京': '11', '天津': '12', '甘肅': '62', '寧夏': '64', '黑龍江': '23', '重慶': '50', '遼寧': '21'}

# Provinces to crawl, in crawl order.
b = ['安徽', '北京', '重慶', '福建', '甘肅', '貴州', '廣東', '廣西', '湖北', '海南', '黑龍江', '湖南', '河南', '河北',
     '吉林', '江西', '江蘇', '遼寧', '寧夏', '內蒙古', '青海', '山西', '山東', '陜西', '四川', '上海', '天津', '西藏',
     '新疆', '雲南', '浙江']

# Subject types accepted by the API: 'wen' (liberal arts) and 'li' (science).
c = ['wen', 'li']

url = 'https://www.wmzy.com/api/score/getScoreList?type=wen&province=33'
reform_url = furl(url)
W = auto_sqlsever.Mssql(database='provincescore', datatable=['ScoreProvince'])


def row_values(tree, tag, label):
    # Take the text of the table row whose first cell is `label`,
    # strip the label and stray whitespace, and split it into values.
    text = tree.xpath('string(//%s[normalize-space(text())="%s"]/..)' % (tag, label))
    for junk in ('\r', '\t', label, ' '):
        text = text.replace(junk, '')
    return text.split()


for province in b:
    for subject in c:
        field_info = []
        key_word = a[province]
        # Rewrite the query string for this province/subject pair.
        reform_url.args['type'] = subject
        reform_url.args['province'] = key_word
        response = down.get(url=reform_url, typ='text', encoding='utf-8')
        # Caution: eval() on a remote response is unsafe; json.loads() is the
        # safer choice if the payload is valid JSON.
        htmlcode = eval(clean_data.clean_space(response))['htmlStr']
        xpath_html = xpath_parse.text_tolxml(htmlcode)

        # The labels must match the page text exactly; if the page is served in
        # Simplified Chinese, use the Simplified forms of these strings.
        year_split = row_values(xpath_html, 'th', '錄取批次')
        ben_yi_split = row_values(xpath_html, 'td', '本科第一批')
        ben_er_split = row_values(xpath_html, 'td', '本科第二批')
        ben_san_split = row_values(xpath_html, 'td', '本科第三批')
        zhuan_yi_split = row_values(xpath_html, 'td', '專科第一批')
        zhuan_er_split = row_values(xpath_html, 'td', '專科第二批')

        if 'wen' in subject:
            subject = '文科'
        else:
            subject = '理科'
        print(zhuan_yi_split, zhuan_er_split, ben_san_split, ben_er_split, ben_yi_split)

        # The fixed length 8 assumes the table always covers exactly 8 years.
        provincemd5 = [hashlib.md5(province.encode()).hexdigest()] * 8
        tiqian = [0] * 8
        field_info.extend([[province] * 8, provincemd5, year_split, [subject] * 8, tiqian,
                           ben_yi_split, ben_er_split, ben_san_split, zhuan_yi_split, zhuan_er_split])
        W.insert_data(field_info)
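The script above relies on helper modules that are not included in the post, so it cannot be run as-is. As a rough, standard-library-only sketch of three of its steps (building the per-province URL, pulling `htmlStr` out of the response, and turning the column-wise lists into one row per year), something like the following might work. The function names `build_url`, `extract_html`, and `build_rows` are hypothetical and not part of the original code. It also assumes the API returns valid JSON, which the post does not confirm. The score values in the usage lines are made-up placeholders, not real data. `urllib.parse` stands in for `furl` and `json.loads` stands in for `eval` here.

```python
import hashlib
import json
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_url(base_url, subject, province_code):
    """Return base_url with its query string replaced by type/province."""
    parts = urlsplit(base_url)
    query = urlencode({"type": subject, "province": province_code})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def extract_html(response_text):
    """Decode a JSON payload and return its 'htmlStr' field.

    Assumes the response body is valid JSON (not confirmed by the post).
    """
    return json.loads(response_text)["htmlStr"]


def build_rows(province, subject, years, *score_columns):
    """Zip column-wise lists into one tuple per year.

    The row count follows len(years) instead of a hard-coded 8.
    Columns shorter than `years` are padded with None at the end so
    every row has the same width; extra trailing values are ignored.
    """
    province_md5 = hashlib.md5(province.encode("utf-8")).hexdigest()
    padded = [
        list(col) + [None] * (len(years) - len(col)) for col in score_columns
    ]
    return [
        (province, province_md5, year, subject) + tuple(col[i] for col in padded)
        for i, year in enumerate(years)
    ]


# Usage with placeholder inputs (the scores below are made up).
print(build_url("https://www.wmzy.com/api/score/getScoreList?type=wen&province=33", "li", "11"))
print(extract_html('{"htmlStr": "<table></table>"}'))
for row in build_rows("北京", "理科", ["2017", "2016"], ["500", "510"], ["400"]):
    print(row)
```

Sizing the rows from the scraped year list avoids silently misaligned inserts when a province's table has more or fewer than 8 years, at the cost of leaving `None` where a batch has no value.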