[Python] [Crawler] 6. An automated crawler that scrapes and pushes tender and award notices from government websites in bulk: the page resolver

阿新 • Published: 2018-11-09

1.Intro

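File name: pageResolver.py

Module name: page resolver

Libraries used:

re, lxml, datetime, sys, retry, random, urllib2

Custom module imported: configManager

Purpose: parses page source to extract the relevant data, stores each row record as a dictionary, and finally returns a list of those dictionaries.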
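The row-to-dictionary idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample markup, the function name parse_rows, the English dictionary keys, and the base URL are all hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml so that the sketch runs without third-party packages (it therefore needs well-formed markup, which real pages may not be).

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a listing table (not taken from a real site).
SAMPLE = """
<div>
  <table id="data_tab">
    <tbody>
      <tr><th>No.</th><th>Project number</th><th>Title</th><th>Published</th></tr>
      <tr>
        <td>1</td>
        <td>YN-2018-001</td>
        <td><a href="/jyxx/detail?guid=abc">
            Tender-Notice-001
        </a></td>
        <td>2018-06-06</td>
      </tr>
    </tbody>
  </table>
</div>
"""


def parse_rows(markup, base_url="https://example.invalid"):
    """Return one dictionary per data row of the listing table."""
    root = ET.fromstring(markup)
    results = []
    for row in root.findall(".//table[@id='data_tab']/tbody/tr"):
        cells = row.findall("td")
        if not cells:
            # Header rows contain <th> cells only, so skip them
            continue
        link = cells[2].find("a")
        results.append({
            "project_number": cells[1].text,
            # Remove all whitespace, as the resolvers do with a regex
            "title": "".join(link.text.split()),
            "published": datetime.datetime.strptime(cells[3].text, "%Y-%m-%d"),
            "link": base_url + link.get("href"),
            "pushed": False,
        })
    return results


records = parse_rows(SAMPLE)
```

Each resolver in the source below follows the same shape: select the rows, skip the header, pull fields out by cell position, and append a dictionary with a push flag set to False.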

2.Source

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
# Author  : YSW
# Time    : 2018/6/6 14:04
# File    : pageResolver.py
# Version : 1.1
# Describe: Page resolver
# Update  :
        1. Added parsing methods for award-notice pages
'''

import re
from lxml import etree
import datetime
import sys
from retry import retry
import configManager
import random
import urllib2
# Set the default encoding to avoid garbled Chinese characters (Python 2 only)
defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)

HEADERS = {
    "User-Agent": random.choice(configManager.headers)
}

class Resolver(object):
    def time_parse(self, currentTime):
        '''
        Parse a date string and return it as a datetime object
        :param currentTime: date string in the form YYYY-MM-DD
        :return: the corresponding datetime.datetime value
        '''
        date = datetime.datetime.strptime(currentTime, '%Y-%m-%d')
        return date

    #### Tender data ####

    @retry(tries=3, delay=2)
    def resovler_ynsggzxxt(self, html, page_num):
        '''
        Resolver for the Yunnan Public Resources Trading Center electronic service system
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        # XPath for the tender rows
        xpathPattern = "//div/table[@id='data_tab']/tbody/tr"

        # Use XPath to get the list of matching rows
        node_list = text.xpath(xpathPattern)

        # Regex that strips whitespace from field values
        strParse = re.compile(r"\s")

        # Iterate over the matched rows
        for node in node_list:
            # Skip header rows, which contain no <td> cells
            if len(node.xpath("./td")) > 0:
                # Project number
                projectNumber = node.xpath("./td")[1].text

                # Notice title (whitespace stripped)
                title = strParse.sub("", node.xpath("./td/a")[0].text)

                # Publication date
                startTime = node.xpath("./td")[3].text
                start_time = self.time_parse(startTime)

                # Deadline
                endTime = node.xpath("./td")[4].text
                end_time = self.time_parse(endTime)

                # Status (whitespace stripped; a missing text node counts as empty)
                status = strParse.sub("", node.xpath("./td")[5].text or "")

                # If the status is empty, fall back to the nested <i> tag
                if status == "":
                    status = strParse.sub("", node.xpath("./td/i")[0].text)

                # Link (href) address
                href = "https://www.ynggzyxx.gov.cn" + str(node.xpath("./td/a/@href")[0])

                # Store in a dictionary
                resolveMessage = {
                    "專案編號": projectNumber,
                    "公告標題": title,
                    "釋出時間": start_time,
                    "截止時間": end_time,
                    "狀態": status,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)
    def resovler_ynsggzzw(self, html, page_num):
        '''
        Resolver for the Yunnan Public Resources Trading Center website
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # XPath for the tender rows
        xpathPattern = "//table[@id='data_tab']/tbody/tr"

        # Parse the source and return an element tree
        text = etree.HTML(html)

        # Use XPath to get the list of matching rows
        node_list = text.xpath(xpathPattern)

        # Regex that strips whitespace from field values
        strParse = re.compile(r"\s")

        # Iterate over the matched rows
        for node in node_list:
            # Skip header rows, which contain no <td> cells
            if len(node.xpath("./td")) > 0:
                # Serial number
                serialNumber = node.xpath("./td")[0].text

                # Project number
                projectNumber = node.xpath("./td")[1].text

                # Link (href) address
                href = "https://www.ynggzyxx.gov.cn" + str(node.xpath("./td/a/@href")[0])

                # Publication date
                startTime = node.xpath("./td")[3].text
                start_time = self.time_parse(startTime)

                # Notice title (whitespace stripped)
                title = strParse.sub("", node.xpath("./td/a")[0].text)

                # Store in a dictionary
                resolveMessage = {
                    "專案編號": projectNumber,
                    "公告標題": title,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)
    def resovler_kmsgg(self, html, page_num):
        '''
        Resolver for the Kunming Public Resources Trading Center website
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        node_list = text.xpath("//div[@class='zb_from']/table/tbody/tr")
        # Skip row 0 (presumably the header); read at most 15 rows and stop early on short pages
        for i in range(1, min(16, len(node_list))):
            # Number
            num = node_list[i].xpath("./td")[1].text

            # Project name
            project_name = (node_list[i].xpath("./@field_bdmcggbt")[0]).encode('utf8')

            # Link
            href = "https://www.kmggzy.com/Jyweb/" + str(node_list[i].xpath("./td/a/@href")[0])

            start_time = None
            # Start date
            startTime = node_list[i].xpath("./td")[3].text
            if startTime is not None:
                start_time = self.time_parse(startTime)

            end_time = None
            # End date
            endTime = node_list[i].xpath("./td")[4].text
            if endTime is not None:
                end_time = self.time_parse(endTime)

            status = None
            # Status
            if node_list[i].xpath("./td")[5].text is not None:
                status = (node_list[i].xpath("./td")[5].text).encode('utf8')

            # Store in a dictionary
            if num and project_name and start_time and end_time and status is not None:
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "釋出時間": start_time,
                    "結束時間": end_time,
                    "狀態": status,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)
    def resovler_kmsgg_gc(self, html, page_num):
        '''
        Resolver for the Kunming Public Resources Trading Center website
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        node_list = text.xpath("//div[@class='zb_from']/table/tbody/tr")
        # Skip row 0 (presumably the header); read at most 15 rows and stop early on short pages
        for i in range(1, min(16, len(node_list))):
            # Number
            num = node_list[i].xpath("./td")[1].text

            # Project name
            project_name = (node_list[i].xpath("./@field_bdmcggbt")[0]).encode('utf8')

            # Link
            href = "https://www.kmggzy.com/Jyweb/" + str(node_list[i].xpath("./td/a/@href")[0])

            start_time = None
            # Start date
            startTime = node_list[i].xpath("./td")[3].text
            if startTime is not None:
                start_time = self.time_parse(startTime)

            end_time = None
            # End date
            endTime = node_list[i].xpath("./td")[4].text
            if endTime is not None:
                end_time = self.time_parse(endTime)

            status = None
            # Status
            if node_list[i].xpath("./td")[5].text is not None:
                status = (node_list[i].xpath("./td")[5].text).encode('utf8')

            # Store in a dictionary
            if num and project_name and start_time and end_time and status is not None:
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "釋出時間": start_time,
                    "結束時間": end_time,
                    "狀態": status,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)
    def resovler_ynsggzxxt_zf(self, html, page_num):
        '''
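        Resolver for the Yunnan Public Resources Trading Center electronic service system (government procurement)
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries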
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        # XPath for the tender rows
        xpathPattern = "//div/table[@id='data_tab']/tbody/tr"

        # Use XPath to get the list of matching rows
        node_list = text.xpath(xpathPattern)

        # Regex that strips whitespace from field values
        strParse = re.compile(r"\s")

        # Iterate over the matched rows
        for node in node_list:
            # Skip header rows, which contain no <td> cells
            if len(node.xpath("./td")) > 0:
                # Project number
                projectNumber = node.xpath("./td")[1].text

                # Notice title (whitespace stripped)
                title = strParse.sub("", node.xpath("./td/a")[0].text)

                # Publication date
                startTime = node.xpath("./td")[3].text
                start_time = self.time_parse(startTime)

                # Deadline
                endTime = node.xpath("./td")[4].text
                end_time = self.time_parse(endTime)

                # Status (whitespace stripped; a missing text node counts as empty)
                status = strParse.sub("", node.xpath("./td")[5].text or "")

                # If the status is empty, fall back to the nested <i> tag
                if status == "":
                    status = strParse.sub("", node.xpath("./td/i")[0].text)

                # Link (href) address
                href = "https://www.ynggzyxx.gov.cn" + str(node.xpath("./td/a/@href")[0])

                # Store in a dictionary
                resolveMessage = {
                    "專案編號": projectNumber,
                    "公告標題": title,
                    "釋出時間": start_time,
                    "截止時間": end_time,
                    "狀態": status,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)

        return resolveResult

    @retry(tries=3, delay=2)
    def resovler_ynszfcgw(self, html, page_num):
        '''
        Resolver for the Yunnan Government Procurement website
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)
        for i in range(0, 10):
            node_list = text.xpath("//tr[@data-row-id='{0}']".format(i))

            for node in node_list:
                text_total = node.xpath('./td')[0].xpath('./a')[0].text

                # Number (the text before the full-width colon)
                num = text_total[:text_total.find('：')]

                # Project name (the text after the full-width colon)
                project_name = text_total[text_total.find('：') + 1:]

                # Administrative region
                area = node.xpath('./td')[2].text

                time_push = None
                # Publication date
                timePush = node.xpath('./td')[3].text
                if timePush is not None:
                    time_push = self.time_parse(timePush)

                # Link
                cursor = node.xpath('./td')[0].xpath('./a/@data-bulletin_id')[0]

                href = "http://www.yngp.com/newbulletin_zz.do?method=preinsertgomodify&operator_state=1&flag=view&bulletin_id={0}".format(
                    cursor)

                # Store in a dictionary
                if num and project_name and area and href and time_push is not None:
                    resolveMessage = {
                        "編號": num,
                        "工程名稱": project_name,
                        "釋出時間": time_push,
                        "區劃": area,
                        "連結": href,
                        "推送": False
                    }
                    resolveResult.append(resolveMessage)
        return resolveResult

    #### Award data ####
    @retry(tries=3, delay=2)
    def get_url(self, url, proxy_dict):
        proxyIP = proxy_dict['ip']
        proxyPort = proxy_dict['port']
        proxyProtocol = proxy_dict['protocol']
        proxy_handler = urllib2.ProxyHandler({proxyProtocol: "{0}:{1}".format(proxyIP, proxyPort)})

        opener_proxy = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener_proxy)
        request = urllib2.Request(url=url, headers=HEADERS)
        response = urllib2.urlopen(request)
        html = response.read()

        return html

    @retry(tries=3, delay=2)  # 70%
    def resovler_ynsggzxxt_gc_zb(self, html, page_num, proxy_dict):
        '''
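        Resolver for the Yunnan Public Resources Trading Information Network, construction projects, award notices
        :param html: page source
        :param page_num: page number
        :param proxy_dict: proxy settings with 'ip', 'port' and 'protocol' keys, passed to get_url
        :return: list of data dictionaries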
        '''
        def resolve_pp_0(html):
            try:
                people = ''
                price = 0.0
                text = etree.HTML(html)
                node_second_list = text.xpath("//div[@class='con']//tr")
                for node_second in node_second_list:
                    if "中標人：" == node_second.xpath("./td")[0].text:
                        people = node_second.xpath("./td")[1].xpath('./b//span')[0].text
                    if "中標價" in node_second.xpath("./td")[0].text:
                        totalCount = node_second.xpath("./td")[1].xpath('./b//span')[0].text
                        price = float(re.sub(r"\D", "", totalCount))
                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_1(html):
            '''
            Detail-page resolver 1
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=7befec50-6cf1-49b1-a5ec-b3b1cf6d3ab2&isOther=false
            :param html: page source
            :return: winning bidder and award price
            '''
            try:
                people = ''
                price = 0.0
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//table"
                node_list = text.xpath(xpathPattern)[0]
                for index, node in enumerate(node_list):
                    if index == 7:
                        people = node.xpath('./td//tr')[1].xpath('./td')[1].text
                        price_tmp = node.xpath('./td//tr')[1].xpath('./td')[6].text
                        if price_tmp == 0 or price_tmp == '/':
                            price = float(0.0)
                # print("中標人： {0}，中標價：{1}".format(people, price))
                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_2(html):
            '''
            Detail-page resolver 2
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=2ab5a6f5-30e2-4599-846b-22597815e3dd&isOther=false
            :param html: page source
            :return: winning bidder and award price
            '''
            try:
                people = ''
                price = 0.0
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//div[@class='detail_contect']//p"
                node_list = text.xpath(xpathPattern)
                for node in node_list:
                    if "第一中標候選人" in node.text:
                        people_tmp = str(node.text).strip()
                        people = people_tmp[people_tmp.find('：') + 3:]
                    elif "投標報價" in node.text:
                        price_tmp = node.xpath('./span')[0].text
                        price = float(price_tmp)
                # print("中標人： {0}，中標價：{1}".format(people, price))
                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_3(html):
            '''
            Detail-page resolver 3
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=e145f187-b9d9-4573-b4b0-f5c4c66ddbdb&isOther=false
            :param html: page source
            :return: winning bidder and award price
            '''
            try:
                people = ''
                price = 0.0
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//div[@class='page_contect bai_bg']//tr"
                node_list = text.xpath(xpathPattern)
                for node in node_list:
                    ## Winning bidder
                    tmp = node.xpath('./td//span')[0].text
                    if "第一中標候選人" == tmp:
                        people = node.xpath('./td//span')[1].text

                    ## Award price
                    node_td = node.xpath('./td')
                    if len(node_td) > 3:
                        for no in node_td:
                            if len(no.xpath('./span')) > 0 and "中標價（萬元）" == no.xpath('./span')[0].text:
                                price = float(node_td[3].xpath('./span')[0].text)
                # print("中標人： {0}，中標價：{1}".format(people, price))
                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_4(html):
            '''
            Detail-page resolver 4
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=562df3b5-207a-4f2e-b3f7-3b29736ae191&isOther=false
            :param html: page source
            :return: winning bidder and award price
            '''
            try:
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//div[@class='page_contect bai_bg']//tr"
                node_list = text.xpath(xpathPattern)
                node = node_list[12]

                people_td = node.xpath('./td')[1]
                people = people_td.xpath('./p/span')[0].text

                price_td = node.xpath('./td')[2]
                price_tmp = price_td.xpath('./p/span')[0].text
                price = float(price_tmp)

                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_5(html):
            '''
            Detail-page resolver 5
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=61a3019b-33cb-44ba-a193-20c5d7f38543&isOther=false
            :param html: page source
            :return: winning bidder and award price
            '''
            try:
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//div[@class='page_contect bai_bg']//table"
                node_list = text.xpath(xpathPattern)
                tr_list = node_list[0].xpath('./tbody//tr')
                td_list = tr_list[1]
                people_td = td_list[2]
                people = people_td.xpath('./p/b/span')[0].text

                price_td = td_list[4]
                price_tmp = price_td.xpath('./p/b/span')[0].text
                price = float(price_tmp)

                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_6(html):
            '''
            Detail-page resolver 6
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=e8cc5564-4664-4d45-aabd-2690a3366e2b&isOther=false
            :param html: page source
            :return: winning bidder and award price
            '''
            try:
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//div[@class='page_contect bai_bg']//table//td[@colspan='4']//tr"
                node_list = text.xpath(xpathPattern)

                people = node_list[1].xpath('./td')[1].text

                price_tmp = node_list[1].xpath('./td')[4].text
                price = float(price_tmp)

                return people, price
            except Exception:
                return None, 0.0

        def resolve_pp_7(html):
            '''
            Detail-page resolver 7
            eg: https://www.ynggzyxx.gov.cn/jyxx/jsgcZbjggsDetail?guid=2a7c021d-db9d-4dc5-8294-39083501dd9f&isOther=false
            :param html: page source
            :return: winning bidder and award price (the price is always 0.0 for this layout)
            '''
            try:
                text = etree.HTML(html)
                xpathPattern = "//div[@class='w1200s']//div[@class='page_contect bai_bg']//table//tr"
                node_list = text.xpath(xpathPattern)
                people = node_list[9].xpath('./td')[1].xpath('./p/span')[0].text
                return people, 0.0
            except Exception:
                return None, 0.0

        print("[+] 正在解析第{0}頁資訊".format(page_num))

        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        xpathPattern = "//div/table[@id='data_tab']/tbody/tr"
        node_list = text.xpath(xpathPattern)

        for node in node_list:
            if len(node.xpath("./td")) > 0:
                project_name = node.xpath("./td//a")[0].text
                project_name_parse = project_name.replace('\n', '').replace(u'\t', '').replace(u' ', '')
                startTime = node.xpath("./td")[2].text
                start_time = self.time_parse(startTime)

                href = "https://www.ynggzyxx.gov.cn" + node.xpath('./td//a//@href')[0]

                html_second = self.get_url(href, proxy_dict)

                people, price = resolve_pp_0(html_second)
                if people == '':
                    people, price = resolve_pp_2(html_second)

                if people == '':
                    people, price = resolve_pp_1(html_second)

                if people == '':
                    people, price = resolve_pp_3(html_second)

                if people is None:
                    people, price = resolve_pp_4(html_second)

                if people is None:
                    people, price = resolve_pp_5(html_second)

                if people is None:
                    people, price = resolve_pp_6(html_second)

                if people is None:
                    people, price = resolve_pp_7(html_second)

                # Store in a dictionary
                resolveMessage = {
                    "公告名稱": project_name_parse,
                    "釋出時間": start_time,
                    "連結": href,
                    "中標公司": people,
                    "中標價格": price,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)  # Done
    def resovler_ynsggzxxt_zf_zb(self, html, page_num):
        '''
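        Resolver for the Yunnan Public Resources Trading Information Network, government procurement, award results
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries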
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))

        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        xpathPattern = "//div/table[@id='data_tab']/tbody/tr"
        node_list = text.xpath(xpathPattern)

        for node in node_list:
            if len(node.xpath("./td")) > 0:
                project_name = node.xpath("./td//a")[0].text
                project_name_parse = project_name.replace('\n', '').replace(u'\t', '').replace(u' ', '')
                startTime = node.xpath("./td")[2].text
                start_time = self.time_parse(startTime)

                href = "https://www.ynggzyxx.gov.cn" + node.xpath('./td//a//@href')[0]

                # Store in a dictionary
                resolveMessage = {
                    "公告名稱": project_name_parse,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult


    @retry(tries=3, delay=2)  # Done
    def resovler_ynsggzzw_gc_zb(self, html, page_num, proxy_dict):
        '''
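        Resolver for the Yunnan Public Resources Trading Center, construction projects, award results
        :param html: page source
        :param page_num: page number
        :param proxy_dict: proxy settings with 'ip', 'port' and 'protocol' keys, passed to get_url
        :return: list of data dictionaries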
        '''
        def resolve_pp_1(html):
            '''
            Detail-page resolver 1
            eg: https://www.ynggzy.com/jyxx/jsgcZbjggsDetail?guid=fbd514af-5716-4e30-bc1d-b42892986f85&isOther=false
            :param html: page source
            :return: winning bidder and award price (the price is kept as the raw string)
            '''
            try:
                people = ''
                price = ''
                text = etree.HTML(html)
                node_second_list = text.xpath("//div[@class='con']//tr")
                for node_second in node_second_list:
                    if "中標人：" == node_second.xpath("./td")[0].text:
                        people = node_second.xpath("./td")[1].xpath('./b//span')[0].text
                    if "中標價" in node_second.xpath("./td")[0].text:
                        totalCount = node_second.xpath("./td")[1].xpath('./b//span')[0].text
                        price = totalCount
                return people, price
            except Exception:
                return None, ''

        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []
        # Parse the source and return an element tree
        text = etree.HTML(html)
        xpathPattern = "//div/table[@id='data_tab']/tbody/tr"
        node_list = text.xpath(xpathPattern)

        # Regex that strips whitespace from field values
        strParse = re.compile(r"\s")

        for node in node_list:
            if len(node.xpath("./td")) > 0:
                # Notice title (whitespace stripped)
                title = strParse.sub("", node.xpath("./td")[1].xpath("./a")[0].text)

                # Publication date
                startTime = node.xpath("./td")[2].text
                start_time = self.time_parse(startTime)

                # Link (href) address; the detail page is fetched through the proxy
                href = "https://www.ynggzy.com" + str(node.xpath("./td/a/@href")[0])
                html_second = self.get_url(href, proxy_dict)
                people, price = resolve_pp_1(html_second)
                # Store in a dictionary
                resolveMessage = {
                    "公告標題": title,
                    "釋出時間": start_time,
                    "連結": href,
                    "中標公司": people,
                    "中標價格": price,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)  # Done
    def resovler_ynsggzzw_zf_zb(self, html, page_num):
        '''
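        Resolver for the Yunnan Public Resources Trading Center, government procurement, result announcements
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries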
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []
        # Parse the source and return an element tree
        text = etree.HTML(html)
        xpathPattern = "//div/table[@id='data_tab']/tbody/tr"
        node_list = text.xpath(xpathPattern)

        # Regex that strips whitespace from field values
        strParse = re.compile(r"\s")

        for node in node_list:
            if len(node.xpath("./td")) > 0:
                # Notice title (whitespace stripped)
                title = strParse.sub("", node.xpath("./td")[1].xpath("./a")[0].text)

                # Publication date
                startTime = node.xpath("./td")[2].text
                start_time = self.time_parse(startTime)

                # Link (href) address
                href = "https://www.ynggzy.com" + str(node.xpath("./td/a/@href")[0])
                # Store in a dictionary
                resolveMessage = {
                    "公告標題": title,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult


    @retry(tries=3, delay=2)  # Done
    def resolver_kmsgg_gc_zb(self, html, page_num):
        '''
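        Resolver for the Kunming Public Resources Trading Platform public service system, construction projects, award result announcements
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries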
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        node_list = text.xpath("//div[@class='zb_from']/table/tbody/tr")
        # Skip row 0 (presumably the header); read at most 15 rows and stop early on short pages
        for i in range(1, min(16, len(node_list))):
            # Number
            num = node_list[i].xpath("./td")[1].text

            # Project name
            project_name = (node_list[i].xpath("./@field_bdmcggbt")[0]).encode('utf8')

            # Link
            href = "https://www.kmggzy.com/Jyweb/" + str(node_list[i].xpath("./td/a/@href")[0])

            start_time = None
            # Publication date
            startTime = node_list[i].xpath("./td")[3].text
            if startTime is not None:
                start_time = self.time_parse(startTime)

            # Store in a dictionary
            if num and project_name and start_time is not None:
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)  # Done
    def resolver_kmsgg_zf_zb(self, html, page_num):
        '''
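        Resolver for the Kunming Public Resources Trading Platform public service system, government procurement, result announcements
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries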
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        node_list = text.xpath("//div[@class='zb_from']/table/tbody/tr")
        # Skip row 0 (presumably the header); read at most 15 rows and stop early on short pages
        for i in range(1, min(16, len(node_list))):
            # Number
            num = node_list[i].xpath("./td")[1].text

            # Project name
            project_name = (node_list[i].xpath("./@field_bdmcggbt")[0]).encode('utf8')

            # Link
            href = "https://www.kmggzy.com/Jyweb/" + str(node_list[i].xpath("./td/a/@href")[0])

            start_time = None
            # Publication date
            startTime = node_list[i].xpath("./td")[3].text
            if startTime is not None:
                start_time = self.time_parse(startTime)

            # Store in a dictionary
            if num and project_name and start_time is not None:
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)  # Done
    def resolver_kmsgg_gc_by(self, html, page_num):
        '''
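        Resolver for the Kunming Public Resources Trading Platform public service system, construction projects, addendum notices
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries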
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        node_list = text.xpath("//div[@class='zb_from']/table/tbody/tr")
        # Skip row 0 (presumably the header); read at most 15 rows and stop early on short pages
        for i in range(1, min(16, len(node_list))):
            # Number
            num = node_list[i].xpath("./td")[1].text

            # Project name
            project_name = (node_list[i].xpath("./@field_bdmcggbt")[0]).encode('utf8')

            # Link
            href = "https://www.kmggzy.com/Jyweb/" + str(node_list[i].xpath("./td/a/@href")[0])

            start_time = None
            # Publication date
            startTime = node_list[i].xpath("./td")[3].text
            if startTime is not None:
                start_time = self.time_parse(startTime)

            # Store in a dictionary
            if num and project_name and start_time is not None:
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult

    @retry(tries=3, delay=2)  # Done
    def resolver_kmsgg_zf_by(self, html, page_num):
        '''
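        Resolver for the Kunming Public Resources Trading Platform public service system, government procurement, addendum notices
        :param html: page source
        :param page_num: page number
        :return: list of data dictionaries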
        '''
        print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the results
        resolveResult = []

        # Parse the source and return an element tree
        text = etree.HTML(html)

        node_list = text.xpath("//div[@class='zb_from']/table/tbody/tr")
        # Skip row 0 (presumably the header); read at most 15 rows and stop early on short pages
        for i in range(1, min(16, len(node_list))):
            # Number
            num = node_list[i].xpath("./td")[1].text

            # Project name
            project_name = (node_list[i].xpath("./@field_bdmcggbt")[0]).encode('utf8')

            # Link
            href = "https://www.kmggzy.com/Jyweb/" + str(node_list[i].xpath("./td/a/@href")[0])

            start_time = None
            # Publication date
            startTime = node_list[i].xpath("./td")[3].text
            if startTime is not None:
                start_time = self.time_parse(startTime)

            # Store in a dictionary
            if num and project_name and start_time is not None:
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "釋出時間": start_time,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult


    @retry(tries=3, delay=2)  # Done
    def resolver_ynszfcgw_cg(self, html, page_num, driver_second):
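        '''
        Parser for the Yunnan Government Procurement Network (procurement results)
        :param html: page source
        :param page_num: page number
        :param driver_second: second browser driver, used to open each detail page
        :return: list of data dictionaries
        '''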
        def resolver_pp_1(url_second):
            '''
            Detail-page parser 1
            eg: https://www.ynggzy.com/jyxx/jsgcZbjggsDetail?guid=fbd514af-5716-4e30-bc1d-b42892986f85&isOther=false
            :param url_second: URL of the detail page
            :return: winning bidder and winning price
            '''
            try:
                driver_second.get(url_second)
                people = driver_second.find_element_by_id('winSupply').get_attribute('value')
                price_tmp = driver_second.find_element_by_id('winMoney').get_attribute('value')
                price = price_tmp + "萬元"
                return people, price
            except Exception:
                # Best effort: a missing element or a failed page load yields empty values
                return None, ''
        if page_num != 0:
            print("[+] 正在解析第{0}頁資訊".format(page_num))
        # List that collects the parsed rows
        resolveResult = []
        text = etree.HTML(html)
        for i in range(0, 10):
            node_list = text.xpath("//tr[@data-row-id='{0}']".format(i))

            for node in node_list:
                text_total = node.xpath('./td')[0].xpath('./a')[0].text

                # Serial number (the text before the full-width colon)
                num = text_total[:text_total.find('：')]

                # Project name (the text after the full-width colon)
                project_name = text_total[text_total.find('：') + 1:]

                # Administrative region
                area = node.xpath('./td')[2].text

                time_push = None
                # Publication time
                timePush = node.xpath('./td')[3].text
                if timePush is not None:
                    time_push = self.time_parse(timePush)

                # Link, built from the bulletin id stored on the anchor
                cursor = node.xpath('./td')[0].xpath('./a/@data-bulletin_id')[0]

                href = "http://www.yngp.com/newbulletin_zz.do?method=preinsertgomodify&operator_state=1&flag=view&bulletin_id={0}".format(
                    cursor)

                people, price = resolver_pp_1(href)

                # Store the row as a dictionary
                resolveMessage = {
                    "編號": num,
                    "工程名稱": project_name,
                    "區劃": area,
                    "釋出時間": time_push,
                    "中標公司": people,
                    "中標價格": price,
                    "連結": href,
                    "推送": False
                }
                resolveResult.append(resolveMessage)
        return resolveResult
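
The title splitting and date parsing used in resolver_ynszfcgw_cg can be tried in isolation with the standard library alone. The sketch below is only an illustration under stated assumptions: the helper name split_title and the sample strings are hypothetical and do not appear in pageResolver.py, and str.partition is swapped in for the original find-based slicing so that a title without a full-width colon is not cut short by an index of -1.

```python
# -*- coding: utf-8 -*-
import datetime


def split_title(text_total):
    """Split '<number>：<project name>' on the first full-width colon.

    Hypothetical helper, not part of pageResolver.py. Returns (None, text)
    when no colon is present instead of slicing with index -1.
    """
    num, sep, project_name = text_total.partition('：')
    if not sep:
        return None, text_total
    return num, project_name


def time_parse(current_time):
    """Same conversion as Resolver.time_parse: 'YYYY-MM-DD' string to datetime."""
    return datetime.datetime.strptime(current_time, '%Y-%m-%d')


# Made-up sample values standing in for one table row
num, project_name = split_title('YNZC2018-001：辦公設備採購')
record = {
    "編號": num,
    "工程名稱": project_name,
    "釋出時間": time_parse('2018-06-06'),
    "推送": False,
}
```

The dictionary keys match the ones the resolvers emit, so a record built this way has the same shape as the rows appended to resolveResult.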
