[Python] [Crawler] 8. An automated crawler for batch scraping and pushing tender and bid-award notices from government websites: the data push module

阿新 • Published: 2018-11-09

1.Intro

2.Source

(1)dataPusher

(2)dataPusher_HTML

1.Intro

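File names: dataPusher.py, dataPusher_HTML.py

Module name: data push module

Libraries used:

smtplib	email	pyExcelerator
sys	time	datetime

Custom modules imported: dataDisposer, Console_Color, configManager

Function: fetch data from the database and generate an HTML file, update the push flag, format the email address, and send the email.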
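
The address-formatting step can be tried on its own. The sketch below assumes Python 3, whereas the listings in this post target Python 2. The helper name `format_address` mirrors the method in the source, but this standalone version and the `bot@example.com` address are placeholders for illustration only.

```python
from email.header import decode_header
from email.utils import formataddr, parseaddr


def format_address(address):
    """Encode the display name of 'Name <addr>' so it is safe in a mail header.

    Standalone illustration, not the project's own implementation.
    """
    name, addr = parseaddr(address)
    # formataddr() RFC 2047-encodes a non-ASCII display name with the given charset
    return formataddr((name, addr), charset="utf-8")


if __name__ == "__main__":
    # bot@example.com is a placeholder address
    header_value = format_address("爬蟲機器人 <bot@example.com>")
    print(header_value)
    # Decoding the encoded display name gives the original text back
    raw, charset = decode_header(header_value.split(" <")[0])[0]
    print(raw.decode(charset))
```

An address without a display name passes through unchanged, so the same helper can be applied to every sender or recipient string.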

2.Source

(1) dataPusher

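#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
# Author  : YSW
# Time    : 2018/6/6 14:05
# File    : dataPusher.py
# Version : 1.0
# Describe: Data push module (legacy push method)
# Update  :
'''

'''
    smtplib sends the mail:
        it connects to the mail server, logs in to the mailbox, and sends
        the message (sender, recipient, content).

    email builds the mail:
        it assembles the parts shown in a mail client, such as the sender,
        recipient, subject, body, and attachments.

    xlwt:
        works with Excel files (mentioned for reference; not imported here).

    pyExcelerator:
        works with Excel files; convenient for writing them.

'''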
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.header import Header
from email import encoders
from email.mime.base import MIMEBase
from email.utils import parseaddr, formataddr
import time
from pyExcelerator import *

class DataWrite(object):
    def __init__(self):
        print("[*] Initializing the data writer module")
        self.excel_Workbook = Workbook()
        self.excel_Workbook_parse = Workbook()

    def excel_name(self, logic_file_type):
        '''
        Build the Excel file name from the current time.
        The name format is:
            YYYYMMDD_HHMMSS
            e.g. 20180619_161819
        :param logic_file_type: flag that selects the file name suffix
            (0: none, 1: [keyword], 2: _ZB, 3: _ZB[keyword])
        :return: Excel file name
        '''
        print("[+] Creating the file name")
        current_time = time.strftime('%Y%m%d_%H%M%S')
        file_name = ""

        if logic_file_type == 0:
            file_name = r".\history_file\{0}.xls".format(current_time)
        elif logic_file_type == 1:
            file_name = r".\history_file\{0}[keyword].xls".format(current_time)
        elif logic_file_type == 2:
            file_name = r".\history_file\{0}_ZB.xls".format(current_time)
        elif logic_file_type == 3:
            file_name = r".\history_file\{0}_ZB[keyword].xls".format(current_time)

        print("[+] File name created")
        return file_name

    def excel_header(self, row, excel_sheet, excel_head_data, excel_sheet_name):
        '''
        Write the header row of an Excel sheet.
        :param row: row index of the header
        :param excel_sheet: the current sheet in the workbook
        :param excel_head_data: list of column titles
        :param excel_sheet_name: sheet name
        :return: True on success, False on failure
        '''
        print("[*] Writing the header, sheet name: {0}".format(excel_sheet_name))
        try:
            for index, data in enumerate(excel_head_data):
                excel_sheet.write(row, index, data)
            print("[+] Header written")
            return True
        except Exception as e:
            print("[-] Failed to write the header")
            print("ERROR: " + str(e))
            return False

    def excel_write(self, excel_sheet_name, excel_head_data, excel_data, logic_file_type):
        '''
        Write data to an Excel file.
        :param excel_sheet_name: name of the sheet
        :param excel_head_data: list of column titles
        :param excel_data: records to write
        :param logic_file_type: flag that tells whether this is a keyword-filtered file
        :return: path of the generated Excel file
        '''
        excel_name = self.excel_name(logic_file_type)
        try:
            print("[*] Writing the file")
            # Create the corresponding sheet in the Excel file
            excel_sheet = self.excel_Workbook.add_sheet(excel_sheet_name)

            if self.excel_header(0, excel_sheet, excel_head_data, excel_sheet_name):
                # Row 0 holds the header, so data rows start at 1
                for index, data in enumerate(excel_data, 1):
                    for column_index, item in enumerate(excel_head_data):
                        excel_sheet.write(index, column_index, data[item])
                self.excel_Workbook.save(excel_name)
            print("[+] File written")
            return excel_name
        except Exception as e:
            print("[-] Failed to write the file")
            print("ERROR: " + str(e))
            return excel_name

    def excel_write_parse(self, excel_sheet_name, excel_head_data, excel_data, logic_file_type):
        '''
        Write data to an Excel file (after filtering).
        :param excel_sheet_name: name of the sheet
        :param excel_head_data: list of column titles
        :param excel_data: records to write
        :param logic_file_type: flag that tells whether this is a keyword-filtered file
        :return: path of the generated Excel file
        '''
        excel_name = self.excel_name(logic_file_type)
        try:
            print("[*] Writing the file")
            # Create the corresponding sheet in the Excel file
            excel_sheet = self.excel_Workbook_parse.add_sheet(excel_sheet_name)

            if self.excel_header(0, excel_sheet, excel_head_data, excel_sheet_name):
                # Row 0 holds the header, so data rows start at 1
                for index, data in enumerate(excel_data, 1):
                    for column_index, item in enumerate(excel_head_data):
                        excel_sheet.write(index, column_index, data[item])
                self.excel_Workbook_parse.save(excel_name)
            print("[+] File written")
            return excel_name
        except Exception as e:
            print("[-] Failed to write the file")
            print("ERROR: " + str(e))
            return excel_name

class DataSend(object):
    def __init__(self):
        print("[*] Initializing the data push module")

    def format_address(self, address):
        '''
        Format an email address.
        :param address: email address, optionally with a display name
        :return: the formatted email address
        '''
        print("[+] Formatting the email address")
        name, addr = parseaddr(address)
        print("[+] Formatting done")
        return formataddr((Header(name, 'utf-8').encode(), addr))

    def send_mail(self, body, attachment):
        '''
        Send the email.
        :param body: email body
        :param attachment: list of attachment paths
        :return: True if the email was sent, otherwise False
        '''
        print("[+] Start sending the email...")
        # SMTP server to send through
        smtp_server = 'smtp.qq.com'
        # Sender account and password
        from_mail = 'sender email address'
        mail_pass = 'SMTP service password of the mailbox'
        # Recipient mailbox
        to_mail = 'recipient email address'

        # A MIMEMultipart object represents the email itself
        msg = MIMEMultipart()

        # Header encodes non-ASCII text such as Chinese
        msg['From'] = self.format_address('Crawler Bot <%s>' % from_mail).encode()
        msg['To'] = to_mail
        msg['Subject'] = Header("Today's tender notices", 'utf-8').encode()

        # 'plain' means plain text
        msg.attach(MIMEText(body, 'plain', 'utf-8'))
        # Read each attachment in binary mode
        if len(attachment) != 0:
            for file_path in attachment:
                # Strip the directory part (paths use Windows separators)
                attachment_name = file_path[str(file_path).rfind('\\') + 1:]
                with open(file_path, 'rb') as excel:
                    # MIMEBase takes the MIME main type and subtype, not the file name
                    mime = MIMEBase('application', 'vnd.ms-excel')

                    # filename is the attachment name shown to the recipient
                    mime.add_header('Content-Disposition', 'attachment',
                                    filename=attachment_name)

                    # Load the attachment content
                    mime.set_payload(excel.read())
                    encoders.encode_base64(mime)

                    # Attach it to the email
                    msg.attach(mime)

        print("[+] Connecting to the SMTP server")
        email = smtplib.SMTP_SSL(smtp_server, 465)
        print("[+] Connected")
        print("[+] Authenticating with the SMTP service")
        # login() raises smtplib.SMTPAuthenticationError when the server rejects
        # the credentials, so the code check below is only a safeguard
        login_code = email.login(from_mail, mail_pass)
        # Compare by value: "is" tests object identity, not equality
        if login_code[0] == 235:
            print("[+] Authenticated")
        else:
            print("[-] Authentication failed")
            return False
        try:
            # as_string() turns the message object into a str
            print("[+] Sending the email")
            email.sendmail(from_mail, to_mail, msg.as_string())
            email.quit()
            print("[+] Email sent")
            return True
        except Exception as e:
            print("[-] Sending failed")
            print("ERROR: " + str(e))
            return False
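
For readers on Python 3, the message-building half of `send_mail` can be sketched with the newer `email.message.EmailMessage` API. This is a minimal sketch under assumptions, not the author's code: `build_message`, the example addresses, and the sample file name are all hypothetical, and nothing is sent over the network.

```python
import os
import tempfile
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, attachment_paths):
    """Assemble a mail with a plain-text body and Excel attachments.

    Hypothetical helper for illustration; not part of dataPusher.py.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    for path in attachment_paths:
        with open(path, "rb") as handle:
            # The MIME type is given explicitly; the file name is only metadata
            msg.add_attachment(
                handle.read(),
                maintype="application",
                subtype="vnd.ms-excel",
                filename=os.path.basename(path),
            )
    return msg


if __name__ == "__main__":
    # Write a throwaway file so the example runs without real data
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "20180619_161819.xls")
        with open(sample, "wb") as handle:
            handle.write(b"placeholder bytes")
        message = build_message(
            "bot@example.com", "reader@example.com",
            "Today's tender notices", "See the attached file.", [sample],
        )
    for part in message.iter_attachments():
        print(part.get_filename(), part.get_content_type())
```

A message built this way could then be handed to `smtplib.SMTP_SSL(...).send_message(msg)` after logging in, which replaces the manual `as_string()` call.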

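The timestamped file name built by `excel_name` can also be expressed with a lookup table and `os.path.join`, which avoids hard-coded Windows separators. The sketch below assumes Python 3; `build_file_name`, `SUFFIXES`, and the `folder` default are hypothetical names chosen for illustration. Unlike the original, an unknown flag raises `KeyError` instead of returning an empty string.

```python
import os
from datetime import datetime

# Suffix per file type flag, mirroring the if/elif chain in excel_name()
SUFFIXES = {0: "", 1: "[keyword]", 2: "_ZB", 3: "_ZB[keyword]"}


def build_file_name(logic_file_type, extension, now=None, folder="history_file"):
    """Return a path such as <folder>/20180619_161819_ZB.xls for flag 2.

    Hypothetical helper for illustration; not part of the original module.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return os.path.join(folder, stamp + SUFFIXES[logic_file_type] + "." + extension)


if __name__ == "__main__":
    print(build_file_name(0, "xls"))
    print(build_file_name(3, "html", now=datetime(2018, 6, 19, 16, 18, 19)))
```

Passing `now` explicitly keeps the function easy to test, and the same helper covers both the `.xls` and `.html` variants.
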
(2) dataPusher_HTML

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
# Author  : YSW
# Time    : 2018/8/14 14:05
# File    : dataPusher_HTML.py
# Version : 1.0
# Describe: Data push module (HTML version)
# Update  :
'''

import sys
import time
from Lib import Console_Color
import configManager
import dataDisposer
import datetime
# Python 2 only: make utf-8 the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')

# Keyword list
KEY_WORD = []
# Title field name of each table
TABLE_TITLE = configManager.table_title
TENDER = dataDisposer.tenderDB

# Database
TENDER_TABLE = dataDisposer.DataOperate.dataOperate()

# Time
DATE = dataDisposer.current_time()
TODAY_TIME = datetime.datetime(DATE.year, DATE.month, DATE.day, 0, 0, 0)


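class HTML_Content(object):
    def __init__(self):
        Console_Color.print_color("[*] Initializing the HTML data writer module")

    def get_data(self, table_name):
        '''
        Fetch data.
        :param table_name: table name
        :return: list of records
        '''
        tenderTable = TENDER_TABLE[table_name]
        # Fetch today's records. Field names are kept as stored in the database:
        # '釋出時間' is the publish time, '推送' is the pushed flag.
        list_data = list(tenderTable.find(
            {
                '釋出時間': {"$gte": TODAY_TIME},
                # '推送': False
            })
        )
        # Mark every unpushed record as pushed.
        # update() is the legacy pymongo API; newer releases use update_many().
        tenderTable.update(
            {'推送': False},
            {'$set': {'推送': True}},
            multi=True,
            upsert=True
        )
        return list_data

    def delete_data(self, table_name):
        '''
        Remove records whose link is empty.
        :param table_name: table name
        '''
        sheet = TENDER[table_name]
        # '連結' is the link field
        sheet.remove({"連結": None})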
    def current_time(self):
        time_parse = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        return time_parse

    def html_name(self, logic_file_type):
        '''
        Build the HTML file name from the current time.
        The name format is:
        YYYYMMDD_HHMMSS
        e.g. 20180619_161819
        :param logic_file_type: flag that selects the file name suffix
            (0: none, 1: [keyword], 2: _ZB, 3: _ZB[keyword])
        :return: HTML file name
        '''
        Console_Color.print_color("[+] Creating the file name")
        current_time = time.strftime('%Y%m%d_%H%M%S')
        file_name = ""
        if logic_file_type == 0:
            file_name = r".\history_file\{0}.html".format(current_time)
        elif logic_file_type == 1:
            file_name = r".\history_file\{0}[keyword].html".format(current_time)
        elif logic_file_type == 2:
            file_name = r".\history_file\{0}_ZB.html".format(current_time)
        elif logic_file_type == 3:
            file_name = r".\history_file\{0}_ZB[keyword].html".format(current_time)
        Console_Color.print_color("[+] File name created")
        return file_name

    def __html_1(self, title, name):
        '''
        First part of the HTML page
        :param title: page title, e.g. "Tender notices"
        :param name: name of the current page, e.g. "Today's tender file"
        :return: the first part of the page
        '''
        # Description line shown under the page name
        desc = "Push time: {0}".format(self.current_time())
        html1 = """
        <html>
        <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>{0}</title>
        </head>
        <body bgcolor="white">
        <center><h2>{1}</h2></center>
        <p align="center">{2}</p>
        <hr width="100%">
        <br>
        """.format(title, name, desc)
        return html1

    def __html_content_header(self, current_website_name):
        '''
        Heading that separates the sections of each website
        :param current_website_name: heading text
        :return: page fragment containing the heading
        '''
        Console_Color.print_color("[+] Creating the website heading")
        html_header = """
        <hr width="100%" style="margin-top:-5px;border:3px solid blue;"/>
        <h3>{0}</h3>
        """.format(current_website_name)
        return html_header

    def __html_a(self, url, time_parse, name, dict_data):
        '''
        Main content of one record
        :param url: link to the detail page
        :param time_parse: publish time
        :param name: title
        :param dict_data: remaining fields of the record
        :return: the main content
        '''
        Console_Color.print_color("[+] Writing main content: {0}".format(name))
        html_a = """
        <hr width="100%">
        ├─<a>[{1}] #### </a><a href="{0}" target="_blank">{2}</a><br>
        """.format(url, time_parse, name)
        html_a_second = ""
        for key, value in dict_data.items():
            html_a_second_tmp = """
            ├───────<a>{0}</a><br>
            """.format("{0}: {1}".format(key, value))
            html_a_second += html_a_second_tmp

        html = html_a + html_a_second + "<hr width='100%'>"
        return html

    # Fixed closing part
    def __html2(self):
        '''
        Second part of the HTML page
        :return: the second part of the page
        '''
        html2 = """
        </body>
        </html>
        """
        return html2

    def html_content_func(self, list_data, current_website_name):
        '''
        Build the main content of the page.
        :param list_data: list of records
        :param current_website_name: name of the current website
        :return: the page content
        '''
        print("[*] Writing the page data")
        html_content = self.__html_content_header(current_website_name)
        for data in list_data:
            # '連結' is the link field
            url = str(data[u"連結"]).encode('utf-8')
            data.pop(u"連結")
            # The title field differs between tables, so try each known name:
            # '工程名稱' (project name), '公告標題' (notice title), '公告名稱' (notice name)
            try:
                project_name = str(data[u"工程名稱"]).encode('utf-8')
                data.pop(u"工程名稱")
            except KeyError:
                try:
                    project_name = str(data[u"公告標題"]).encode('utf-8')
                    data.pop(u"公告標題")
                except KeyError:
                    project_name = str(data[u"公告名稱"]).encode('utf-8')
                    data.pop(u"公告名稱")

            # '釋出時間' is the publish time, '推送' the pushed flag
            time_parse = str(data[u"釋出時間"]).encode('utf-8')
            data.pop(u"釋出時間")
            data.pop(u"_id")
            data.pop(u"推送")
            html_content += self.__html_a(url, time_parse, project_name, data) + '\n'
        Console_Color.print_color("[+] Page data written")
        return html_content

    def html_engine(self, title, name, html_content):
        '''
        HTML generator
        :param title: page title, e.g. "Tender notices"
        :param name: name of the current page, e.g. "Today's tender file"
        :param html_content: main content, already grouped under website headings
        :return: the complete page
        '''
        Console_Color.print_color("[*] Generating the HTML page")
        html = \
                self.__html_1(title, name) \
                + "\n" \
                + html_content \
                + "\n" \
                + self.__html2()
        Console_Color.print_color("[+] Page generated")
        return html

    def html_write(self, title, name, dict_html_data_name, logic_file_type):
        '''
        Write the HTML file.
        :param title: page title
        :param name: name of the current page
        :param dict_html_data_name: dict that maps database table names to website names
        :param logic_file_type: file type flag
        :return: path of the HTML file, or '' when there is nothing to write
        '''
        html_file_name = self.html_name(logic_file_type)
        html_con = ""
        for table_name, table_value in dict_html_data_name.items():
            self.delete_data(table_name)
            current_website_name = table_value
            list_data = self.get_data(table_name)
            if not list_data:
                continue
            html_content = self.html_content_func(list_data, current_website_name)
            html_con += html_content
        if html_con == "":
            return ''
        html = self.html_engine(title, name, html_con)
        with open(html_file_name, "w") as f:
            f.write(html)
        return html_file_name

    def html_write_keywords(self, title, name, dict_html_data_name, logic_file_type):
        '''
        Write the HTML file (with keyword filtering).
        :param title: page title
        :param name: name of the current page
        :param dict_html_data_name: dict that maps database table names to website names
        :param logic_file_type: file type flag
        :return: path of the HTML file, or '' when there is nothing to write
        '''
        html_file_name = self.html_name(logic_file_type)
        html_con = ""
        for table_name, table_value in dict_html_data_name.items():
            self.delete_data(table_name)
            current_website_name = table_value
            list_data = self.get_data(table_name)
            # Read the keyword file and build the keyword list
            with open(r".\keyword_file\keyword.txt", 'r') as f:
                line = f.read()
                if line not in KEY_WORD:
                    KEY_WORD.append(line)
            # Skip blank lines: an empty keyword would match every title
            key_word = [key.strip() for key in str(KEY_WORD[0]).split('\n') if key.strip()]

            # Filter the records by keyword
            list_data_parse = []
            for data in list_data:
                for key in key_word:
                    # Look up the title field of this table and check whether it contains the keyword
                    if key in data[TABLE_TITLE[table_name]] and data not in list_data_parse:
                        list_data_parse.append(data)
            if not list_data_parse:
                continue
            html_content = self.html_content_func(list_data_parse, current_website_name)
            html_con += html_content
        if html_con == "":
            return ''
        html = self.html_engine(title, name, html_con)
        with open(html_file_name, "w") as f:
            f.write(html)
        return html_file_name
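
The keyword filtering inside `html_write_keywords` can be isolated into two small functions, which makes the matching rule easier to test. The sketch assumes Python 3; `load_keywords`, `filter_by_keywords`, and the sample records with a `title` field are hypothetical and stand in for the keyword file and the per-table title field from `configManager.table_title`.

```python
def load_keywords(text):
    """Split keyword file content into keywords, one per line, skipping blanks.

    Hypothetical helper for illustration; not part of dataPusher_HTML.py.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def filter_by_keywords(records, title_field, keywords):
    """Keep the records whose title contains at least one keyword."""
    return [
        record for record in records
        if any(keyword in record[title_field] for keyword in keywords)
    ]


if __name__ == "__main__":
    # Sample data invented for the example
    keywords = load_keywords("bridge\n\nschool\n")
    records = [
        {"title": "river bridge repair"},
        {"title": "office furniture purchase"},
        {"title": "primary school canteen"},
    ]
    for record in filter_by_keywords(records, "title", keywords):
        print(record["title"])
```

Because `any()` stops at the first matching keyword, each record is appended at most once, so no separate duplicate check is needed. Matching is a plain case-sensitive substring test, as in the original loop.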

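The page fragments above insert database values into the markup as they are. If a title or field could contain characters such as `<` or `&`, escaping them first keeps the generated page well-formed. The following is a minimal Python 3 sketch of that idea; `render_entry` and the sample values are hypothetical, and the layout only approximates the fragment produced by `__html_a`.

```python
from html import escape


def render_entry(url, published, title, fields):
    """Render one notice as an HTML fragment, escaping every dynamic value.

    Hypothetical helper for illustration; not part of dataPusher_HTML.py.
    """
    lines = [
        '<hr width="100%">',
        '├─<a>[{0}] #### </a><a href="{1}" target="_blank">{2}</a><br>'.format(
            escape(published), escape(url, quote=True), escape(title)
        ),
    ]
    for key, value in fields.items():
        lines.append(
            "├───────<a>{0}: {1}</a><br>".format(escape(str(key)), escape(str(value)))
        )
    lines.append('<hr width="100%">')
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample values invented for the example
    print(render_entry(
        "http://example.com/notice?id=1&page=2",
        "2018-06-19",
        "<Road> works tender",
        {"budget": "5 < 10"},
    ))
```

Escaping happens at render time, so the stored records stay unchanged and only the generated page is affected.
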