Converting Markdown (.md) text to HTML and PDF with Python 3

阿新 • Published: 2018-12-02

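1. Markdown text is divided into separate text blocks according to its content. Markdown syntax rules then decide how each block is marked up, and the marked-up blocks are joined to form the new file.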
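
As a rough illustration of what "text blocks" means here, the sketch below splits a string at blank lines. The name `split_blocks` and the sample text are made up for this example; the script in the next section does the same job with its own `lines` and `blocks` generators.

```python
import io

def split_blocks(text):
    """Yield each run of non-blank lines as one stripped block."""
    block = []
    # The appended blank line guarantees that the final block is flushed.
    for line in io.StringIO(text + "\n\n"):
        if line.strip():
            block.append(line)
        elif block:
            yield "".join(block).strip()
            block = []

sample = "Title\n\nFirst paragraph,\nsecond line.\n\n- item"
result = list(split_blocks(sample))
print(result)
```

Each block can then be classified (title, list item, paragraph, and so on) independently of the others.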

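2. Converting a (.md) file to a (.html) file with Python 3

　　In a cmd command prompt, change to the directory containing the (.py) file and run the script. It reads Markdown from standard input and prints HTML to standard output, so both files are passed through redirection:

　　>python md2html.py < file.md > file.html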

import sys, re

# Generator helpers
def lines(file):
    # Yield every line, then one extra blank line at the end
    for line in file: yield line
    yield '\n'

def blocks(file):
    # Yield each run of non-blank lines as a single text block
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Text block handlers
class Handler:
    """
    Base class for handlers
    """
    def callback(self, prefix, name, *args):
        method = getattr(self, prefix + name, None)
        if callable(method): return method(*args)

    def start(self, name):
        self.callback('start_', name)

    def end(self, name):
        self.callback('end_', name)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            if result is None: result = match.group(0)
            return result
        return substitution

class HTMLRenderer(Handler):
    """
    HTML handler: wraps each text block in the corresponding HTML tags
    """
    def start_document(self):
        print('<html><head><title>Python text parsing</title></head><body>')

    def end_document(self):
        print('</body></html>')

    def start_paragraph(self):
        print('<p style="color: #444;">')

    def end_paragraph(self):
        print('</p>')

    def start_heading(self):
        print('<h2 style="color: #68BE5D;">')

    def end_heading(self):
        print('</h2>')

    def start_list(self):
        print('<ul style="color: #363736;">')

    def end_list(self):
        print('</ul>')

    def start_listitem(self):
        print('<li>')

    def end_listitem(self):
        print('</li>')

    def start_title(self):
        print('<h1 style="color: #1ABC9C;">')

    def end_title(self):
        print('</h1>')

    def sub_emphasis(self, match):
        return('<em>%s</em>' % match.group(1))

    def sub_url(self, match):
        return('<a target="_blank" style="text-decoration: none;color: #BC1A4B;" href="%s">%s</a>' % (match.group(1), match.group(1)))

    def sub_mail(self, match):
        return('<a style="text-decoration: none;color: #BC1A4B;" href="mailto:%s">%s</a>' % (match.group(1), match.group(1)))

    def feed(self, data):
        print(data)


# Rules decide how each text block should be handled
class Rule:
    """
    Base class for rules
    """
    def action(self, block, handler):
        """
        Wrap the block in its markup
        """
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        return True

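class HeadingRule(Rule):
    """
    Heading rule: a single line of at most 70 characters that does not end with a colon
    """
    type = 'heading'
    def condition(self, block):
        """
        Check whether the text block matches the rule
        """
        return '\n' not in block and len(block) <= 70 and block[-1] != ':'

class TitleRule(HeadingRule):
    """
    Title rule: the first block of the document, provided it qualifies as a heading
    """
    type = 'title'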
    first = True

    def condition(self, block):
        if not self.first: return False
        self.first = False
        return HeadingRule.condition(self, block)

class ListItemRule(Rule):
    """
    List item rule: a block that starts with a hyphen
    """
    type = 'listitem'
    def condition(self, block):
        return block[0] == '-'

    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block[1:].strip())
        handler.end(self.type)
        return True

class ListRule(ListItemRule):
    """
    List rule: starts a list before the first list item and ends it after the last one
    """
    type = 'list'
    inside = False
    def condition(self, block):
        return True

    def action(self, block, handler):
        if not self.inside and ListItemRule.condition(self, block):
            handler.start(self.type)
            self.inside = True
        elif self.inside and not ListItemRule.condition(self, block):
            handler.end(self.type)
            self.inside = False
        return False

class ParagraphRule(Rule):
    """
    Paragraph rule: the fallback for any block no other rule handles
    """
    type = 'paragraph'

    def condition(self, block):
        return True

class Code(Rule):
    '''
    Code block rule
    Syntax highlighting rule
    ...
    '''
    pass


# Parse the whole text
class Parser:
    """
    Base class for parsers
    """
    def __init__(self, handler):
        self.handler = handler
        self.rules = []
        self.filters = []

    def addRule(self, rule):
        """
        Add a rule
        """
        self.rules.append(rule)

    def addFilter(self, pattern, name):
        """
        Add a filter
        """
        def filter(block, handler):
            return re.sub(pattern, handler.sub(name), block)
        self.filters.append(filter)

    def parse(self, file):
        """
        Parse the file block by block
        """
        self.handler.start('document')
        for block in blocks(file):
            for filter in self.filters:
                block = filter(block, self.handler)
            for rule in self.rules:
                if rule.condition(block):
                    last = rule.action(block, self.handler)
                    if last: break
        self.handler.end('document')

class BasicTextParser(Parser):
    """
    Plain text parser
    """
    def __init__(self, handler):
        Parser.__init__(self, handler)
        self.addRule(ListRule())
        self.addRule(ListItemRule())
        self.addRule(TitleRule())
        self.addRule(HeadingRule())
        self.addRule(ParagraphRule())

        self.addFilter(r'\*(.+?)\*', 'emphasis')
        self.addFilter(r'(http://[\.a-zA-Z/]+)', 'url')
        self.addFilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', 'mail')


"""
Run the test program
"""
handler = HTMLRenderer()
parser = BasicTextParser(handler)
parser.parse(sys.stdin)
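
The filter mechanism in the script relies on `re.sub` accepting a function as the replacement. A minimal sketch of that idea, detached from the `Parser` and `Handler` classes, might look like this; the names `emphasize`, `linkify`, `filters` and `apply_filters` are invented for the example, while the two patterns are the ones registered in `BasicTextParser`.

```python
import re

def emphasize(match):
    # Wrap the captured text (group 1) in <em> tags.
    return "<em>%s</em>" % match.group(1)

def linkify(match):
    # Turn the captured URL into a link whose text is the URL itself.
    url = match.group(1)
    return '<a href="%s">%s</a>' % (url, url)

# (pattern, replacement function) pairs, applied in order to every block.
filters = [
    (r"\*(.+?)\*", emphasize),
    (r"(http://[\.a-zA-Z/]+)", linkify),
]

def apply_filters(block):
    for pattern, repl in filters:
        block = re.sub(pattern, repl, block)
    return block

html = apply_filters("See *this* page: http://example.com/doc")
print(html)
```

Because every filter is just a pattern plus a function, supporting another inline marker only takes one more entry in the list.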

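3. Converting text to a PDF file with Python 3

　　The script passes its module docstring (__doc__) to docopt, so the usage line and the options listed below need to appear in a docstring at the top of md2pdf.py. The PDF step calls the external wkhtmltopdf tool.

　　Command: >python md2pdf.py <sourcefile> <outputfile> [options]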

Options:
    -h --help     show help document.
    -v --version  show version information.
    -o --output   translate sourcefile into html file.
    -p --print    translate sourcefile into pdf file and html file respectively.
    -P --Print    translate sourcefile into pdf file only.

import os, re
import sys
from enum import Enum
from subprocess import call
from functools import reduce

from docopt import docopt

__version__ = '1.0'

# Define three enum classes
# Table state
class TABLE(Enum):
    Init = 1
    Format = 2
    Table = 3

# Ordered list state
class ORDERLIST(Enum):
    Init = 1
    List = 2

# Block state
class BLOCK(Enum):
    Init = 1
    Block = 2
    CodeBlock = 3

# Define the global state and initialise it
table_state = TABLE.Init
orderList_state = ORDERLIST.Init
block_state = BLOCK.Init
is_code = False
is_normal = True

temp_table_first_line = []
temp_table_first_line_str = ""

need_mathjax = False


def test_state(input):
    # is_normal must be declared global here, otherwise the assignments below only set a local variable
    global table_state, orderList_state, block_state, is_code, is_normal, temp_table_first_line, temp_table_first_line_str
    Code_List = ["python\n", "c++\n", "c\n"]

    result = input

    # Build the regular expression
    # Match the block delimiter
    pattern = re.compile(r'```(\s)*\n')
    a = pattern.match(input)

    # Plain block
    if  a and block_state == BLOCK.Init:
        result = "<blockquote>"
        block_state = BLOCK.Block
        is_normal = False
    # Code block with a language tag
    elif len(input) > 4 and input[0:3] == '```' and (input[3:9] == "python" or input[3:6] == "c++" or input[3:4]== "c") and block_state == BLOCK.Init:
        block_state = BLOCK.Block
        result = "<code><br>"
        is_code = True
        is_normal = False
    # End of block
    elif block_state == BLOCK.Block and input == '```\n':
        if is_code:
            result = "</code>"
        else:
            result = "</blockquote>"
        block_state = BLOCK.Init
        is_code = False
        is_normal = False
    elif block_state == BLOCK.Block:
        # Inside a block: keep whitespace visible by turning it into non-breaking spaces
        pattern = re.compile(r'[\n\r\v\f\ ]')
        result = pattern.sub("&nbsp;", result)
        pattern = re.compile(r'\t')
        result = pattern.sub("&nbsp;" * 4, result)
        result = "<span>" + result + "</span><br>"
        is_normal = False
    # Parse ordered lists
    if len(input) > 2 and input[0].isdigit() and input[1] == '.' and orderList_state == ORDERLIST.Init:
        orderList_state = ORDERLIST.List
        result = "<ol><li>" + input[2:] + "</li>"
        is_normal = False
    elif len(input) > 2 and  input[0].isdigit() and input[1] == '.' and orderList_state == ORDERLIST.List:
        result = "<li>" + input[2:] + "</li>"
        is_normal = False
    elif orderList_state == ORDERLIST.List and (len(input) <= 2 or input[0].isdigit() == False or input[1] != '.'):
        result = "</ol>" + input
        orderList_state = ORDERLIST.Init

    # Parse tables
    pattern = re.compile(r'^((.+)\|)+((.+))$')
    match = pattern.match(input)
    if match:
        l = input.split('|')
        l[-1] = l[-1][:-1]
        # Drop empty strings at either end of the list
        if l[0] == '':
            l.pop(0)
        if l[-1] == '':
            l.pop(-1)
        if table_state == TABLE.Init:
            table_state = TABLE.Format
            temp_table_first_line = l
            temp_table_first_line_str = input
            result = ""
        elif table_state == TABLE.Format:
            # If this line is the separator between the header and the table body
            if reduce(lambda a, b: a and b, [all_same(i,'-') for i in l], True):
                table_state = TABLE.Table
                result = "<table><thead><tr>"
                is_normal = False
                
                # Add the header cells
                for i in temp_table_first_line:
                    result += "<th>" + i + "</th>"
                result += "</tr>"
                result += "</thead><tbody>"
                is_normal = False
            else:
                result = temp_table_first_line_str + "<br>" + input
                table_state = TABLE.Init

        elif table_state == TABLE.Table:
            result = "<tr>"
            for i in l:
                result += "<td>" + i + "</td>"
            result += "</tr>"

    elif table_state == TABLE.Table:
        table_state = TABLE.Init
        result = "</tbody></table>" + result
    elif table_state == TABLE.Format:
        pass
    
    return result

# Check whether lst consists entirely of the character sym
def all_same(lst, sym):
    return not lst or sym * len(lst) == lst

# Handle headings
def handleTitle(s, n):
    temp = "<h" + repr(n) + ">" + s[n:] + "</h" + repr(n) + ">"
    return temp

# Handle unordered lists
def handleUnorderd(s):
    s = "<ul><li>" + s[1:]
    s += "</li></ul>"
    return s


def tokenTemplate(s, match):
    pattern = ""
    if match == '*':
        pattern = r"\*([^\*]*)\*"
    if match == '~~':
        pattern = r"\~\~([^\~]*)\~\~"
    if match == '**':
        pattern = r"\*\*([^\*]*)\*\*"
    return pattern

# Handle inline markers such as **, *, ~~
def tokenHandler(s):
    l = ['b', 'i', 'S']
    j = 0
    for i in ['**', '*', '~~']:
        pattern = re.compile(tokenTemplate(s,i))
        match = pattern.finditer(s)
        k = 0
        for a in match:
            if a:
                content = a.group(1)
                x,y = a.span()
                c = 3
                if i == '*':
                    c = 5
                s = s[:x+c*k] + "<" + l[j] + ">" + content + "</" + l[j] + ">" + s[y+c*k:]
                k += 1
        pattern = re.compile(r'\$([^\$]*)\$')
        a = pattern.search(s)
        if a:
            global need_mathjax
            need_mathjax = True
        j += 1
    return s

# Handle links, images and superscripts
def link_image(s):
    # Hyperlinks: [text](url); the lookbehind skips image syntax ![alt](url).
    # re.sub replaces every match at once, so no manual offset bookkeeping is needed.
    pattern = re.compile(r'(?<!!)\[(.*?)\]\((.*?)\)')
    s = pattern.sub(r'<a href="\2" target="_blank">\1</a>', s)

    # Image links: ![alt](url)
    pattern = re.compile(r'!\[(.*?)\]\((.*?)\)')
    s = pattern.sub(r'<img src="\2" alt="\1">', s)

    # Superscripts: x^[n]
    pattern = re.compile(r'(.)\^\[([^\]]*)\]')
    s = pattern.sub(r'\1<sup>\2</sup>', s)

    return s


def parse(input):
    global block_state, is_normal
    is_normal = True
    result = input

    # Update the parsing state for the current input line
    result = test_state(input)
    
    if block_state == BLOCK.Block:
        return result

    # Detect the heading marker #
    title_rank = 0
    for i in range(6, 0, -1):
        if input[:i] == '#'*i:
            title_rank = i
            break
    if title_rank != 0:
        # Convert the heading into the corresponding HTML
        result = handleTitle(input, title_rank)
        return result

    # Detect the horizontal rule marker --
    if len(input) > 2 and all_same(input[:-1], '-') and input[-1] == '\n':
        result = "<hr>"
        return result

    # Parse unordered lists
    unorderd = ['+', '-']
    if result != "" and result[0] in unorderd :
        result = handleUnorderd(result)
        is_normal = False

    f = input[0]
    count = 0
    sys_q = False
    while f == '>':
        count += 1
        f = input[count]
        sys_q = True
    if sys_q:
        result = "<blockquote style=\"color:#8fbc8f\"> "*count + "<b>" + input[count:] + "</b>" + "</blockquote>"*count
        is_normal = False

    # Handle inline markers such as **, *, ~~
    result = tokenHandler(result)

    # Parse links and images
    result = link_image(result)
    pa = re.compile(r'^(\s)*$')
    a = pa.match(input)
    if input[-1] == "\n" and is_normal == True and not a :
        result+="<br>"

    return result 


def run(source_file, dest_file, dest_pdf_file, only_pdf):
    # Source file name
    file_name = source_file
    # Name of the generated HTML file
    dest_name = dest_file
    # Name of the generated PDF file
    dest_pdf_name = dest_pdf_file

    # Check the file extension
    _, suffix = os.path.splitext(file_name)
    if suffix not in [".md",".markdown",".mdown",".mkd"]:
        print('Error: the file should be in markdown format')
        sys.exit(1)

    if only_pdf:
        dest_name = ".~temp~.html"


    # The output declares charset utf-8, so read and write with that encoding
    f = open(file_name, "r", encoding="utf-8")
    f_r = open(dest_name, "w", encoding="utf-8")

    # Write the HTML styles and wrapper elements to the file
    f_r.write("""<style type="text/css">div {display: block;font-family: "Times New Roman",Georgia,Serif}\
            #wrapper { width: 100%;height:100%; margin: 0; padding: 0;}#left { float:left; \
            width: 10%;  height: 100%;  }#second {   float:left;   width: 80%;height: 100%;   \
            }#right {float:left;  width: 10%;  height: 100%; \
            }</style><div id="wrapper"> <div id="left"></div><div id="second">""")
    f_r.write("""<meta charset="utf-8"/>""")
    
    # Parse the markdown file line by line
    for eachline in f:
        result = parse(eachline)
        if result != "":
            f_r.write(result)

    f_r.write("""<br><br></div><div id="right"></div></div>""")

    # Formula support
    global need_mathjax
    if need_mathjax:
        f_r.write("""<script type="text/x-mathjax-config">\
        MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});\
        </script><script type="text/javascript" \
        src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>""")
    # Remember to close the files once the work is done
    f_r.close()
    f.close()

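    # Call the external wkhtmltopdf tool to convert the HTML file to PDF
    if dest_pdf_name != "" or only_pdf:
        call(["wkhtmltopdf", dest_name, dest_pdf_name])
    # Delete the intermediate HTML file if it is no longer needed
    # (os.remove also works under cmd, where the rm command is not available)
    if only_pdf:
        os.remove(dest_name)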


# Main function
def main():
    dest_file = "translation_result.html"
    dest_pdf_file = "translation_result.pdf"

    only_pdf = False

    args = docopt(__doc__, version=__version__)

    dest_file = args['<outputfile>'] if args['--output'] else dest_file

    dest_pdf_file = args['<outputfile>'] if args['--print'] or args['--Print'] else ""

    run(args['<sourcefile>'], dest_file, dest_pdf_file, args['--Print'])


if __name__=="__main__":
    main()
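
The PDF step only works when wkhtmltopdf is installed and on the PATH. One possible way to guard the call is sketched below; the helper name `html_to_pdf`, its `tool` parameter and the file names are assumptions made for this example and are not part of the script above.

```python
import shutil
import subprocess

def html_to_pdf(html_path, pdf_path, tool="wkhtmltopdf"):
    """Run the converter if it is installed; return True on success."""
    executable = shutil.which(tool)
    if executable is None:
        print("Error: %s was not found on PATH" % tool)
        return False
    # The converter takes the input file followed by the output file.
    return subprocess.call([executable, html_path, pdf_path]) == 0

# A deliberately bogus tool name shows the failure path without needing the real tool.
ok = html_to_pdf("in.html", "out.pdf", tool="no-such-converter-for-demo")
print(ok)
```

Returning a boolean lets the caller decide whether to keep the intermediate HTML file when the conversion could not run.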
