
Converting Word documents to HTML and PDF with Python

The code below was found via Baidu:

#!/usr/bin/env python
#coding=utf-8
from win32com import client as wc

# Start Word through COM and open the source document
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('e:/1.doc')
doc.SaveAs('e:/1.html', 8)   # 8 = wdFormatHTML
doc.SaveAs('e:/2.pdf', 17)   # 17 = wdFormatPDF
doc.SaveAs('e:/3.html', 10)  # 10 = wdFormatFilteredHTML
doc.Close()
word.Quit()

'''
win32com download 
http://sourceforge.net/projects/pywin32/files/pywin32/Build%20218

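The environment tested here is Windows XP, Office 2007, Python 2.5.2, and pywin32 build 213. The approach calls the Office API directly through the win32com interface. Its advantages are simplicity and good compatibility: anything Office can handle, Python can handle too, and the output is identical to what Word's "Save As" produces.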

Original article: http://www.fuchaoqun.com/2009/03/use-python-convert-word-to-html-with-win32com/
#!/usr/bin/env python
#coding=utf-8
from win32com import client as wc
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('d:/labs/math.doc')
doc.SaveAs('d:/labs/math.html', 8)
doc.Close()
word.Quit()

The key line is doc.SaveAs('d:/labs/math.html', 8). Many articles online write it as doc.SaveAs('d:/labs/math.html', win32com.client.constants.wdFormatHTML), which fails immediately with:

AttributeError: class Constants has no attribute 'wdFormatHTML'

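Of course, the same code can convert a Word file to any other format that Office 2007 supports. For example, to convert a Word file to PDF, change 8 to 17. Below is the full table of file format values supported by Office 2007: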

wdFormatDocument = 0
wdFormatDocument97 = 0
wdFormatDocumentDefault = 16
wdFormatDOSText = 4
wdFormatDOSTextLineBreaks = 5
wdFormatEncodedText = 7
wdFormatFilteredHTML = 10
wdFormatFlatXML = 19
wdFormatFlatXMLMacroEnabled = 20
wdFormatFlatXMLTemplate = 21
wdFormatFlatXMLTemplateMacroEnabled = 22
wdFormatHTML = 8
wdFormatPDF = 17
wdFormatRTF = 6
wdFormatTemplate = 1
wdFormatTemplate97 = 1
wdFormatText = 2
wdFormatTextLineBreaks = 3
wdFormatUnicodeText = 7
wdFormatWebArchive = 9
wdFormatXML = 11
wdFormatXMLDocument = 12
wdFormatXMLDocumentMacroEnabled = 13
wdFormatXMLTemplate = 14
wdFormatXMLTemplateMacroEnabled = 15
wdFormatXPS = 18

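Each name should map to the file format its wording suggests. If you are on Office 2003, it may not support this many formats. There are two formats to choose from when converting a Word file to HTML: wdFormatHTML and wdFormatFilteredHTML (values 8 and 10). The difference is that with wdFormatHTML, OLE objects in the Word file such as equations are saved as WMF, whereas with wdFormatFilteredHTML the equation images are saved as GIF. The HTML generated by wdFormatFilteredHTML is also visibly much cleaner than that generated by wdFormatHTML.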

Of course, you can also call the Office API through COM from any other language, such as PHP.
'''
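
The numeric format codes and the AttributeError quoted above can be handled together. The following is a minimal sketch, not the original author's code: WD_FORMATS, wd_constant, and convert are hypothetical names introduced here, and the values are copied from the format table above. It assumes that win32com.client.constants is only populated once pywin32 has generated wrappers for Word's type library (for example through win32com.client.gencache.EnsureDispatch), so the lookup falls back to the hard-coded table when a name cannot be resolved.

```python
# Subset of the format table above; extend as needed.
WD_FORMATS = {
    "wdFormatHTML": 8,
    "wdFormatFilteredHTML": 10,
    "wdFormatPDF": 17,
    "wdFormatXPS": 18,
}


def wd_constant(name):
    """Resolve a Word format name to its numeric code.

    Tries win32com's constants first and falls back to WD_FORMATS when
    pywin32 is missing or the constants have not been generated.
    """
    try:
        from win32com.client import constants
        return getattr(constants, name)
    except (ImportError, AttributeError):
        return WD_FORMATS[name]


def convert(src_path, out_path, format_name):
    """Save src_path as out_path in the given format (needs Windows and Word)."""
    fmt = wd_constant(format_name)  # resolve the code before starting Word
    from win32com import client as wc  # lazy import: only needed here
    word = wc.Dispatch("Word.Application")
    try:
        doc = word.Documents.Open(src_path)
        try:
            doc.SaveAs(out_path, fmt)
        finally:
            doc.Close()
    finally:
        word.Quit()
```

Under these assumptions, convert('e:/1.doc', 'e:/2.pdf', 'wdFormatPDF') would mirror the second SaveAs call in the first listing. The try/finally blocks are there so that Word is closed even if opening or saving fails, which the original listings do not guarantee.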
Notes:

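The pywin32 build must match your Python installation, including its bitness. For example, my 64-bit machine has 32-bit Python installed: the 64-bit pywin32 fails at runtime with a DLL error, while the 32-bit pywin32 works fine.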
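
To see which pywin32 installer matches, it can help to ask the interpreter itself rather than the operating system. This is a small sketch using only the standard library; python_bits and describe_python are hypothetical helper names introduced here.

```python
import struct
import sys


def python_bits():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


def describe_python():
    """Return a short label such as '2.5 (32-bit)' for picking an installer."""
    major, minor = sys.version_info[0], sys.version_info[1]
    return "%d.%d (%d-bit)" % (major, minor, python_bits())


print(describe_python())
```

The reported bitness is that of the Python build, which can differ from the machine's, as in the 32-bit Python on a 64-bit machine case described above.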


If the installer reports a registry error, simply run the following Python program:

#
# script to register Python 2.0 or later for use with win32all
# and other extensions that require Python registry settings
#
# written by Joakim Loew for Secret Labs AB / PythonWare
#
# source:
# http://www.pythonware.com/products/works/articles/regpy20.htm
#
# modified by Valentine Gogichashvili as described in http://www.mail-archive.com/[email protected]/msg10512.html

import sys

from _winreg import *

# tweak as necessary
version = sys.version[:3]
installpath = sys.prefix

regpath = "SOFTWARE\\Python\\Pythoncore\\%s\\" % (version)
installkey = "InstallPath"
pythonkey = "PythonPath"
pythonpath = "%s;%s\\Lib\\;%s\\DLLs\\" % (
    installpath, installpath, installpath
)

def RegisterPy():
    try:
        reg = OpenKey(HKEY_CURRENT_USER, regpath)
    except EnvironmentError as e:
        try:
            reg = CreateKey(HKEY_CURRENT_USER, regpath)
            SetValue(reg, installkey, REG_SZ, installpath)
            SetValue(reg, pythonkey, REG_SZ, pythonpath)
            CloseKey(reg)
        except:
            print "*** Unable to register!"
            return
        print "--- Python", version, "is now registered!"
        return
    if (QueryValue(reg, installkey) == installpath and
        QueryValue(reg, pythonkey) == pythonpath):
        CloseKey(reg)
        print "=== Python", version, "is already registered!"
        return
    CloseKey(reg)
    print "*** Unable to register!"
    print "*** You probably have another Python installation!"

if __name__ == "__main__":
    RegisterPy()