OpenOffice and JODConverter: pitfalls when converting documents to PDF

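Document conversion performance test
The finance system uses two PDF conversion components.
The first is com.artofsolving, the component the system adopted first: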

<!-- https://mvnrepository.com/artifact/com.artofsolving/jodconverter-->
        <dependency>
            <groupId>com.artofsolving</groupId>
            <artifactId>jodconverter</artifactId>
            <version>2.2.1</version>
        </dependency>

The second is org.artofsolving, the component the system adopted later:

<!-- https://mvnrepository.com/artifact/org.artofsolving.jodconverter/jodconverter-core -->
        <dependency>
            <groupId>org.artofsolving.jodconverter</groupId>
            <artifactId>jodconverter-core</artifactId>
            <version>3.0-beta-4</version>
        </dependency>

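The two components behaved differently during development and testing.
The OpenOffice version used is 4.1.2.
Recommended system requirements:
* Microsoft Windows XP, Vista, Windows 7, or Windows 8
* Pentium III or later processor
* 256 MB RAM (512 MB recommended)
* Up to 1.5 GB of available hard disk space
* 1024x768 resolution (higher recommended), at least 256 colors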

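Dropzone: supported configuration
Maximum size of a single uploaded file: 100 MB
No limit on the number of uploaded files
Number of files uploaded simultaneously: 3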

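The focus below is the efficiency of PDF conversion after upload.
Test files: test.doc, test.ppt, test.xls
[Image: comparison of conversion results for the two components]
The comparison shows that, apart from supporting more formats, org has no speed advantage, and the text in its output is slightly less sharp than com's.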

Notably, on Linux an uploaded file cannot be read if its name contains parentheses ().
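One way to sidestep the parentheses problem is to rename the upload before it reaches the converter. The sketch below is only an illustration under that assumption: `SafeFileNames` and `sanitize` are hypothetical names, not part of JODConverter or OpenOffice, and the original name would have to be stored elsewhere if users need to see it again.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a JODConverter API): makes upload names safe to pass around. */
final class SafeFileNames {
    // Everything except letters, digits, dot, underscore and hyphen is considered unsafe.
    private static final Pattern UNSAFE = Pattern.compile("[^\\p{L}\\p{N}._-]");

    private SafeFileNames() {}

    /** Replaces each unsafe character, such as '(' , ')' or a space, with an underscore. */
    static String sanitize(String originalName) {
        String cleaned = UNSAFE.matcher(originalName).replaceAll("_");
        // Never return an empty name
        return cleaned.isEmpty() ? "_" : cleaned;
    }
}
```

For example, `SafeFileNames.sanitize("report(1).doc")` yields `report_1_.doc`, while a name such as `test.doc` passes through unchanged.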
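Why do the supported file types differ? In the com.artofsolving source, DocumentFormatRegistry is an interface with several implementations, and the default registry's list of document formats (documentFormats) contains no MS Office 2007 formats: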

public class DefaultDocumentFormatRegistry extends BasicDocumentFormatRegistry {

   public DefaultDocumentFormatRegistry() {
      final DocumentFormat pdf = new DocumentFormat("Portable Document Format", "application/pdf", "pdf");
        pdf.setExportFilter(DocumentFamily.DRAWING, "draw_pdf_Export");
      pdf.setExportFilter(DocumentFamily.PRESENTATION, "impress_pdf_Export");
      pdf.setExportFilter(DocumentFamily.SPREADSHEET, "calc_pdf_Export");
      pdf.setExportFilter(DocumentFamily.TEXT, "writer_pdf_Export");
      addDocumentFormat(pdf);

      final DocumentFormat swf = new DocumentFormat("Macromedia Flash", "application/x-shockwave-flash", "swf");
        swf.setExportFilter(DocumentFamily.DRAWING, "draw_flash_Export");
      swf.setExportFilter(DocumentFamily.PRESENTATION, "impress_flash_Export");
      addDocumentFormat(swf);

      final DocumentFormat xhtml = new DocumentFormat("XHTML", "application/xhtml+xml", "xhtml");
      xhtml.setExportFilter(DocumentFamily.PRESENTATION, "XHTML Impress File");
      xhtml.setExportFilter(DocumentFamily.SPREADSHEET, "XHTML Calc File");
      xhtml.setExportFilter(DocumentFamily.TEXT, "XHTML Writer File");
      addDocumentFormat(xhtml);

      // HTML is treated as Text when supplied as input, but as an output it is also
      // available for exporting Spreadsheet and Presentation formats
      final DocumentFormat html = new DocumentFormat("HTML", DocumentFamily.TEXT, "text/html", "html");
      html.setExportFilter(DocumentFamily.PRESENTATION, "impress_html_Export");
      html.setExportFilter(DocumentFamily.SPREADSHEET, "HTML (StarCalc)");
      html.setExportFilter(DocumentFamily.TEXT, "HTML (StarWriter)");
      addDocumentFormat(html);

      final DocumentFormat odt = new DocumentFormat("OpenDocument Text", DocumentFamily.TEXT, "application/vnd.oasis.opendocument.text", "odt");
      odt.setExportFilter(DocumentFamily.TEXT, "writer8");
      addDocumentFormat(odt);

      final DocumentFormat sxw = new DocumentFormat("OpenOffice.org 1.0 Text Document", DocumentFamily.TEXT, "application/vnd.sun.xml.writer", "sxw");
      sxw.setExportFilter(DocumentFamily.TEXT, "StarOffice XML (Writer)");
      addDocumentFormat(sxw);

      final DocumentFormat doc = new DocumentFormat("Microsoft Word", DocumentFamily.TEXT, "application/msword", "doc");
      doc.setExportFilter(DocumentFamily.TEXT, "MS Word 97");
      addDocumentFormat(doc);

      final DocumentFormat rtf = new DocumentFormat("Rich Text Format", DocumentFamily.TEXT, "text/rtf", "rtf");
      rtf.setExportFilter(DocumentFamily.TEXT, "Rich Text Format");
      addDocumentFormat(rtf);

      final DocumentFormat wpd = new DocumentFormat("WordPerfect", DocumentFamily.TEXT, "application/wordperfect", "wpd");
      addDocumentFormat(wpd);

      final DocumentFormat txt = new DocumentFormat("Plain Text", DocumentFamily.TEXT, "text/plain", "txt");
        // set FilterName to "Text" to prevent OOo from trying to display the "ASCII Filter Options" dialog
        // alternatively FilterName could be "Text (encoded)" and FilterOptions used to set encoding if needed
        txt.setImportOption("FilterName", "Text");
      txt.setExportFilter(DocumentFamily.TEXT, "Text");
      addDocumentFormat(txt);

      final DocumentFormat wikitext = new DocumentFormat("MediaWiki wikitext", "text/x-wiki", "wiki");
      wikitext.setExportFilter(DocumentFamily.TEXT, "MediaWiki");
        addDocumentFormat(wikitext);

      final DocumentFormat ods = new DocumentFormat("OpenDocument Spreadsheet", DocumentFamily.SPREADSHEET, "application/vnd.oasis.opendocument.spreadsheet", "ods");
      ods.setExportFilter(DocumentFamily.SPREADSHEET, "calc8");
      addDocumentFormat(ods);

      final DocumentFormat sxc = new DocumentFormat("OpenOffice.org 1.0 Spreadsheet", DocumentFamily.SPREADSHEET, "application/vnd.sun.xml.calc", "sxc");
      sxc.setExportFilter(DocumentFamily.SPREADSHEET, "StarOffice XML (Calc)");
      addDocumentFormat(sxc);

      final DocumentFormat xls = new DocumentFormat("Microsoft Excel", DocumentFamily.SPREADSHEET, "application/vnd.ms-excel", "xls");
      xls.setExportFilter(DocumentFamily.SPREADSHEET, "MS Excel 97");
      addDocumentFormat(xls);

        final DocumentFormat csv = new DocumentFormat("CSV", DocumentFamily.SPREADSHEET, "text/csv", "csv");
        csv.setImportOption("FilterName", "Text - txt - csv (StarCalc)");
        csv.setImportOption("FilterOptions", "44,34,0");  // Field Separator: ','; Text Delimiter: '"'  
        csv.setExportFilter(DocumentFamily.SPREADSHEET, "Text - txt - csv (StarCalc)");
        csv.setExportOption(DocumentFamily.SPREADSHEET, "FilterOptions", "44,34,0");
        addDocumentFormat(csv);

        final DocumentFormat tsv = new DocumentFormat("Tab-separated Values", DocumentFamily.SPREADSHEET, "text/tab-separated-values", "tsv");
        tsv.setImportOption("FilterName", "Text - txt - csv (StarCalc)");
        tsv.setImportOption("FilterOptions", "9,34,0");  // Field Separator: '\t'; Text Delimiter: '"'
        tsv.setExportFilter(DocumentFamily.SPREADSHEET, "Text - txt - csv (StarCalc)");
        tsv.setExportOption(DocumentFamily.SPREADSHEET, "FilterOptions", "9,34,0");
        addDocumentFormat(tsv);

      final DocumentFormat odp = new DocumentFormat("OpenDocument Presentation", DocumentFamily.PRESENTATION, "application/vnd.oasis.opendocument.presentation", "odp");
      odp.setExportFilter(DocumentFamily.PRESENTATION, "impress8");
      addDocumentFormat(odp);

      final DocumentFormat sxi = new DocumentFormat("OpenOffice.org 1.0 Presentation", DocumentFamily.PRESENTATION, "application/vnd.sun.xml.impress", "sxi");
      sxi.setExportFilter(DocumentFamily.PRESENTATION, "StarOffice XML (Impress)");
      addDocumentFormat(sxi);

      final DocumentFormat ppt = new DocumentFormat("Microsoft PowerPoint", DocumentFamily.PRESENTATION, "application/vnd.ms-powerpoint", "ppt");
      ppt.setExportFilter(DocumentFamily.PRESENTATION, "MS PowerPoint 97");
      addDocumentFormat(ppt);

        final DocumentFormat odg = new DocumentFormat("OpenDocument Drawing", DocumentFamily.DRAWING, "application/vnd.oasis.opendocument.graphics", "odg");
        odg.setExportFilter(DocumentFamily.DRAWING, "draw8");
        addDocumentFormat(odg);

        final DocumentFormat svg = new DocumentFormat("Scalable Vector Graphics", "image/svg+xml", "svg");
        svg.setExportFilter(DocumentFamily.DRAWING, "draw_svg_Export");
        addDocumentFormat(svg);
   }
}
The org source, by contrast, looks like this:

public class DefaultDocumentFormatRegistry extends SimpleDocumentFormatRegistry {
    public DefaultDocumentFormatRegistry() {
        DocumentFormat pdf = new DocumentFormat("Portable Document Format", "pdf", "application/pdf");
        pdf.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "writer_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "calc_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw_pdf_Export"));
        this.addFormat(pdf);
        DocumentFormat swf = new DocumentFormat("Macromedia Flash", "swf", "application/x-shockwave-flash");
        swf.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress_flash_Export"));
        swf.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw_flash_Export"));
        this.addFormat(swf);
        DocumentFormat html = new DocumentFormat("HTML", "html", "text/html");
        html.setInputFamily(DocumentFamily.TEXT);
        html.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "HTML (StarWriter)"));
        html.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "HTML (StarCalc)"));
        html.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress_html_Export"));
        this.addFormat(html);
        DocumentFormat odt = new DocumentFormat("OpenDocument Text", "odt", "application/vnd.oasis.opendocument.text");
        odt.setInputFamily(DocumentFamily.TEXT);
        odt.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "writer8"));
        this.addFormat(odt);
        DocumentFormat sxw = new DocumentFormat("OpenOffice.org 1.0 Text Document", "sxw", "application/vnd.sun.xml.writer");
        sxw.setInputFamily(DocumentFamily.TEXT);
        sxw.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "StarOffice XML (Writer)"));
        this.addFormat(sxw);
        DocumentFormat doc = new DocumentFormat("Microsoft Word", "doc", "application/msword");
        doc.setInputFamily(DocumentFamily.TEXT);
        doc.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MS Word 97"));
        this.addFormat(doc);
        DocumentFormat docx = new DocumentFormat("Microsoft Word 2007 XML", "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        docx.setInputFamily(DocumentFamily.TEXT);
        this.addFormat(docx);
        DocumentFormat rtf = new DocumentFormat("Rich Text Format", "rtf", "text/rtf");
        rtf.setInputFamily(DocumentFamily.TEXT);
        rtf.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "Rich Text Format"));
        this.addFormat(rtf);
        DocumentFormat wpd = new DocumentFormat("WordPerfect", "wpd", "application/wordperfect");
        wpd.setInputFamily(DocumentFamily.TEXT);
        this.addFormat(wpd);
        DocumentFormat txt = new DocumentFormat("Plain Text", "txt", "text/plain");
        txt.setInputFamily(DocumentFamily.TEXT);
        LinkedHashMap txtLoadAndStoreProperties = new LinkedHashMap();
        txtLoadAndStoreProperties.put("FilterName", "Text (encoded)");
        txtLoadAndStoreProperties.put("FilterOptions", "utf8");
        txt.setLoadProperties(txtLoadAndStoreProperties);
        txt.setStoreProperties(DocumentFamily.TEXT, txtLoadAndStoreProperties);
        this.addFormat(txt);
        DocumentFormat wikitext = new DocumentFormat("MediaWiki wikitext", "wiki", "text/x-wiki");
        wikitext.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MediaWiki"));
        DocumentFormat ods = new DocumentFormat("OpenDocument Spreadsheet", "ods", "application/vnd.oasis.opendocument.spreadsheet");
        ods.setInputFamily(DocumentFamily.SPREADSHEET);
        ods.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "calc8"));
        this.addFormat(ods);
        DocumentFormat sxc = new DocumentFormat("OpenOffice.org 1.0 Spreadsheet", "sxc", "application/vnd.sun.xml.calc");
        sxc.setInputFamily(DocumentFamily.SPREADSHEET);
        sxc.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "StarOffice XML (Calc)"));
        this.addFormat(sxc);
        DocumentFormat xls = new DocumentFormat("Microsoft Excel", "xls", "application/vnd.ms-excel");
        xls.setInputFamily(DocumentFamily.SPREADSHEET);
        xls.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "MS Excel 97"));
        this.addFormat(xls);
        DocumentFormat xlsx = new DocumentFormat("Microsoft Excel 2007 XML", "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        xlsx.setInputFamily(DocumentFamily.SPREADSHEET);
        this.addFormat(xlsx);
        DocumentFormat csv = new DocumentFormat("Comma Separated Values", "csv", "text/csv");
        csv.setInputFamily(DocumentFamily.SPREADSHEET);
        LinkedHashMap csvLoadAndStoreProperties = new LinkedHashMap();
        csvLoadAndStoreProperties.put("FilterName", "Text - txt - csv (StarCalc)");
        csvLoadAndStoreProperties.put("FilterOptions", "44,34,0");
        csv.setLoadProperties(csvLoadAndStoreProperties);
        csv.setStoreProperties(DocumentFamily.SPREADSHEET, csvLoadAndStoreProperties);
        this.addFormat(csv);
        DocumentFormat tsv = new DocumentFormat("Tab Separated Values", "tsv", "text/tab-separated-values");
        tsv.setInputFamily(DocumentFamily.SPREADSHEET);
        LinkedHashMap tsvLoadAndStoreProperties = new LinkedHashMap();
        tsvLoadAndStoreProperties.put("FilterName", "Text - txt - csv (StarCalc)");
        tsvLoadAndStoreProperties.put("FilterOptions", "9,34,0");
        tsv.setLoadProperties(tsvLoadAndStoreProperties);
        tsv.setStoreProperties(DocumentFamily.SPREADSHEET, tsvLoadAndStoreProperties);
        this.addFormat(tsv);
        DocumentFormat odp = new DocumentFormat("OpenDocument Presentation", "odp", "application/vnd.oasis.opendocument.presentation");
        odp.setInputFamily(DocumentFamily.PRESENTATION);
        odp.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress8"));
        this.addFormat(odp);
        DocumentFormat sxi = new DocumentFormat("OpenOffice.org 1.0 Presentation", "sxi", "application/vnd.sun.xml.impress");
        sxi.setInputFamily(DocumentFamily.PRESENTATION);
        sxi.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "StarOffice XML (Impress)"));
        this.addFormat(sxi);
        DocumentFormat ppt = new DocumentFormat("Microsoft PowerPoint", "ppt", "application/vnd.ms-powerpoint");
        ppt.setInputFamily(DocumentFamily.PRESENTATION);
        ppt.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "MS PowerPoint 97"));
        this.addFormat(ppt);
        DocumentFormat pptx = new DocumentFormat("Microsoft PowerPoint 2007 XML", "pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
        pptx.setInputFamily(DocumentFamily.PRESENTATION);
        this.addFormat(pptx);
        DocumentFormat odg = new DocumentFormat("OpenDocument Drawing", "odg", "application/vnd.oasis.opendocument.graphics");
        odg.setInputFamily(DocumentFamily.DRAWING);
        odg.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw8"));
        this.addFormat(odg);
        DocumentFormat svg = new DocumentFormat("Scalable Vector Graphics", "svg", "image/svg+xml");
        svg.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw_svg_Export"));
        this.addFormat(svg);
    }
}

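The underlying implementation is much the same in both, so it may be possible to customize com to support the additional file formats.
Conversion time also varies with the number and size of the test files. Converting many medium-sized files serially is bound to take a long time; switching to parallel processing can speed this up.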
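The move from serial to parallel conversion could look roughly like the sketch below. It uses only the JDK; `ParallelConversion`, `convertAll` and the `convertOne` function are hypothetical stand-ins for the real conversion call, and the sketch assumes the converter behind `convertOne` can safely be called from several threads (for example because several OpenOffice processes are available), which the components discussed here do not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs one conversion per input on a fixed-size thread pool. */
final class ParallelConversion {
    private ParallelConversion() {}

    static <I, O> List<O> convertAll(List<I> inputs, Function<I, O> convertOne, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (I input : inputs) {
                tasks.add(() -> convertOne.apply(input));
            }
            List<O> results = new ArrayList<>();
            // invokeAll waits for every task and returns futures in input order
            for (Future<O> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while converting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a conversion failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A pool size of 3 would match the Dropzone setting of three simultaneous uploads, though the right number depends on how many conversions the OpenOffice side can actually serve at once.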
Notably, org has two ways of creating a converter: one supports MS 2007 formats and the other does not.
Does not support MS 2007:

DocumentConverter converter = new StreamOpenOfficeDocumentConverter(connection);   
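However, posts online say it can resolve this exception:
com.artofsolving.jodconverter.openoffice.connection.OpenOfficeException: conversion failed: could not load input document
That is the problem of file paths not being resolved correctly on Linux.
Supports MS 2007: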
OfficeManager officeManager = getOfficeManager();
// Connect to OpenOffice
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);

How com creates the converter object:
connection.connect();
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
com can also create the converter through StreamOpenOfficeDocumentConverter, but this system does not use that approach.

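In summary, if uploaded files average no more than 5 MB and there are no more than 5 of them, the system finishes processing within 10 seconds.
In later tests, when files exceeded 10 MB and conversions were frequent, system resources were exhausted and conversions could not complete; conversion tasks submitted afterwards were no longer accepted into the component's task queue. There is a performance problem here: converting large files (around 20 MB) sometimes times out. The source sets the execution time of a single PDF conversion task to 120 s; on timeout it reports an error, reconnects, and moves on to the next task.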
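The per-task time limit described above can also be enforced on the caller's side. The following is a minimal JDK-only sketch of that idea, not the component's own implementation; `TimedConversion` and `runWithTimeout` are hypothetical names, and the `Callable` stands in for whatever call performs the conversion.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: gives up on a conversion that exceeds a time limit. */
final class TimedConversion {
    private TimedConversion() {}

    /** Returns the result, or an empty Optional if the time limit was exceeded. */
    static <T> Optional<T> runWithTimeout(Callable<T> conversion, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> future = worker.submit(conversion);
        try {
            // A null result is also reported as empty
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("conversion failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }
}
```

With a limit of 120000 ms this mirrors the 120 s value mentioned above. Note that cancelling only interrupts the Java thread; whether the OpenOffice process itself stops working on the document is a separate matter that this sketch does not address.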

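Many pitfalls came up while developing attachment upload:

  1. The input file could not be read: the port was occupied; fixed by restarting.
  2. Special characters in file names could not be parsed: this is related to Alibaba Cloud file upload.
  3. Port occupied: conversion of other small files could not continue.
    The code that connects to OpenOffice was changed to first connect to an already-running OpenOffice service, and otherwise to start a new conversion service and connect to it. (Code-level fix)
  4. No support for docx and other newer MS document formats. (Solved by adding the org component.)
  5. No support for concurrent processing or for large file conversion.
    After reading the source we found no way to optimize this; the components come with essentially no source code, and what we read was only decompiled output.
  6. OpenOffice behaves somewhat differently on Windows and on Linux, mainly in conversion time and in how it handles file formats, file names, and file types.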
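The fix in item 3 needs a way to tell whether an OpenOffice service is already listening before deciding to start a new one. A minimal sketch of such a check with plain sockets follows; `OfficePortProbe` and `isListening` are hypothetical names, and the host and port are whatever the service was configured with, which the notes above do not state.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether something accepts TCP connections on a port. */
final class OfficePortProbe {
    private OfficePortProbe() {}

    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;  // a service accepted the connection
        } catch (IOException e) {
            return false; // refused, timed out, or unreachable
        }
    }
}
```

If the probe returns true, the code would reuse the running service; otherwise it would start a new one. A successful probe only shows that the port is open, not that the process behind it is a healthy OpenOffice instance.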