Parsing PDF Files (Text and Images) with Apache PDFBox 2.0.x

阿新 • Published: 2019-01-09

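A recent project required parsing PDF files and extracting their content for storage in a database. The parsing uses the open-source Apache PDFBox library, version 2.0.8, which was the latest release at the time. The basic steps for reading and parsing a PDF are recorded below.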

The Apache download link is:

https://pdfbox.apache.org/download.cgi

With Maven, add the following dependencies:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.8</version>
    </dependency>

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>2.0.8</version>
    </dependency>

2. Extracting text from a PDF:

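First read the file, or obtain the stream of a file uploaded through the web, and load it into a PDDocument. Then process the document and wrap the results in whatever data or objects you need. The parsing code is as follows: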

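    // Requires java.io.InputStream, org.apache.pdfbox.pdmodel.PDDocument,
    // org.apache.pdfbox.text.PDFTextStripper, and a logger field named "log".

    /**
     * Extracts the text of a PDF as a string. Only text content is returned;
     * images are not extracted and table structure is not preserved.
     *
     * @param fileStream input stream of the PDF file
     * @param sort whether to sort the text by its position on the page
     * @return the extracted text, or null if parsing fails
     */
    public static String getTextFromPdf(InputStream fileStream, boolean sort) {
        String content = null;
        // try-with-resources closes the document even if parsing fails
        try (PDDocument document = PDDocument.load(fileStream)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(sort);
            // Page numbers are 1-based here: extract from the first page to the last
            stripper.setStartPage(1);
            stripper.setEndPage(document.getNumberOfPages());
            content = stripper.getText(document);
            log.info("PDF parsed, content: " + content);
        } catch (Exception e) {
            log.error("Failed to parse PDF: " + e.getMessage(), e);
        }
        return content;
    }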
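
The extracted text comes back as a single string, so wrapping it into your own data usually starts with splitting it up. The following is a minimal sketch and not part of the original code: the class PdfTextLines and the method nonBlankLines are hypothetical names, and it assumes that trimmed, non-blank lines are the unit you want to store. It also treats the null that getTextFromPdf returns on failure as an empty result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named here for illustration only.
final class PdfTextLines {
    private PdfTextLines() {}

    /** Splits extracted text into trimmed lines and drops the blank ones. */
    static List<String> nonBlankLines(String content) {
        List<String> lines = new ArrayList<>();
        if (content == null) {
            // getTextFromPdf returns null when parsing fails
            return lines;
        }
        // \R matches any line break, including \r\n as a single unit
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }
}
```

A call such as PdfTextLines.nonBlankLines(getTextFromPdf(stream, true)) would then give a list that can be mapped to database rows or domain objects.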

3. Collecting the list of images in a PDF document (the code speaks for itself):

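    // Requires java.io.IOException, java.util.ArrayList, java.util.List,
    // org.apache.pdfbox.cos.COSName, org.apache.pdfbox.pdmodel.PDDocument,
    // org.apache.pdfbox.pdmodel.PDPageTree, org.apache.pdfbox.pdmodel.PDResources,
    // and org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.

    /**
     * Reads all images from the pages of a PDF document. Only image XObjects
     * referenced directly by a page's resources are collected.
     *
     * @param document the loaded PDF document
     * @param startPage zero-based index of the first page to scan; null means 0
     * @return the images found, in page order
     * @throws IOException if an image object cannot be read
     */
    public static List<PDImageXObject> getImageListFromPDF(PDDocument document, Integer startPage) throws IOException {
        List<PDImageXObject> imageList = new ArrayList<PDImageXObject>();
        if (document == null) {
            return imageList;
        }
        PDPageTree pages = document.getPages();
        int first = startPage == null ? 0 : Math.max(0, startPage);
        for (int i = first; i < pages.getCount(); i++) {
            // A page may have no resource dictionary at all
            PDResources resources = pages.get(i).getResources();
            if (resources == null) {
                continue;
            }
            for (COSName objectName : resources.getXObjectNames()) {
                if (resources.isImageXObject(objectName)) {
                    imageList.add((PDImageXObject) resources.getXObject(objectName));
                }
            }
        }
        return imageList;
    }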

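Note: the list returned by the method above contains PDImageXObject objects, not Java's own Image objects, so they cannot be saved locally or submitted to a server directly. A simple conversion is needed first, for example: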

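    /**
     * Encodes an image from the PDF and returns it as an input stream.
     *
     * @param image the image object extracted from the PDF
     * @return a stream over the encoded image bytes, or null if there is no image
     * @throws IOException if the image cannot be decoded or written
     */
    public static InputStream getImageInputStream(PDImageXObject image) throws IOException {
        if (image == null) {
            return null;
        }
        // Call getImage() once and reuse the result
        BufferedImage bufferedImage = image.getImage();
        if (bufferedImage == null) {
            return null;
        }
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // ImageIO.write returns false when no registered writer supports the format
        if (!ImageIO.write(bufferedImage, image.getSuffix(), os)) {
            throw new IOException("No ImageIO writer for format: " + image.getSuffix());
        }
        return new ByteArrayInputStream(os.toByteArray());
    }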
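
One caveat with this conversion: ImageIO.write needs a format name for which a writer is registered, and the suffix reported for a PDF image may not be one of them, or may be missing altogether. Below is a hedged sketch of one way to guard against that, using only the JDK. The class ImageBytes and its methods are hypothetical helpers, and falling back to PNG is an assumption made here because PNG is lossless and handled by the standard JDK writers.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, named here for illustration only.
final class ImageBytes {
    private ImageBytes() {}

    /** Returns the given format if ImageIO has a writer for it, otherwise "png". */
    static String writableFormat(String suffix) {
        if (suffix != null && ImageIO.getImageWritersByFormatName(suffix).hasNext()) {
            return suffix;
        }
        return "png"; // assumed fallback format
    }

    /** Encodes the image in the requested format, or as PNG if that format is unavailable. */
    static byte[] encode(BufferedImage image, String suffix) throws IOException {
        String format = writableFormat(suffix);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // write() returns false if no writer accepts this particular image
        if (!ImageIO.write(image, format, out)) {
            throw new IOException("No ImageIO writer could encode the image as " + format);
        }
        return out.toByteArray();
    }
}
```

With a PDImageXObject named image, this could be called as ImageBytes.encode(image.getImage(), image.getSuffix()). If the fallback is used, the stored file extension should come from writableFormat rather than from the original suffix.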

Once the image data is available this way, it can be written to disk through a File object, for example:

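    // "name" is the target file name, "image" is a PDImageXObject from the list above,
    // and "imageb" is assumed to be the BufferedImage obtained from image.getImage().
    File imgFile = new File("e:\\" + name + "." + image.getSuffix());
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ImageIO.write(imageb, image.getSuffix(), os);
    // try-with-resources closes both streams even if the copy fails
    try (InputStream is = new ByteArrayInputStream(os.toByteArray());
         FileOutputStream fout = new FileOutputStream(imgFile)) {
        byte[] bytes = new byte[1024];
        int byteCount;
        while ((byteCount = is.read(bytes)) > 0) {
            fout.write(bytes, 0, byteCount);
        }
    }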
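
The copy loop can also be left to the JDK. The following is a minimal sketch, assuming Java 7 or later: ImageFiles.save is a hypothetical helper, and the directory, name, and suffix parameters are placeholders for whatever your application uses. Files.copy performs the copy, and try-with-resources closes the input stream even if the write fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, named here for illustration only.
final class ImageFiles {
    private ImageFiles() {}

    /** Writes the stream to dir/name.suffix, replacing any existing file, and returns the path. */
    static Path save(InputStream data, Path dir, String name, String suffix) throws IOException {
        // Create the target directory if it does not exist yet
        Files.createDirectories(dir);
        Path target = dir.resolve(name + "." + suffix);
        try (InputStream in = data) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Combined with the conversion method above, a call could look like ImageFiles.save(getImageInputStream(image), Paths.get("e:\\"), name, image.getSuffix()).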

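The above is for reference only. In testing, the code parsed both text and images, which could then be stored in the database and displayed or downloaded from the view layer. The code only demonstrates the approach and has not been optimized further; corrections are welcome. Thank you.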
