Reading PDF and TXT Content in C#

阿新 • Published: 2019-01-16

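// Read the text content of a PDF with iTextSharp
// Requires: using System; using System.IO; using System.Text; using iTextSharp.text.pdf;
        private void button2_Click(object sender, EventArgs e)
        {
            label3.Text = OnCreated("D:\\aa.pdf");
        }

        // Returns the text of every page, or null if the file could not be read.
        private string OnCreated(string filepath)
        {
            try
            {
                StringBuilder text = new StringBuilder();
                PdfReader pdfReader = new PdfReader(filepath);
                try
                {
                    // PDF page numbers start at 1
                    for (int i = 1; i <= pdfReader.NumberOfPages; ++i)
                    {
                        // A fresh strategy per page keeps text from earlier pages out of the result
                        iTextSharp.text.pdf.parser.ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
                        text.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy));
                    }
                }
                finally
                {
                    pdfReader.Close(); // release the file even if extraction fails
                }

                return text.ToString();
            }
            catch (Exception ex)
            {
                // Append the failing file name and the reason to mylog.log in the application directory
                string logPath = Path.Combine(System.AppDomain.CurrentDomain.SetupInformation.ApplicationBase, "mylog.log");
                using (StreamWriter wlog = File.AppendText(logPath))
                {
                    wlog.WriteLine("Failed file: " + filepath + " Reason: " + ex.ToString());
                }
                return null;
            }
        }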


// Read a TXT file
string text = System.IO.File.ReadAllText(path); // read the whole file; path is the file path
text = text.Replace("\n", string.Empty).Replace("\r", string.Empty); // strip the \n and \r characters from the string
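The two lines above can be wrapped into a small self-contained program. This is only a sketch: the names TxtReader, StripLineBreaks and ReadFlattened are made up for illustration, and passing an explicit UTF-8 encoding is an assumption about the input file, not something the original snippet does.

```csharp
using System;
using System.IO;
using System.Text;

public static class TxtReader
{
    // Hypothetical helper: removes every CR and LF character from the text.
    public static string StripLineBreaks(string text)
    {
        return text.Replace("\r", string.Empty).Replace("\n", string.Empty);
    }

    // Hypothetical helper: reads the whole file with the given encoding,
    // then strips the line breaks.
    public static string ReadFlattened(string path, Encoding encoding)
    {
        return StripLineBreaks(File.ReadAllText(path, encoding));
    }

    public static void Main()
    {
        // Write a small sample file to the temp directory so the sketch runs on its own.
        string path = Path.Combine(Path.GetTempPath(), "txt-reader-demo.txt");
        File.WriteAllText(path, "first line\r\nsecond line\n", Encoding.UTF8);

        Console.WriteLine(ReadFlattened(path, Encoding.UTF8));

        File.Delete(path);
    }
}
```

Note that stripping line breaks joins adjacent lines with no separator at all. If words at line ends must stay apart, replacing the breaks with a space may be a better fit.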



Example:

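        // 1. Create a PDF and add text and an image to it.
        // 2. Extract all images from the PDF document.
        // 3. Extract all text from the PDF document.

        // Create a PDF file that contains text and an image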
        private void button2_Click(object sender, EventArgs e)
        {
            Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
            PdfPageBase page = doc.Pages.Add();

            // Add text
            page.Canvas.DrawString("Hello!Welcome to my house!",
            new Spire.Pdf.Graphics.PdfFont(PdfFontFamily.Helvetica, 20f),
            new PdfSolidBrush(Color.Black), 10, 10); // Note: Chinese characters do not render correctly here; English letters do

            // Add an image
            Spire.Pdf.Graphics.PdfImage image = Spire.Pdf.Graphics.PdfImage.FromFile("ff.jpg");
            float width = image.Width * 0.75f;
            float height = image.Height * 0.75f;
            float x = (page.Canvas.ClientSize.Width - width) / 2;
            page.Canvas.DrawImage(image, x, 60, width, height);

            //Spire.Pdf.Graphics.PdfImage image2 = Spire.Pdf.Graphics.PdfImage.FromFile("image.jpg");
            //width = image2.Width * 0.75f;
            //height = image2.Height * 0.75f;
            //page.Canvas.DrawImage(image2, x - 100, 220, width, height);
            doc.SaveToFile("sample.pdf");
        }

        // Extract the images, show how many were found, and save them to disk
        private void button1_Click(object sender, EventArgs e)
        {
            Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
            doc.LoadFromFile("sample.pdf");
            IList<Image> images = new List<Image>();
            foreach (PdfPageBase page in doc.Pages)
            {
                // Extract once per page instead of calling ExtractImages() twice
                var extracted = page.ExtractImages();
                if (extracted != null)
                {
                    foreach (Image image in extracted)
                    {
                        images.Add(image);
                    }
                }
            }
            doc.Close();
            int index = 0;
            int aa = images.Count;
            label3.Text = aa.ToString();
            foreach (Image image in images)
            {
                String imageFileName = String.Format("Image-{0}.png", index++);
                image.Save(imageFileName, ImageFormat.Png);
            }
        }

        // Extract the text
        private void button3_Click(object sender, EventArgs e)
        {
            Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
            doc.LoadFromFile("sample.pdf");

            StringBuilder buffer = new StringBuilder();
            foreach (PdfPageBase page in doc.Pages)
            {
                buffer.Append(page.ExtractText());
            }
            doc.Close();
            label1.Text = buffer.ToString(); // show the extracted text in the UI
            // Write the extracted text to a TXT file
            //String fileName = "TextInPdf.txt";
            //File.WriteAllText(fileName, buffer.ToString());
            buffer = null;
        }
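The commented-out lines in the method above hint at saving the extracted text to a TXT file. A minimal sketch of that step on its own, using only the .NET base class library, might look like this. The class name ExtractedTextWriter and the Save helper are hypothetical, the sample string stands in for buffer.ToString(), and UTF-8 output is an assumption.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExtractedTextWriter
{
    // Hypothetical helper: writes the text to fileName as UTF-8
    // and returns the full path of the file that was written.
    public static string Save(string fileName, string text)
    {
        File.WriteAllText(fileName, text, Encoding.UTF8);
        return Path.GetFullPath(fileName);
    }

    public static void Main()
    {
        // Stand-in for the text collected from the PDF pages.
        string extracted = "Hello!Welcome to my house!";

        // The temp directory is used here only so the sketch can run anywhere.
        string fileName = Path.Combine(Path.GetTempPath(), "TextInPdf.txt");
        string savedTo = Save(fileName, extracted);

        // Read the file back to confirm the round trip preserved the text.
        bool same = File.ReadAllText(savedTo, Encoding.UTF8) == extracted;
        Console.WriteLine("Round trip preserved text: " + same);

        File.Delete(savedTo);
    }
}
```

Writing with an explicit encoding matters most when the extracted text contains non-ASCII characters such as Chinese, since the reader of the file then knows which encoding to expect.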

Reference: http://www.cnblogs.com/Yesi/p/4203686.html
