
Reading doc, docx, xls, pdf, and txt content with PHP


               

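One of my clients has this requirement: uploaded files may be in doc, docx, xls, pdf, or txt format, and PHP needs to read the content of each file and then count the words in it.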

1. Reading DOC files with PHP

      PHP has no built-in class or library for reading Word files, so here we use the antiword package (http://www.winfield.demon.nl/) to read doc files.

     First, how to use it on Windows:

      1. Open http://www.winfield.demon.nl/ (the antiword download page), find the Windows section (http://www.winfield.demon.nl/#Windows), and download the Windows build of antiword (antiword-0_37-windows.zip);

      2. Extract the downloaded archive to the root of drive C;

One more thing to note: http://www.informatik.uni-frankfurt.de/~markus/antiword/00README.WIN contains the installation notes for Windows.

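  An environment variable needs to be set: My Computer (right-click) -> Advanced -> Environment Variables, then create a new entry under the user variables at the top:

  Variable name: HOME

  Variable value: c:\home. This directory must exist; if it does not, create a home folder on drive C.

Then, under the system variables, edit Path and add %HOME%\antiword at the very front of its value.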

 

      3. Start -> Run -> CMD, then change to the antiword directory;

      Type antiword -h to check that it works.

 

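   4. Next, use the antiword -t command to read the content of a doc file. First copy a doc file into the c:\antiword directory, then run

   >antiword -t filename.doc

   The content of the Word file is printed to the screen.

You may be wondering what this has to do with reading Word files from PHP. Don't worry, here is how to call this command from PHP.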

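  <?php

  // Single quotes keep the backslashes in the Windows path literal.
  $file = 'D:\xampp\htdocs\word_count\uploads\doc-english.doc';

   // escapeshellarg() guards against spaces and shell metacharacters in the path.
   $content = shell_exec ( 'c:\antiword\antiword -f ' . escapeshellarg ( $file ) );

?>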

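This reads the content of the Word file into $content.

To read doc files on Linux, download the Linux version of the archive; it includes a readme.txt file, and following its instructions is enough to install it.

$content = shell_exec ( "/usr/local/bin/antiword -f " . escapeshellarg ( $file ) );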
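Because shell_exec gives no usable result when the command cannot run, it can help to wrap the call. The following is a minimal sketch rather than the original author's code: readDocText is a hypothetical helper name, and the default binary path simply reuses the Linux path shown above.

```php
<?php

/**
 * Hypothetical helper: return the text antiword extracts from a .doc file,
 * or an empty string when the file or the antiword binary is unavailable.
 */
function readDocText(string $file, string $antiword = '/usr/local/bin/antiword'): string
{
    // Bail out early instead of handing a bad path to the shell.
    if (!is_file($file) || !is_executable($antiword)) {
        return '';
    }

    // escapeshellarg() quotes each part so spaces or metacharacters in an
    // uploaded file name cannot alter the command.
    $output = shell_exec(escapeshellarg($antiword) . ' -f ' . escapeshellarg($file));

    // shell_exec() yields null or false on failure; normalise that to ''.
    return is_string($output) ? $output : '';
}
```

Called as readDocText($file), it returns an empty string for a missing file, so the result can be passed straight to a word counter.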

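2. Reading PDF files with PHP

    PHP has no dedicated library for reading PDF content either, so we use a third-party package (xpdf). Again starting with Windows: download it and extract it to the root of drive C.

   Start -> Run -> cmd -> cd /d c:\xpdf 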
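<?php

   // Single quotes keep the backslashes in the Windows path literal.
   $file = 'D:\xampp\htdocs\word_count\uploads\pdf-english.pdf';

    // The trailing "-" sends the extracted text to standard output.
    $content = shell_exec ( "c:\\xpdf\\pdftotext " . escapeshellarg ( $file ) . " -" );

   ?>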

This reads the content of the PDF file into a PHP variable.

Installation on Linux is also straightforward, so the steps are not listed here.

<?php

$content = shell_exec ( "/usr/bin/pdftotext " . escapeshellarg ( $file ) . " -" );

?>
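shell_exec does not expose the exit status, so a failed conversion looks the same as an empty document. As a sketch, assuming pdftotext lives at /usr/bin/pdftotext as above, exec() can be used instead; pdfToText is a hypothetical name introduced only for this illustration.

```php
<?php

/**
 * Hypothetical helper: run pdftotext and return the extracted text,
 * or null when the command exits with a non-zero status.
 */
function pdfToText(string $file, string $pdftotext = '/usr/bin/pdftotext'): ?string
{
    // The trailing "-" tells pdftotext to write the text to standard output.
    $command = escapeshellarg($pdftotext) . ' ' . escapeshellarg($file) . ' -';

    $lines = [];
    $status = 0;
    exec($command, $lines, $status);

    // A non-zero status means the binary is missing or the PDF could not be read.
    if ($status !== 0) {
        return null;
    }

    // exec() returns the output line by line; join the lines back together.
    return implode("\n", $lines);
}
```

Returning null rather than an empty string lets the caller tell "conversion failed" apart from "the PDF contains no text".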

3. Reading ZIP files with PHP

First extract the zip file with PHP's zip support, then read the files in the archive: Word files are read with antiword and PDF files with xpdf.

<?php

/** 
* Read ZIP valid file 

* @param string $file file path 
* @return string total valid content 
*/ 
function ReadZIPFile($file = '') { 
    $content = ""; 
    $inValidFileName = array (); 
    $zip = new ZipArchive ( ); 
    if ($zip->open ( $file ) === TRUE ) { 
        for($i = 0; $i < $zip->numFiles; $i ++) { 
            $entry = $zip->getNameIndex ( $i ); 
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) { 
                $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                        $entry 
                ) ); 
                $content .= CheckSystemOS ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry ); 
            } else { 
                $inValidFileName [$i] = $entry; 
            } 
        } 
        $zip->close (); 
        rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
        /*if (file_exists ( $file )) { 
            unlink ( $file ); 
        }*/ 
        return $content; 
    } else { 
        return ""; 
    } 
}

?>
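The entry filter in the loop above can also be written without a regular expression. A small sketch (isSupportedEntry is a hypothetical name, not part of the original code) that compares only the final extension of each entry name:

```php
<?php

/**
 * Hypothetical helper: decide whether a ZIP entry name has one of the
 * extensions handled in this article.
 */
function isSupportedEntry(string $entryName): bool
{
    $allowed = ['txt', 'doc', 'docx', 'pdf'];

    // pathinfo() looks only at the final extension, so "notes.txt.exe" is rejected.
    $extension = strtolower(pathinfo($entryName, PATHINFO_EXTENSION));

    return in_array($extension, $allowed, true);
}
```

Inside the loop it would be used as if (isSupportedEntry($entry)) { ... }, with the rest of the function unchanged.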

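4. Reading DOCX files with PHP

  A docx file is really a collection of XML files, and the text is stored in word/document.xml.

   Take any docx file and open it with a zip tool (or rename the .docx extension to .zip and extract it).

The word directory contains document.xml.

The content of the docx file lives in document.xml, so reading that file is all we need.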

<?php

/** 
* Read Docx File 

* @param string $file filepath 
* @return string file content 
*/ 
function parseWord($file) { 
    $content = ""; 
    $zip = new ZipArchive ( ); 
    if ($zip->open ( $file ) === TRUE ) { 
        for($i = 0; $i < $zip->numFiles; $i ++) { 
            $entry = $zip->getNameIndex ( $i ); 
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") { 
                $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                        $entry 
                ) ); 
                $filepath = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry; 
                $content = strip_tags ( file_get_contents ( $filepath ) ); 
                break; 
            } 
        } 
        $zip->close (); 
        rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
        return $content; 
    } else { 
        return ""; 
    } 
}

?>
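Two details matter for word counting here: strip_tags removes the paragraph tags without leaving any separator, and XML entities such as &amp; stay encoded. The sketch below is one possible way to handle both and to read document.xml without extracting it to disk; docxXmlToText and readDocxText are hypothetical names, not part of the original code.

```php
<?php

/**
 * Hypothetical helper: turn the XML of word/document.xml into plain text,
 * keeping one line per paragraph.
 */
function docxXmlToText(string $xml): string
{
    // Add a newline after each closing paragraph tag so that the last word of
    // one paragraph is not glued to the first word of the next.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);

    // Drop the markup, then decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');

    return trim($text);
}

/**
 * Hypothetical helper: read word/document.xml straight from the archive,
 * without extracting it to a temporary directory first.
 */
function readDocxText(string $file): string
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return '';
    }

    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? '' : docxXmlToText($xml);
}
```

Reading the entry with getFromName also means no cleanup of an extraction directory is needed afterwards.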

To create docx files with PHP, or to convert docx files to XHTML or PDF, you can use phpdocx (http://www.phpdocx.com/).

 

5. Reading TXT files with PHP

Just use PHP's file_get_contents function.

<?php

// Single quotes keep the backslashes in the Windows path literal.
$file = 'D:\xampp\htdocs\word_count\uploads\eng.txt';

$content = file_get_contents($file);

?>
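file_get_contents returns false and raises a warning when the file is missing or unreadable. A minimal sketch of a guarded read, assuming an empty string is an acceptable result for an unreadable upload (readTextFile is a hypothetical name):

```php
<?php

/**
 * Hypothetical helper: read a text file, returning '' when it cannot be read.
 */
function readTextFile(string $file): string
{
    // Check first so that file_get_contents() does not emit a warning.
    if (!is_file($file) || !is_readable($file)) {
        return '';
    }

    $content = file_get_contents($file);

    return $content === false ? '' : $content;
}
```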

6. Reading Excel files with PHP

Use PHPExcel: http://phpexcel.codeplex.com/

 

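So far we have only read the file content. How do we count the words?

PHP has a built-in function, str_word_count, that counts words, but applying it to the doc text produced by antiword gives a large error.

Here we use the following function, written specifically for counting words: 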
<?php

/** 
* statistic word count 

* @param string $content word content of the file 
* @return int word count of the content 
*/ 
function StatisticWordsCount($text = '') { 
    //    $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove extra spaces 
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more) 
    //    $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more) 
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row 
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) ); // collapse whitespace last so removed dashes leave no double spaces 
    $len = strlen ( $text ); 
    if (0 === $len) { 
        return 0; 
    } 
    $words = 1; 
    while ( $len -- ) { 
        if (' ' === $text [$len]) { 
            ++ $words; 
        } 
    } 
    return $words; 
}

?>
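The same idea can be expressed by splitting on whitespace instead of counting spaces by hand. This is a sketch of an alternative, not the function used in the article; countWordsSketch is a hypothetical name, and it treats any run of whitespace as a single separator.

```php
<?php

/**
 * Hypothetical alternative counter: strip pipes and dash runs,
 * then count the whitespace-separated tokens.
 */
function countWordsSketch(string $text): int
{
    // Remove pipe characters and runs of two or more dashes.
    $text = str_replace('|', '', $text);
    $text = preg_replace('/-{2,}/', '', $text);

    // PREG_SPLIT_NO_EMPTY drops the empty strings produced by leading,
    // trailing, or repeated whitespace.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return count($tokens);
}
```

Like the function above, this counts whitespace-separated tokens, so it suits languages that separate words with spaces.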

The complete code is as follows:

<?php 
/** 
* check system operation win or linux 

* @param string $file contain file path and file name 
* @return file content 
*/ 
function CheckSystemOS($file = '') { 
    $content = ""; 
    //    $type = substr ( $file, strrpos ( $file, '.' ) + 1 ); 
    $type = pathinfo ( $file, PATHINFO_EXTENSION ); 
    //    global $UNIX_ANTIWORD_PATH, $UNIX_XPDF_PATH; 
    if (strtoupper ( substr ( PHP_OS, 0, 3 ) ) === 'WIN') { //this is a server using windows 
        switch (strtolower ( $type )) { 
            case 'doc' : 
                $content = shell_exec ( "c:\\antiword\\antiword -f " . escapeshellarg ( $file ) ); 
                break; 
            case 'docx' : 
                $content = parseWord ( $file ); 
                break; 
            case 'pdf' : 
                $content = shell_exec ( "c:\\xpdf\\pdftotext " . escapeshellarg ( $file ) . " -" ); 
                break; 
            case 'zip' : 
                $content = ReadZIPFile ( $file ); 
                break; 
            case 'txt' : 
                $content = file_get_contents ( $file ); 
                break; 
        } 
    } else { //this is a server not using windows 
        switch (strtolower ( $type )) { 
            case 'doc' : 
                $content = shell_exec ( "/usr/local/bin/antiword -f " . escapeshellarg ( $file ) ); 
                break; 
            case 'docx' : 
                $content = parseWord ( $file ); 
                break; 
            case 'pdf' : 
                $content = shell_exec ( "/usr/bin/pdftotext " . escapeshellarg ( $file ) . " -" ); 
                break; 
            case 'zip' : 
                $content = ReadZIPFile ( $file ); 
                break; 
            case 'txt' : 
                $content = file_get_contents ( $file ); 
                break; 
        } 
    } 
    /*if (file_exists ( $file )) { 
        @unlink ( $file ); 
    }*/ 
    return $content; 
}

/** 
* statistic word count 

* @param string $content word content of the file 
* @return int word count of the content 
*/ 
function StatisticWordsCount($text = '') { 
    //    $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove extra spaces 
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more) 
    //    $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more) 
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row 
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) ); // collapse whitespace last so removed dashes leave no double spaces 
    $len = strlen ( $text ); 
    if (0 === $len) { 
        return 0; 
    } 
    $words = 1; 
    while ( $len -- ) { 
        if (' ' === $text [$len]) { 
            ++ $words; 
        } 
    } 
    return $words; 
}

/** 
* Read Docx File 

* @param string $file filepath 
* @return string file content 
*/ 
function parseWord($file) { 
    $content = ""; 
    $zip = new ZipArchive ( ); 
    if ($zip->open ( $file ) === TRUE ) { 
        for($i = 0; $i < $zip->numFiles; $i ++) { 
            $entry = $zip->getNameIndex ( $i ); 
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") { 
                $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                        $entry 
                ) ); 
                $filepath = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry; 
                $content = strip_tags ( file_get_contents ( $filepath ) ); 
                break; 
            } 
        } 
        $zip->close (); 
        rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
        return $content; 
    } else { 
        return ""; 
    } 
}

/** 
* Read ZIP valid file 

* @param string $file file path 
* @return string total valid content 
*/ 
function ReadZIPFile($file = '') { 
    $content = ""; 
    $inValidFileName = array (); 
    $zip = new ZipArchive ( ); 
    if ($zip->open ( $file ) === TRUE ) { 
        for($i = 0; $i < $zip->numFiles; $i ++) { 
            $entry = $zip->getNameIndex ( $i ); 
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) { 
                $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                        $entry 
                ) ); 
                $content .= CheckSystemOS ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry ); 
            } else { 
                $inValidFileName [$i] = $entry; 
            } 
        } 
        $zip->close (); 
        rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
        /*if (file_exists ( $file )) { 
            unlink ( $file ); 
        }*/ 
        return $content; 
    } else { 
        return ""; 
    } 
}

/** 
* remove directory 

* @param string $dir path dir 
*/ 
function rrmdir($dir) { 
    if (is_dir ( $dir )) { 
        $objects = scandir ( $dir ); 
        foreach ( $objects as $object ) { 
            if ($object != "." && $object != "..") { 
                if (filetype ( $dir . "/" . $object ) == "dir") { 
                    rrmdir ( $dir . "/" . $object ); 
                } else { 
                    unlink ( $dir . "/" . $object ); 
                } 
            } 
        } 
        reset ( $objects ); 
        rmdir ( $dir ); 
    } 
}

 

 

// Usage

$file = 'D:\xampp\htdocs\word_count\uploads\pdf-german.zip';

$word_number = StatisticWordsCount ( CheckSystemOS ( $file) );

?>
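The two switch statements in CheckSystemOS differ only in the paths of the external binaries. One possible way to avoid the duplication is a small lookup table, sketched here with a hypothetical extractorBinary helper; the paths are the ones used in this article and would need adjusting to the actual installation.

```php
<?php

/**
 * Hypothetical lookup: the external command used for each file type.
 * Returns null for types that are handled in PHP itself.
 */
function extractorBinary(string $type, bool $isWindows): ?string
{
    // Single quotes keep the backslashes in the Windows paths literal.
    $binaries = [
        'doc' => ['win' => 'c:\antiword\antiword', 'unix' => '/usr/local/bin/antiword'],
        'pdf' => ['win' => 'c:\xpdf\pdftotext', 'unix' => '/usr/bin/pdftotext'],
    ];

    $type = strtolower($type);
    if (!isset($binaries[$type])) {
        return null; // docx, zip and txt need no external program
    }

    return $binaries[$type][$isWindows ? 'win' : 'unix'];
}
```

A caller could pass pathinfo($file, PATHINFO_EXTENSION) as the type and strtoupper(substr(PHP_OS, 0, 3)) === 'WIN' as the second argument, which is the same check CheckSystemOS performs.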





http://www.it300.com/article-15290.html

           
