Converting HTML to PDF on Linux with Python and pdfkit

HTML Python Linux · Published 2018-12-04 14:34:14

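Preface

Earlier, we showed how to download the content of an HTML page and parse it with jsoup. Now we also want to convert the body of an article into a PDF. After some searching, Python combined with the pdfkit package turned out to be a very convenient and concise way to do this. This article describes how to use Python and pdfkit on a Linux server to convert HTML into a PDF file.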

Environment setup

Installing Python

First, install the packages that the build depends on

yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make

Then download the source code of 3.6.7, the latest Python 3.6 release at the time of writing, and compile and install it

wget https://www.python.org/ftp/python/3.6.7/Python-3.6.7.tgz
tar -xvf Python-3.6.7.tgz
cd Python-3.6.7
./configure --prefix=/usr/local/python3.6.7 --enable-optimizations
make && make install

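During installation, the build runs the test cases (the --enable-optimizations option uses them to collect profiling data), which is fairly slow, so be patient.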
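
As an optional sanity check that is not part of the original steps, the short script below reports which of a few standard-library modules fail to import in the new interpreter. These modules typically rely on the -devel packages installed earlier. The helper name missing_modules and the module list are illustrative choices, not something the article prescribes.

```python
import importlib
import sys


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative list: modules backed by zlib, bzip2, OpenSSL, SQLite and readline.
    wanted = ["zlib", "bz2", "ssl", "sqlite3", "readline"]
    print("Python", sys.version.split()[0])
    print("Missing modules:", missing_modules(wanted) or "none")
```

Saved under a name of your choice (for example check_build.py) and run with /usr/local/python3.6.7/bin/python3, it should print the interpreter version and, ideally, no missing modules.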

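Installing pdfkit support

pdfkit is essentially a wrapper around the functionality of the wkhtmltox component, so that component has to be installed first. Before installing it, install its dependencies as well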

yum install *xorg-x11-fonts*

wget https://downloads.wkhtmltopdf.org/0.12/0.12.5/wkhtmltox-0.12.5-1.centos7.x86_64.rpm
rpm -ivh wkhtmltox-0.12.5-1.centos7.x86_64.rpm
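
Since pdfkit only wraps the wkhtmltopdf executable, it can help to confirm that the executable is reachable before going further. A minimal sketch using only the standard library; the function name find_wkhtmltopdf is an illustrative choice:

```python
import shutil


def find_wkhtmltopdf(name="wkhtmltopdf"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf was not found on PATH")
    else:
        print("wkhtmltopdf found at", path)
```

If the executable lives somewhere outside PATH, pdfkit also accepts an explicit location through pdfkit.configuration(wkhtmltopdf=path), passed to the conversion functions as the configuration argument.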

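If no Chinese font is installed on the Linux machine, Chinese characters in the HTML will not be rendered correctly, so a Chinese font needs to be installed. First find the font file on a machine that already has Chinese fonts. I took simsun.ttc, the font file for SimSun, from the directory C:\Windows\fonts on my Windows 10 machine and copied it into the CentOS 7 font directory /usr/share/fonts.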
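
To confirm the font file actually landed in the font directory, a small lookup like the following can be used. This is only a sketch: it checks that the file exists, not that the rendering engine picks it up, and the function name find_font_files is hypothetical.

```python
from pathlib import Path


def find_font_files(font_dir, file_name):
    """Return sorted paths under `font_dir` whose file name matches `file_name`.

    The comparison ignores case, so "SimSun.ttc" matches "simsun.ttc".
    """
    root = Path(font_dir)
    if not root.is_dir():
        return []
    target = file_name.lower()
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.name.lower() == target
    )


if __name__ == "__main__":
    matches = find_font_files("/usr/share/fonts", "simsun.ttc")
    if matches:
        for path in matches:
            print("found", path)
    else:
        print("simsun.ttc is not under /usr/share/fonts")
```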

Finally, install the pdfkit module for Python 3

/usr/local/python3.6.7/bin/pip3 install pdfkit

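Writing and running the program

Converting a URL

For example, to convert one of my articles, 《在微服務中進行日誌跟蹤的理論基礎》 (The Theoretical Basis of Log Tracing in Microservices), into a PDF, write the program urltopdf.py as follows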

import pdfkit

html_url = "https://www.jianshu.com/p/5476602b6e25"  # article URL
pdf_file = "url.pdf"
pdfkit.from_url(html_url, pdf_file)

Then run the command

/usr/local/python3.6.7/bin/python3 urltopdf.py
Loading pages (1/6)
QFont::setPixelSize: Pixel size <= 0 (0)] 47%
QFont::setPixelSize: Pixel size <= 0 (0)=====>] 74%
Counting pages (2/6)
QFont::setPixelSize: Pixel size <= 0 (0)=====================] Object 1 of 1
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
QFont::setPixelSize: Pixel size <= 0 (0)=====================] Page 5 of 5
Done

It ignores some errors and then generates the file url.pdf. The result is shown in the figure below

PDF generated from the URL
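
The progress lines and QFont warnings in the log come from wkhtmltopdf itself. pdfkit forwards an options dictionary to it, so a variant of urltopdf.py can ask for quieter output and an explicit encoding. This is a sketch under assumptions: the helper names build_options and url_to_pdf are hypothetical, and whether the quiet option hides every warning on a given system has not been verified here.

```python
def build_options(quiet=True):
    """Build a wkhtmltopdf options dict in the form pdfkit expects."""
    options = {"encoding": "UTF-8"}
    if quiet:
        # An empty string marks a flag that takes no value (--quiet).
        options["quiet"] = ""
    return options


def url_to_pdf(url, output_path, quiet=True):
    """Convert the page at `url` into a PDF written to `output_path`."""
    import pdfkit  # imported here so the helper above works without pdfkit

    return pdfkit.from_url(url, output_path, options=build_options(quiet))


if __name__ == "__main__":
    url_to_pdf("https://www.jianshu.com/p/5476602b6e25", "url.pdf")
```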

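Converting a local HTML file

In the result above, the program rendered the entire HTML page into the PDF, including parts we do not want, such as the header, the footer, and some advertising. If we only want the body of the article, we can keep just that part of the page. As the figure below shows, the article content on Jianshu sits in the div whose class is article, so we copy that part and save it to the file article.html

Extracting the article content in the browser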
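
The listing of filetopdf.py is not included in the text. A minimal sketch, assuming it mirrors urltopdf.py and calls pdfkit.from_file as the traceback from the run indicates, might look like the following. The output name file.pdf and the wrapper function file_to_pdf are hypothetical, and the early existence check is an addition for clearer error messages.

```python
from pathlib import Path


def file_to_pdf(html_file, pdf_file):
    """Convert a local HTML file into a PDF using pdfkit."""
    if not Path(html_file).is_file():
        raise FileNotFoundError(html_file)
    import pdfkit  # imported late so a missing input file is reported first

    return pdfkit.from_file(html_file, pdf_file)


if __name__ == "__main__":
    html_file = "article.html"  # the fragment saved from the browser
    pdf_file = "file.pdf"       # hypothetical output name
    file_to_pdf(html_file, pdf_file)
```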

Then run the program

/usr/local/python3.6.7/bin/python3 filetopdf.py 
Traceback (most recent call last):
File "filetopdf.py", line 5, in <module>
pdfkit.from_file(html_file, pdf_file)
File "/usr/local/python3.6.7/lib/python3.6/site-packages/pdfkit/api.py", line 49, in from_file
return r.to_pdf(output_path)
File "/usr/local/python3.6.7/lib/python3.6/site-packages/pdfkit/pdfkit.py", line 156, in to_pdf
raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Error: Failed to load file://upload.jianshu.io/users/upload_avatars/9436466/c26dd766-91bc-448f-9cff-8802923531a7.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96, with network status code 302 and http status code 0 - Request for opening non-local file file://upload.jianshu.io/users/upload_avatars/9436466/c26dd766-91bc-448f-9cff-8802923531a7.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96
Error: Failed to load file://upload-images.jianshu.io/upload_images/9436466-d8bc801d86bcd7e6.png?imageMogr2/auto-orient/strip|imageView2/2/w/583/format/webp, with network status code 302 and http status code 0 - Request for opening non-local file file://upload-images.jianshu.io/upload_images/9436466-d8bc801d86bcd7e6.png?imageMogr2/auto-orient/strip|imageView2/2/w/583/format/webp
Error: Failed to load file://upload-images.jianshu.io/upload_images/9436466-22063aaf10a835da.png?imageMogr2/auto-orient/strip|imageView2/2/w/525/format/webp, with network status code 302 and http status code 0 - Request for opening non-local file file://upload-images.jianshu.io/upload_images/9436466-22063aaf10a835da.png?imageMogr2/auto-orient/strip|imageView2/2/w/525/format/webp
Error: Failed to load file://upload-images.jianshu.io/upload_images/9436466-46b2fc767f4bfca1.png?imageMogr2/auto-orient/strip|imageView2/2/w/396/format/webp, with network status code 302 and http status code 0 - Request for opening non-local file file://upload-images.jianshu.io/upload_images/9436466-46b2fc767f4bfca1.png?imageMogr2/auto-orient/strip|imageView2/2/w/396/format/webp
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Exit with code 1 due to network error: ProtocolInvalidOperationError

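As you can see, there were some errors, but the file was still generated. Its content, however, was garbled. This is because what we saved in the file is only a fragment of HTML code, which does not declare the encoding to use when processing the file. So we edit the file by hand and add the following at the top of the HTML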

<!DOCTYPE html><html>
<head>
<meta charset="UTF-8"><title>Title</title>
</head>
<body>

and add the following at the end

</body>
</html>

This forms a complete HTML structure and declares the character set as UTF-8. After rerunning, the result is as follows

First version of the PDF
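
The manual edit of wrapping the fragment in a complete document can also be scripted. The sketch below produces the same head and tail shown earlier; the function name wrap_fragment is illustrative, and the title is inserted without HTML escaping, so it is assumed to be plain text.

```python
def wrap_fragment(fragment, title="Title"):
    """Wrap an HTML fragment in a complete document that declares UTF-8."""
    return (
        "<!DOCTYPE html><html>\n"
        "<head>\n"
        '<meta charset="UTF-8"><title>{}</title>\n'
        "</head>\n"
        "<body>\n"
        "{}\n"
        "</body>\n"
        "</html>\n"
    ).format(title, fragment)


if __name__ == "__main__":
    # Read the saved fragment and rewrite it as a full document.
    with open("article.html", encoding="utf-8") as src:
        fragment = src.read()
    with open("article.html", "w", encoding="utf-8") as dst:
        dst.write(wrap_fragment(fragment))
```

Running the script more than once would nest the wrapper, so it is meant to be applied a single time to the raw fragment.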

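The text now displays correctly, but the author's avatar and the illustrations in the article are missing. This is because the image addresses in the saved HTML have no scheme (they begin with //), so when pdfkit loads the local file they resolve to file:// addresses that cannot be read, and the images are left blank. The log from the previous run shows this fairly clearly. We only need to complete the addresses in the HTML file: here, prefix the address in the src of every img tag with https:. Then rerun the command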

/usr/local/python3.6.7/bin/python3 filetopdf.py 
Loading pages (1/6)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
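
Adding https: to every img src by hand gets tedious for longer articles. A minimal sketch that makes the same edit with regular expressions, assuming the addresses are written in the //host/path form seen in the error log; the function name add_scheme_to_img_src is illustrative:

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Matches src="// or src='// (also attributes ending in -src, e.g. data-original-src).
SRC_ATTR = re.compile(r"""(\bsrc\s*=\s*["'])//""", re.IGNORECASE)


def add_scheme_to_img_src(html, scheme="https"):
    """Prefix scheme-less image addresses inside img tags with `scheme`."""

    def fix_tag(match):
        # Rewrite only inside the matched img tag; other text is left untouched.
        return SRC_ATTR.sub(
            lambda m: m.group(1) + scheme + "://", match.group(0)
        )

    return IMG_TAG.sub(fix_tag, html)


if __name__ == "__main__":
    with open("article.html", encoding="utf-8") as src:
        html = src.read()
    with open("article.html", "w", encoding="utf-8") as dst:
        dst.write(add_scheme_to_img_src(html))
```

A regular expression is enough for this one-off cleanup; for messier markup an HTML parser would be the safer choice.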

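The run no longer reports any errors, and the author's avatar is now displayed, but the images in the content still do not appear. I suspected the cause was the string after png in an image address such as https://upload-images.jianshu.io/upload_images/9436466-d8bc801d86bcd7e6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/583/format/webp. After removing it, the images do appear in the PDF, but in the wrong positions, and the page layout also differs from the original page. The reason is simple: we have not loaded the style sheets of the original page. So we copy the external style sheet files and the inline style information from the original page into the corresponding places in article.html and run again, and the result differs little from the original page. The result is as follows

Final version of the PDF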
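
Removing the part after png from an image address can likewise be done in code rather than by hand. A small sketch using the standard library's URL parser; the function name strip_query is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return `url` without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query="", fragment=""))


if __name__ == "__main__":
    image_url = (
        "https://upload-images.jianshu.io/upload_images/"
        "9436466-d8bc801d86bcd7e6.png"
        "?imageMogr2/auto-orient/strip%7CimageView2/2/w/583/format/webp"
    )
    print(strip_query(image_url))
```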

In this way, a few lines of code are enough to convert HTML to PDF, which is indeed very convenient. Highly recommended.
