
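[Original] Python encoding: Chinese character encodings

I took a look at Python's character encoding settings and puzzled over them for quite a while. The default-encoding setting seems to do nothing: whichever encoding I set, the results below are identical.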

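Here is how to set it:

In the site-packages directory under Python's lib directory, create a new file named sitecustomize.py:

C:\Python27\lib\site-packages\sitecustomize.py

Enter the following content, then save and close.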

# this file can be anywhere in your Python path,

# but it usually goes in ${pythondir}/lib/site-packages/

import sys

sys.setdefaultencoding('iso-8859-1')  # swap in 'ascii', 'utf-8' or 'gb2312' for the other runs

I also tried ascii (the default), UTF-8 and gb2312 in turn.

After each change, restart the Python IDE.

The results were as follows:

1. iso-8859-1

>>> import sys

>>> sys.getdefaultencoding()

'iso-8859-1'

>>> s=u'我是中國人'

>>> s

u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

ÎÒÊÇÖÐ¹úÈË

>>> s='我是中國人'

>>> s

'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

我是中國人

>>> 

2. ascii (the default encoding; sitecustomize.py is not needed)

>>> import sys

>>> sys.getdefaultencoding()

'ascii'

>>> s=u'我是中國人'

>>> s

u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

ÎÒÊÇÖÐ¹úÈË

>>> s='我是中國人'

>>> s

'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

我是中國人

>>> 

3. UTF-8

>>> import sys

>>> sys.getdefaultencoding()

'UTF-8'

>>> s=u'我是中國人'

>>> s

u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

ÎÒÊÇÖÐ¹úÈË

>>> s='我是中國人'

>>> s

'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

我是中國人

>>> 

4. gb2312

>>> import sys

>>> sys.getdefaultencoding()

'gb2312'

>>> s=u'我是中國人'

>>> s

u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

ÎÒÊÇÖÐ¹úÈË

>>> s='我是中國人'

>>> s

'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

我是中國人

>>> 

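Notice that the output is identical in all four cases. That left me frustrated: what is this setting actually for?

According to the book, once the default encoding is set, you can use it like this:

>>> s=u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb' # exactly the bytes of '我是中國人'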

>>> s

u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> print s

ÎÒÊÇÖÐ¹úÈË

>>> 

But all four encodings give the same result, which left me even more at a loss.
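One plausible reading of these transcripts (an assumption, not something the post confirms) is that the shell handed Python the GB2312 bytes of the text, and the `u'...'` literal turned each byte into one code point, which is what a latin-1 decode does. The sketch below is written for Python 3, where bytes and text are separate types; `RAW` is copied from the transcripts and the variable names are illustrative.

```python
# Bytes copied from the transcripts above.
RAW = b'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

# Decoding with the codec that produced the bytes recovers the 5-character text.
as_gb2312 = RAW.decode('gb2312')

# Decoding as latin-1 maps every byte to one code point, giving 10 characters.
as_latin1 = RAW.decode('latin-1')

print(len(as_gb2312), len(as_latin1))
# The latin-1 result is the same string the u'...' literal showed in the shell.
print(as_latin1 == u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb')
```

Neither decode consults `sys.getdefaultencoding()`, which would be consistent with the four runs looking identical.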

Just now I tried reading an XML file containing Chinese text, and it raised an error. The XML file is as follows:

<?xml version="1.0" encoding="gb2312"?>      

<preface>

<title>我是中國人</title>                   

</preface>

In the Python IDE:

>>> from xml.dom import minidom

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#1>", line 1, in <module>

    xmldoc=minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: not well-formed (invalid token): line 3, column 10

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

>>> 

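I had no idea what the problem was. Encoding names are generally case-insensitive, but just to try it I changed the encoding in the XML declaration to uppercase GB2312 and read the file again: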

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

>>> 

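Huh, why? The exception is gone.

Fine, so be it. From now on, when you declare an encoding in XML, HTML or anywhere else, try to write it in uppercase!

But a new problem showed up right away when I tried to print what had been read: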

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

>>> title=xmldoc.getElementsByTagName('title')[0].firstChild.data

>>> title

u'\u6445\u646e\ufae0$\u6563\u726f\u876e\u5f73\u1a50\u01fe'

>>> print title

攄摮$散牯蝮彳Ǿ

[Note] This... is garbled again!

How about converting the text to another encoding before printing it?

>>> converttitle=title.encode('GB2312')

Traceback (most recent call last):

  File "<pyshell#6>", line 1, in <module>

    converttitle=title.encode('GB2312')

UnicodeEncodeError: 'gb2312' codec can't encode character u'\u646e' in position 1: illegal multibyte sequence

>>> converttitle=title.encode('UTF-8')

>>> converttitle

'\xe6\x91\x85\xe6\x91\xae\xef\xab\xa0$\xe6\x95\xa3\xe7\x89\xaf\xe8\x9d\xae\xe5\xbd\xb3\xe1\xa9\x90\xc7\xbe'

>>> print converttitle

攄摮$散牯蝮彳Ǿ

>>> a='我是中國人'

>>> a

'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

>>> converttitle=title.encode('gb2312')

Traceback (most recent call last):

  File "<pyshell#12>", line 1, in <module>

    converttitle=title.encode('gb2312')

UnicodeEncodeError: 'gb2312' codec can't encode character u'\u646e' in position 1: illegal multibyte sequence

>>> converttitle=title.encode('ascii')

Traceback (most recent call last):

  File "<pyshell#13>", line 1, in <module>

    converttitle=title.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

>>> converttitle=title.encode('iso-8859-1')

Traceback (most recent call last):

  File "<pyshell#14>", line 1, in <module>

    converttitle=title.encode('iso-8859-1')

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

>>> 

What now? Only utf-8 can encode it, and the output is still garbled. This problem needs more investigation.
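The tracebacks above fit a general pattern: UTF-8 can encode any Unicode code point, so a successful `encode('UTF-8')` says nothing about whether the text was decoded correctly in the first place. A small sketch of that check (the helper name `can_encode` is made up for illustration):

```python
def can_encode(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# The character named in the gb2312 traceback above.
suspect = u'\u646e'

for codec in ('gb2312', 'ascii', 'iso-8859-1', 'utf-8'):
    print(codec, can_encode(suspect, codec))
```

If the decoded text contains characters that the file's own codec cannot encode, the decoding step is the likelier culprit, not the output step.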

One more question: with the XML still declaring GB2312, I set the sitecustomize.py user configuration to iso-8859-1, ascii, utf-8 and GB2312 in turn and read the XML each time.

>>> import sys

>>> sys.getdefaultencoding()

'GB2312'

>>> from xml.dom import minidom

>>> xmldoc = minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#3>", line 1, in <module>

    xmldoc = minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: unknown encoding: line 1, column 30

>>> 

>>> import sys

>>> sys.getdefaultencoding()

'UTF-8'

>>> from xml.dom import minidom

>>> xmldoc = minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#3>", line 1, in <module>

    xmldoc = minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: not well-formed (invalid token): line 3, column 10

>>> 

>>> import sys

>>> sys.getdefaultencoding()

'ascii'

>>> from xml.dom import minidom

>>> xmldoc = minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#3>", line 1, in <module>

    xmldoc = minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: not well-formed (invalid token): line 3, column 10

>>> 

>>> import sys

>>> sys.getdefaultencoding()

'iso-8859-1'

>>> from xml.dom import minidom

>>> xmldoc = minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#3>", line 1, in <module>

    xmldoc = minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: unknown encoding: line 1, column 30

>>> 

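[Note] Notice that UTF-8 and ASCII raise ExpatError: not well-formed (invalid token): line 3, column 10. Wasn't that the earlier upper/lowercase problem? And why do GB2312 and iso-8859-1 raise ExpatError: unknown encoding: line 1, column 30 instead? Let me delete the configured encoding and try again: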

>>> import sys

>>> sys.getdefaultencoding()

'ascii'

>>> from xml.dom import minidom

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#3>", line 1, in <module>

    xmldoc=minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: unknown encoding: line 1, column 30

>>> 

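Wrong, wrong. This overturns the earlier upper/lowercase explanation, and now it is all a mess: both gb2312 and GB2312 give unknown encoding.

One more try: keep the configuration file but make it do nothing, leaving only import sys and commenting out the line below it. The IDE result: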

>>> import sys

>>> sys.getdefaultencoding()

'ascii'

>>> from xml.dom import minidom

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

Traceback (most recent call last):

  File "<pyshell#3>", line 1, in <module>

    xmldoc=minidom.parse('./mytest/russiansample.xml')

  File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse

    return expatbuilder.parse(file)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse

    result = builder.parseFile(fp)

  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile

    parser.Parse(buffer, 0)

ExpatError: not well-formed (invalid token): line 3, column 10

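Isn't this the same error as in the earlier upper/lowercase case? Sigh, changing the case did not help... So how on earth do I read this XML content? Was the one successful read earlier just a coincidence?

I was thoroughly confused. Still, I copied one of QQ's XML files, SSOConfig.xml, and modified it as follows: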

<?xml version="1.0" encoding="utf-8" ?>

<i18n>

         <StringBundle>

地區資訊,目前只需要一個, SSOPlatform不需要地區資訊

         </StringBundle>

</i18n>

>>> xmldoc=minidom.parse('./mytest/SSOConfig.xml')

>>> 

Damn, it finally works. Then I tried copying this content into russiansample.xml:

<?xml version="1.0" encoding="utf-8" ?>

<i18n>

         <StringBundle>

地區資訊,目前只需要一個, SSOPlatform不需要地區資訊

         </StringBundle>

         <preface>

                   <title>

我是中國人

                   </title>

         </preface>

</i18n>

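This XML is fine, right? I thought so, yet I got the same exception. The file contents are identical and only the name differs, so why would it fail? I refused to believe it, and finally I found the problem. Here is a very important piece of information: the file's own encoding! I struggled with this for a long time, so try it yourself. Open russiansample.xml, choose Save As, change the encoding from the default ANSI to UTF-8, then save and replace the file.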

>>> xmldoc=minidom.parse('./mytest/russiansample.xml')

>>> xmldoc.getElementsByTagName('title')

[<DOM Element: title at 0x21f3328>]

>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data

>>> title

u'\n\t\t\u5730\u533a\u4fe1\u606f\uff0c\u76ee\u524d\u53ea\u9700\u8981\u4e00\u4e2a, SSOPlatform\u4e0d\u9700\u8981\u5730\u533a\u4fe1\u606f\n\t'

>>> print title

地區資訊,目前只需要一個, SSOPlatform不需要地區資訊

>>> c=title.encode('gb2312')

>>> c

'\n\t\t\xb5\xd8\xc7\xf8\xd0\xc5\xcf\xa2\xa3\xac\xc4\xbf\xc7\xb0\xd6\xbb\xd0\xe8\xd2\xaa\xd2\xbb\xb8\xf6, SSOPlatform\xb2\xbb\xd0\xe8\xd2\xaa\xb5\xd8\xc7\xf8\xd0\xc5\xcf\xa2\n\t'

>>> print c

地區資訊,目前只需要一個, SSOPlatform不需要地區資訊

>>> 

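It finally worked, and no transcoding is needed before printing. I am done experimenting. One last remark: the file's encoding matters a lot, especially on Windows!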
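The effect of the file encoding can be reproduced without touching the disk. The sketch below parses the same document twice: once with bytes that agree with the declared encoding, and once with GB2312 bytes, which is roughly what an ANSI save might produce on a Simplified Chinese Windows (an assumption about the author's setup). `read_title` is an illustrative helper, and the code is written for Python 3.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The declaration promises utf-8; the title holds five Chinese characters.
XML_TEXT = (u'<?xml version="1.0" encoding="utf-8"?>\n'
            u'<preface>\n'
            u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n'
            u'</preface>')

def read_title(data):
    """Parse XML bytes; return the title text, or None if parsing fails."""
    try:
        doc = minidom.parseString(data)
    except ExpatError:
        return None
    return doc.getElementsByTagName('title')[0].firstChild.data

matched = XML_TEXT.encode('utf-8')      # bytes agree with the declaration
mismatched = XML_TEXT.encode('gb2312')  # bytes contradict the declaration

print(read_title(matched) is not None)
print(read_title(mismatched) is not None)
```

Only the bytes differ between the two calls; the declaration and the visible text are the same, which matches the "identical contents, different result" puzzle described above.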

Tested personally: when the encoding declared in the XML is UTF-8, the file itself must be saved as utf-8. Then the XML can be read directly in the IDE.
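When the file is written from Python rather than from an editor's Save As dialog, the same rule can be followed by passing the encoding explicitly. A minimal sketch; the helper name `save_and_read_title` and the temporary file are illustrative and not from the post:

```python
import io
import os
import tempfile
from xml.dom import minidom

XML_TEXT = (u'<?xml version="1.0" encoding="utf-8"?>\n'
            u'<preface>\n'
            u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n'
            u'</preface>')

def save_and_read_title(text, encoding):
    """Write text to a temporary file in the given encoding, then parse it."""
    fd, path = tempfile.mkstemp(suffix='.xml')
    os.close(fd)
    try:
        # io.open takes an explicit encoding instead of the platform default.
        with io.open(path, 'w', encoding=encoding) as handle:
            handle.write(text)
        doc = minidom.parse(path)
        return doc.getElementsByTagName('title')[0].firstChild.data
    finally:
        os.remove(path)

title = save_and_read_title(XML_TEXT, 'utf-8')
print(len(title))
```

The encoding passed to `io.open` has to be the one named in the XML declaration; here both are utf-8.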

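P.S. QQ seems to encode and save all of its XML files as utf-8; Baidu seems to use gb2312 for most of them.

Since that one time when changing the case let me read the gb2312 XML, I have never reproduced it. I do not even get garbled output, only exceptions. Use utf-8 whenever possible; it avoids most of the XML encoding problems described here.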
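For files that are already stored as gb2312, one possible workaround (a sketch, not the author's method) is to decode the bytes in Python and hand the parser UTF-8 instead. It assumes the file really is GB2312 and begins with an XML declaration; `parse_gb2312_xml` is a hypothetical helper name.

```python
import re
from xml.dom import minidom

def parse_gb2312_xml(raw_bytes):
    """Decode GB2312 bytes, re-declare the document as utf-8, then parse."""
    text = raw_bytes.decode('gb2312')
    # Replace the leading declaration so it matches the bytes passed on.
    text = re.sub(r'^<\?xml[^>]*\?>',
                  '<?xml version="1.0" encoding="utf-8"?>',
                  text, count=1)
    return minidom.parseString(text.encode('utf-8'))

# Stand-in for the contents of a gb2312 file read in binary mode.
sample = (u'<?xml version="1.0" encoding="gb2312"?>\n'
          u'<preface>\n'
          u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n'
          u'</preface>').encode('gb2312')

doc = parse_gb2312_xml(sample)
title = doc.getElementsByTagName('title')[0].firstChild.data
print(len(title))
```

With a real file, the bytes would come from opening it in binary mode (`'rb'`) so that Python does no decoding of its own before the explicit `decode('gb2312')`.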

9/15/2013 17:57:47

Original work. Please include a link to this post when reposting. Thanks!