[Python Development] Remember to encode URLs containing Chinese characters with encode("utf-8")
阿新 • Published: 2019-01-23
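Using Chinese in a URL is a bad habit that brings a whole series of encoding problems. I prefer an English name or an id to identify a URI. Reality is often harsh, though: when calling someone else's service, we are sometimes forced to use a Chinese URL.
Unicode conversion in Python has always been a headache. After running into the wall several times I had worked out a few tricks, and after studying "Python字串的encode與decode" I thought I had found a cure-all. Then I ran into this old nemesis once again.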
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1142, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.6/httplib.py", line 951, in _send_request
    self.endheaders()
  File "/usr/lib/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib/python2.6/httplib.py", line 759, in send
    self.sock.sendall(str)
  File "<string>", line 1, in sendall
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
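This time the error is raised from inside urlopen(), which makes it rather distinctive. Calling url.encode('utf-8') on the URL before passing it in solves the problem. Today I ran a few tests.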
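The failure can be reproduced without any network access. The sketch below is only an illustration: the URL on example.com is a hypothetical one, not taken from the original session, and the code uses nothing but string methods that behave the same way on Python 2 and Python 3. Encoding with the ascii codec fails in roughly the way the implicit conversion inside sendall() did, while an explicit UTF-8 encode yields a plain byte string.

```python
# -*- coding: utf-8 -*-
# Hypothetical URL for illustration; host and path are assumptions.
url = u"http://example.com/你好"

try:
    # Roughly what the implicit conversion does when a unicode
    # string reaches the socket layer in Python 2.
    url.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # The Chinese characters have no ASCII representation.
    ascii_failed = True

# Explicit encoding produces a byte string with no unicode left in it.
encoded = url.encode("utf-8")
```

After this runs, ascii_failed is True and encoded holds only bytes, which is the form the author passes to urlopen().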
1. ascii + unicode test
>>> 'a' + u'b'
>>> '你' + u'好'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u'你' + u'好'
u'\u4f60\u597d'
>>> u'a' + '你' + u'好'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
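The tests above show what happens when a byte string (str) is concatenated with a unicode string. The conclusion: whenever Chinese is involved, remember the u prefix and there will be no problem. When the two types are mixed, Python 2 first decodes the byte string with its default codec, ascii, and ascii cannot decode the bytes that make up Chinese characters.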
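The implicit step can be made explicit. The following is a minimal sketch, not the interpreter's actual implementation: it decodes the UTF-8 bytes of '你' with the ascii codec by hand, which fails on the first byte just as in the traceback above, and then decodes them with UTF-8 before concatenating. The variable names are made up for illustration, and the code runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of the character; the first byte is 0xe4.
raw = u"你".encode("utf-8")

try:
    # What Python 2 attempts implicitly for str + unicode.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in
# gives a unicode string that concatenates cleanly.
joined = raw.decode("utf-8") + u"好"
```

Decoding explicitly with the right codec is an alternative to the u prefix when the bytes come from somewhere other than a literal, such as a file or a request parameter.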
2. urllib2 test
>>> import urllib2
>>> urllib2.urlopen(h1)
<addinfourl at 153439532 whose fp = <socket._fileobject object at 0xb74e51ac>>
>>> urllib2.urlopen(h2)
<addinfourl at 153440236 whose fp = <socket._fileobject object at 0x925912c>>
>>> urllib2.urlopen(h3)
<addinfourl at 153482348 whose fp = <socket._fileobject object at 0x92593ac>>
>>> urllib2.urlopen(h4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1142, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.6/httplib.py", line 951, in _send_request
    self.endheaders()
  File "/usr/lib/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib/python2.6/httplib.py", line 759, in send
    self.sock.sendall(str)
  File "<string>", line 1, in sendall
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
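This test shows that urllib2.urlopen() accepts English URLs whether they are byte strings or unicode, and also accepts Chinese URLs given as byte strings (the three calls with h1 to h3 succeed). As soon as the URL is a unicode string containing Chinese, as with h4, it raises the encoding error.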
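For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, a related way to get an ASCII-only URL is percent-encoding rather than a bare encode('utf-8'). This is a different technique from the one used in the article. The sketch below assumes a hypothetical helper named to_ascii_url and a made-up example.com URL; it uses only urllib.parse functions from the standard library and does not attempt to handle non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url, encoding="utf-8"):
    """Percent-encode non-ASCII characters in the path and query of url.

    Hypothetical helper, not part of urllib. The host is left untouched,
    so internationalized domain names are out of scope here.
    """
    parts = urlsplit(url)
    # "%" stays in the safe set so existing escapes are not doubled.
    path = quote(parts.path, safe="/%", encoding=encoding)
    query = quote(parts.query, safe="=&%+", encoding=encoding)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )


# Hypothetical URL for illustration only.
safe_url = to_ascii_url("http://example.com/你好")
```

The result can be passed to urllib.request.urlopen() like any other ASCII URL. A URL that is already pure ASCII passes through unchanged, so the helper can be applied unconditionally.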
So: use English URLs whenever you can, and if you really must use Chinese, remember to encode it.