The main site for this article has a dead link, so I've copied it here from Google's cache for the time being. I hope nobody minds.
Reference: http://www.reportlab.com/i18n/python..._tutorial.html
Unicode Tutorial
This is a brief tutorial aimed at explaining the Unicode additions to Python. Please help me keep it up to date and accurate!
Why is Python getting Unicode support?
Once you get beyond the ASCII world, there are many different native encodings for different languages and operating systems. Converting between all of these is easiest with a central "common point", and that is Unicode. Unicode is a single character set, handled here as two bytes per character, which covers all of the world's common writing systems. It is important for many reasons:
Data Storage
If your customer database is all English, or even all Japanese, you can store it any way you like. But if you have to keep English, Japanese, Russian and Thai in the same file or database column, you can't use a native encoding; you really need something like Unicode.
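To make the storage problem concrete, here is a rough sketch in present-day Python 3 (an assumption on my part; the tutorial itself describes the 1.6 additions). The sample record and the list of native codecs are made up for illustration. It checks whether one mixed-language record fits in each native encoding, then stores it as UTF-8 instead.

```python
# Assumption: modern Python 3, where str is already a Unicode string.
# The sample record and the codec list are illustrative, not from the tutorial.
record = "Hello / こんにちは / Привет / สวัสดี"

native_codecs = ["ascii", "shift_jis", "koi8_r", "tis_620"]


def can_store(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Each native encoding lacks at least one of the four scripts.
results = {codec: can_store(record, codec) for codec in native_codecs}

# A Unicode encoding such as UTF-8 holds all four scripts at once.
utf8_bytes = record.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

for codec, ok in results.items():
    print(codec, "can store the record:", ok)
print("utf-8 round trip intact:", round_trip == record)
```

None of the native codecs in this list can hold the whole record, while the UTF-8 bytes decode back to exactly the original text.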
Encoding Conversion
If a new encoding needs to be added to a library, it is only necessary to establish a mapping to and from Unicode, and not to every other encoding in the world.
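The "common point" idea can be sketched as follows, again assuming present-day Python 3 rather than the 1.6 API. The helper name convert is hypothetical; shift_jis and euc_jp are codecs that ship with the standard library. Neither codec knows about the other; each only maps to and from Unicode.

```python
# Assumption: modern Python 3. convert() is a hypothetical helper name.
def convert(data, source, target):
    """Re-encode bytes from source to target by passing through Unicode."""
    return data.decode(source).encode(target)


sjis_bytes = "日本語".encode("shift_jis")
eucjp_bytes = convert(sjis_bytes, "shift_jis", "euc_jp")

# The two byte sequences differ, but both decode to the same Unicode text.
print("bytes differ:", sjis_bytes != eucjp_bytes)
print("same text:", eucjp_bytes.decode("euc_jp") == "日本語")
```

With N encodings this needs N mappings to and from Unicode, not a separate converter for every pair.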
Operations on wide characters
Asian languages have to use more than one byte per character. Most native encodings use a mix of single bytes for ASCII, and two bytes per Chinese character. Software that needs to slice strings can potentially cut a character in half. It is much, much easier to write string-processing operations in Unicode, where every character is the same width.
Operating System Compatibility
For the above reasons, operating systems and low-level APIs have been moving to support Unicode, and there are more and more functions around which expect Unicode strings as arguments, or which return them.
Installation and Setup
A Unicode-aware Python is currently available in the public CVS repository; if you don't use CVS, you can get a nightly tar.gz or zip file.
If you are used to using a compiled binary distribution of Python, such as the one on Windows, you need to make some potentially destructive changes to your existing environment. Here are some guidelines: (TODO)
TODO - extract the diffs for the library - exceptions
Overwrite python.exe and pythonw.exe in C:\Program files\python with the ones in the zip file.
If you use Pythonwin, it won't start, because the scintilla code makes an illegal use of list.append(), which Guido is going to ban for 1.6. Look in ..\python\pythonwin\pywin\scintilla\view.py, lines 71 and 73, and add the extra brackets so they look like this:
event_commands.append((event, val))
for name, id in _extra_event_commands:
    event_commands.append((name, id))
After this change, Pythonwin should work again.
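For anyone wondering why the extra brackets matter, here is a minimal sketch in present-day Python 3 (an assumption; the values are placeholders, not taken from view.py). list.append() takes exactly one argument, so a pair has to be passed as a single tuple.

```python
# Assumption: modern Python 3; "event" and 42 are placeholder values.
event_commands = []

try:
    # Two separate arguments: append() rejects this call.
    event_commands.append("event", 42)
    accepted = True
except TypeError:
    accepted = False

# One tuple argument, written with the extra brackets: this works.
event_commands.append(("event", 42))

print("two-argument call accepted:", accepted)
print(event_commands)
```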
Viewing data in different encodings
To actually look at data in different encodings, the best tool is a web browser. You may not know it, but IE and Netscape can both display many common encodings (including Asian and Middle Eastern scripts, if you download the right fonts). The View | Encodings menu in IE5 controls how the current page is interpreted:
Let's imagine we have an HTML file containing the name of the author of the Unicode extensions, encoded in ISO-Latin-1. This contains the letter e with an acute accent, which is not available in ASCII. If you have this encoding selected in the browser, and have any Asian fonts installed, you should see this:
If you now go and select (say) UTF8, the bytes will be interpreted differently and you will see this instead:
That's because the three bytes from the acute-e to the 'L' in Lemburg are UTF8 for a Chinese character. (If you don't have the fonts installed, you'll see a round blob, which is the generic 'I don't know how to display this' symbol in IE5.)
Seen from a long distance away, the Unicode extensions are largely about how to prevent this kind of thing: letting you explicitly control the encodings of the files you work with, and converting between them as needed. Unicode itself is an internal technology to make this easier.
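The same effect can be reproduced outside a browser. The sketch below assumes present-day Python 3, and its decoder is stricter than IE5: instead of inventing a Chinese character from the stray bytes, it substitutes the replacement character U+FFFD. Going the other way, UTF-8 bytes read as ISO-Latin-1 turn the accented letter into two unrelated characters.

```python
# Assumption: modern Python 3. The name is written with an escape so the
# example does not depend on the encoding of this text itself.
name = "Andr\u00e9 Lemburg"

# Bytes as they would sit in an ISO-Latin-1 file.
latin1_bytes = name.encode("iso-8859-1")

# Interpreted with the right encoding, the name comes back intact.
as_latin1 = latin1_bytes.decode("iso-8859-1")

# Interpreted as UTF-8, the accented byte is not valid on its own.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# The reverse mistake: UTF-8 bytes read as ISO-Latin-1.
utf8_as_latin1 = name.encode("utf-8").decode("iso-8859-1")

print(ascii(as_latin1))
print(ascii(as_utf8))
print(ascii(utf8_as_latin1))
```

In both wrong cases no error is raised unless you ask for one, which is why explicitly tracking the encoding of each file matters.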
Basics about Unicode strings
Creating Unicode Strings
We'll run through a few snippets. The first is to look at the ways of creating Unicode strings. You can convert ASCII text to Unicode with a literal notation, prefixing a 'u' before the string. Unicode strings are printed to the console with a preceding 'u'.
>>> u"Hello World!" #create a Unicode string
u'Hello World!'
To construct the string, Python assumed that the literal input was in UTF8, the "default encoding". UTF8 is a way of encoding Unicode such that the basic ASCII characters remain themselves; most other single-byte writing systems end up as two bytes; and Chinese characters end up as three bytes.
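The byte widths described above are easy to check. This is a sketch in present-day Python 3 (an assumption; the four sample characters are my own choice): a basic ASCII letter, an accented Latin letter, a Cyrillic letter and a Chinese character.

```python
# Assumption: modern Python 3; sample characters are illustrative.
samples = {
    "ASCII letter A": "A",
    "Latin e with acute": "\u00e9",
    "Cyrillic ya": "\u044f",
    "Chinese zhong": "\u4e2d",
}

# Number of bytes each character occupies once encoded as UTF-8.
widths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, width in widths.items():
    print(label, "->", width, "byte(s)")

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_unchanged = "Hello World!".encode("utf-8") == "Hello World!".encode("ascii")
print("ASCII unchanged by UTF-8:", ascii_unchanged)
```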
Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:
>>> unicode('hello')
u'hello'
>>> unicode('hello', 'ascii')
u'hello'
>>> unicode('hello', 'iso-8859-1')
u'hello'
>>>
All three of these return the same thing, since the characters in 'hello' are common to all three encodings.
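For readers on a current interpreter: the unicode() built-in shown above does not exist in Python 3, where str is already Unicode. A rough equivalent, assuming Python 3, is to decode a bytes object, either with bytes.decode() or with the two-argument form of str().

```python
# Assumption: modern Python 3, where the tutorial's unicode() is gone.
raw = b"hello"

# No encoding given: Python 3 decodes as UTF-8 by default.
default_decoded = raw.decode()

# Explicit encodings, in the two spellings available.
ascii_decoded = raw.decode("ascii")
latin1_decoded = str(raw, "iso-8859-1")

# As in the session above, all three give the same string.
all_equal = default_decoded == ascii_decoded == latin1_decoded
print(default_decoded, ascii_decoded, latin1_decoded, all_equal)
```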
Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.
>>> a = unicode('Andr