>>35 Þ sorts differently in different languages.
> Now if you were just blindly manipulating strings, as you suggest, you would have a problem,
Read my post again. I didn't say anything about blindly doing anything: I actually said the exact opposite.
>>> print sorted(file("test.txt","r").readlines())
['PAGAN\n', 'PORN\n', 'PRALINE\n', 'TANK\n', 'THEME\n', 'TITMOUSE\n', 'YARN\n', 'YOUHOU\n', 'ZEBRA\n', '\xc3\x9eORN\n']
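To be clear about what that output shows: ÞORN lands last only because its first byte, 0xC3, is bigger than any ASCII byte. A rough Python 3 sketch of the difference, assuming a made-up `english_key` that files Þ under "TH" (a hypothetical stand-in for one convention, not real collation):

```python
# Python 3 sketch. english_key is a hypothetical illustration, not a real
# collation algorithm; real code would use locale- or ICU-based collation.
THORN = "\u00de"  # the letter Þ

words = ["ZEBRA", THORN + "ORN", "TITMOUSE", "THEME", "TANK", "PAGAN"]

def english_key(word):
    # Assumption: treat Þ as the digraph "TH" for ordering purposes.
    return word.replace(THORN, "TH")

# Plain sorted() compares code points, so Þ (U+00DE) sorts after Z (U+005A).
by_codepoint = sorted(words)

# With the "TH" convention, ÞORN falls between THEME and TITMOUSE.
by_english = sorted(words, key=english_key)

print(by_codepoint)
print(by_english)
```

Neither order is "the" right one. Which one you want depends on the language of whoever is reading the list, and that is information the bytes in the file don't carry.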
What's the encoding of this file again? I know, let's assume utf-8!
>>> for i in file("test.txt","r").readlines(): print i.decode('utf-8')
Well that seems to work. Let's just hope users never actually control the contents of test.txt:
>>> for i in file("test.txt","r").readlines(): print i.decode('utf-8')
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
It happens all the time. People do a lot of work, then some user posts actual Unicode where it isn't expected, and through the magic of transcoding the entire database is hosed.
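For what it's worth, here is a minimal Python 3 sketch of doing the opposite of "blindly": decode strictly at the point where the bytes come in, and set aside whatever fails instead of letting it through. `decode_lines` and the sample bytes are made up for illustration. The last sample line is ÞORN encoded as Latin-1, which is not valid UTF-8.

```python
# Python 3 sketch; decode_lines is a hypothetical helper, not a library function.
def decode_lines(raw_lines, encoding="utf-8"):
    """Return (decoded_lines, rejects) for an iterable of byte strings."""
    good, bad = [], []
    for lineno, raw in enumerate(raw_lines, 1):
        try:
            # bytes.decode is strict by default and raises on invalid input.
            good.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            # Keep the line number, the raw bytes and the reason for later inspection.
            bad.append((lineno, raw, exc.reason))
    return good, bad

sample = [b"PAGAN\n", b"\xc3\x9eORN\n", b"\xdeORN\n"]
good, bad = decode_lines(sample)

print(good)
print(bad)
```

This doesn't tell you what encoding the bad line was actually in. It only stops the guess from silently becoming data.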