Web Framework (97)

40 Name: #!/usr/bin/anonymous : 2008-02-05 20:06 ID:Heaven

>>37
How about learning a few things about the language before you start bashing it?

And since in this scenario we're taking user input, let's be a bit lenient with broken input data, too. Because, you know, Python allows you to do that.

>>> import codecs
>>> f = codecs.open('test.txt', encoding='utf-8')
>>> print sorted(f.readlines())
[u'PAGAN\n', u'PORN\n', u'PRALINE\n', u'TANK\n', u'THEME\n', u'TITMOUSE\n', u'TOUHOU\n', u'YARN\n', u'ZEBRA\n', u'\xdeORN\n']

Oh wow, imagine that. I got Unicode data out of it, without having to screw around with .decode() on every damn string.

Now supposing the file has a couple of broken characters in it, I could add errors='replace' to the open() call, and I'd get back Unicode data with the (standard) Unicode replacement character instead of garbled crap. Not the ideal solution, but the ideal solution would be for nobody to produce invalid characters in the first place. Ignoring broken characters doesn't make them go away; handling them properly does, and as a bonus, if you want to let your users know that their data might be corrupt, you can do that. Not so if you're just shoveling raw byte strings around.
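For anyone reading this on a newer Python: in Python 3 the same idea is built into open() itself, no codecs module needed. Here's a minimal sketch (the file name and contents are made up for the demo; the 0xFE byte is deliberately invalid UTF-8):

```python
# Write a test file containing one byte (0xFE) that is invalid in UTF-8.
with open('test.txt', 'wb') as f:
    f.write(b'TANK\n\xfeORN\n')

# errors='replace' swaps each undecodable byte for U+FFFD,
# the standard Unicode replacement character, instead of raising.
with open('test.txt', encoding='utf-8', errors='replace') as f:
    lines = sorted(f.readlines())

print(lines)  # the bad byte shows up as '\ufffd', not as mojibake
```

You still get real Unicode strings out, and you can grep for u'\ufffd' afterwards if you want to warn the user that something in their input was mangled.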

And if you really have no idea what encoding a file is using, try this: http://chardet.feedparser.org/

> Read my post again. I didn't say anything about blindly doing anything: I actually said the exact opposite.

How is your statement -- "the way out ISN'T to just be more careful- to just try harder. It's to stop worrying about this crap altogether." -- not equivalent to "don't bother to handle character encodings"?
