Web Framework (97)

1 Name: #!/usr/bin/anonymous : 2008-01-14 05:50 ID:EkVUNjls

What should I use for my web application that isn't PHP? I would have gone straight to mod_perl a few years ago, but now there are so many options. Is there anything better?

2 Name: dmpk2k!hinhT6kz2E : 2008-01-14 06:03 ID:Heaven

There are so many -- good luck!

It really depends on what you want to do and what you want to invest. Do you seek good documentation, longevity, ease of deployment, scalability, or explicit control? Python and Ruby in particular have a small army of frameworks that are scattered all about these axes.

The two big ones at the moment appear to be Rails and Django. Other notables are Pylons, Catalyst and Seaside.

mod_perl was never a framework, but can be used as the basis of one. It's a bit like mod_php not being CakePHP.

3 Name: #!/usr/bin/anonymous : 2008-01-14 08:16 ID:Heaven

Ruby on Rails is nifty, but it's pretty slow.
Pylons is good. I really like it, and I recommend it to anyone who wants a framework that takes care of all the technical stuff without taking control away from you; but you really have to know some Python and be willing to write code in order to make decent use of it.

4 Name: #!/usr/bin/anonymous : 2008-01-14 09:34 ID:9ONFgUdb

>>3

>but you really have to know some Python and be willing to write code in order to make decent use of it.

Frameworks are just like Java: they create dumb programmers. Do your own damn work and learn more; you get more control, you can dodge potential problems in the framework more easily, and you can support all of your code for the client.

I liked what >>3 said about Pylons, though; that's what a framework should do, if anything.

I don't like these big frameworks; it's like we're moving into a time where programmers just write do_guestbook() or do_messageboard(). They're not really evolving that way.

In my opinion, abstract languages or frameworks like these create dumbed-down programmers. I wouldn't even call them programmers; I'd call them scripters.

5 Name: #!/usr/bin/anonymous : 2008-01-14 10:27 ID:EkVUNjls

Framework perhaps was the wrong word. What I meant was simply scripting/programming languages.

I'm looking at mod_perl mainly because of its easy-to-use page caching functions, which PHP has no support for. Not even the larger projects (MediaWiki, WordPress, etc.) have correct support for conditional GETs.
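
The kind of thing I mean takes only a few lines once the hooks are there. A rough CGI-style sketch in Python (the handler shape and names are made up; the 304 logic is the point):

import os, sys

def serve(page_body, last_modified):
    # last_modified is an HTTP date, e.g. 'Sat, 12 Jan 2008 10:00:00 GMT'
    if os.environ.get('HTTP_IF_MODIFIED_SINCE') == last_modified:
        # the client's cached copy is current; send headers only
        sys.stdout.write('Status: 304 Not Modified\r\n\r\n')
        return
    sys.stdout.write('Last-Modified: %s\r\n' % last_modified)
    sys.stdout.write('Content-Type: text/html\r\n\r\n')
    sys.stdout.write(page_body)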

But I'm willing to take a look at anything with good performance. By the way, are there any benchmarks covering the current languages (Ruby, PHP, Python, Perl, etc.)? Every one I've found seems horribly outdated.

Thanks.

6 Name: dmpk2k!hinhT6kz2E : 2008-01-14 17:17 ID:Heaven

> But I'm willing to take a look at anything with good performance.

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=all

I decided to write a mini-framework for myself in Lua. Performance was only one reason, though, because in practice the largest bottleneck by far will be the database (and memcached, if you use it), followed by templating.

I generally recommend FastCGI over mod_[whatever] because it isn't tied to a single HTTP server, has better isolation (so you don't get the threading issues you see with mod_php under Apache2's MPM Worker), and it scales just as well. However, I vaguely recall there being problems with conditional GETs, so do some research on that first.

7 Name: #!/usr/bin/anonymous : 2008-01-14 17:25 ID:j/yqEPJq

>>5

You can drop Ruby off the list immediately if you're looking for speed, at least. The others are at least comparable in speed, depending of course on the particular task.

8 Name: #!/usr/bin/anonymous : 2008-01-14 21:19 ID:0Illg7NJ

web.py

9 Name: #!/usr/bin/anonymous : 2008-01-15 00:38 ID:Heaven

>>8
Is fugly. Aaron Swartz half-assed his way through the code just so he could dump something off onto Reddit and make a lot of money on it, and then flake out and leave the Reddit guys on their own to maintain it. He did the same with a couple of other sites as well.

Perhaps one of web.py's most dubious benefits is you can rewrite your app without it fairly easily if you decide later that you don't need it. (Incidentally, I don't think Reddit is even using web.py anymore.)

10 Name: #!/usr/bin/anonymous : 2008-02-03 02:14 ID:fMQv8GHe

>>7

It's not Ruby, it's Rails. Repeat after me: Rails is a slow pig and Ruby is not. Have a look at Merb if you're genuinely interested and not just regurgitating some meme.

Anyway, even if Ruby itself were slower executing than Python and Perl, there are two major factors that preclude execution speed alone from eliminating Ruby: YARV and Caching.

I'll leave Ruby 1.9/2.0/YARV/Rubinius/JRuby for the truly interested, but just know that within 2008, Ruby's raw execution speed will be equal to or surpass that of Python.

Of course if you just cache properly in the first place, execution speed won't mean dick, and then you can spend your time optimizing your web (nginx++) and proxy servers or the like.

So.. with all that said, Ruby really is a fucking blast to develop with. :-)

11 Name: #!/usr/bin/anonymous : 2008-02-03 02:51 ID:VBqqpniK

Arc is wonderful. I'm wary of new lisps because I've been spoiled by the good ones, but Arc is positively wonderful. Think of it as a lisp designed specifically for web applications.

12 Name: #!/usr/bin/anonymous : 2008-02-03 03:25 ID:Heaven

>>11

For web applications that use <table>s and do not handle character sets?

13 Name: #!/usr/bin/anonymous : 2008-02-03 04:01 ID:Heaven

>>12
It can't be that bad??

14 Name: dmpk2k!hinhT6kz2E : 2008-02-03 08:33 ID:Heaven

> Of course if you just cache properly in the first place, execution speed won't mean dick

Only for small sites with no dynamic content.

> optimizing your web (nginx++) and proxy servers or the like.

A microoptimization in the wrong place.

I'll never understand most web developers. They're like people who put VTEC stickers on a hatchback.

15 Name: #!/usr/bin/anonymous : 2008-02-03 09:16 ID:Heaven

>>14

>A microoptimization in the wrong place.

Caching and front-end optimization are critically important; don't even try to say otherwise. And anyway, don't get so tripped up on that one comment. I was just using it as an example of not being execution-bound (which nobody in the real world is...); you could easily replace web/proxy in that sentence with database and memcached.

Me? I'll never understand Asperger syndrome.

16 Name: #!/usr/bin/anonymous : 2008-02-03 12:29 ID:A7jyxFQb

>Only for small sites with no dynamic content.

That's a misconception. A site doesn't have to be 100% static to benefit from proper caching. For instance, say you have some information that updates hourly. Do you follow the dmpk2k strategy of declaring caching a waste of time and hit the system on every request, or do you follow a sensible strategy of properly setting the cache parameters, so that caches know to keep their own copy for the correct period of time?
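
For the hourly example, proper cache parameters are just two headers. A sketch in Python (CGI-style; the helper name is made up):

import time

def hourly_cache_headers(period=3600):
    # expire at the top of the next hour, so every cache rolls over together
    now = int(time.time())
    expires = (now // period + 1) * period
    print 'Cache-Control: public, max-age=%d' % (expires - now)
    print 'Expires: ' + time.strftime('%a, %d %b %Y %H:%M:%S GMT',
                                      time.gmtime(expires))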

17 Name: #!/usr/bin/anonymous : 2008-02-03 13:12 ID:Heaven

>>13

> Which is why, incidentally, Arc only supports Ascii.
> ...
> Arc embodies a similarly unPC attitude to HTML. The predefined libraries just do everything with tables.

http://www.paulgraham.com/arc0.html

18 Name: #!/usr/bin/anonymous : 2008-02-03 18:05 ID:Heaven

>>17

Well, from an otherwise respectable source, that just seems like an odd and IMHO wrong set of decisions. One of his motivations is to not get in the way of producing working HTML, and yet a bunch of useless tables is supposed to be better than pure semantic structure!?

19 Name: dmpk2k!hinhT6kz2E : 2008-02-03 18:18 ID:Heaven

>>15
Reread what you're replying to in >>14, please. That specific line is about Nginx. Do you seriously think Nginx as a reverse proxy for mongrels driving some Ruby framework is going to make one whit of difference?

Except that a lot of people in the Ruby webapp community get in a lather about it. Methinks they've never run anything beyond a minuscule site -- or written any C.

>>16
Is that what passes for dynamic? It's not changing per request -- if you dumped it to a static HTML file it would behave the same. In that case of course Squid and Cache-Control can help.

Back in the real world, where frameworks are usually used to change what's served to each visitor, caching won't be so great a help. You can cache images and CSS all you want, but serving them wasn't expensive on the CPU in the first place compared to the actual webapp.

20 Name: #!/usr/bin/anonymous : 2008-02-03 18:53 ID:Heaven

>>19

>Do you seriously think Nginx as a reverse proxy for mongrels driving some Ruby framework is going to make one whit of difference?

I have that exact setup here, what are you asking?

>Back in the real world, where frameworks are usually used to change what's served to each visitor,

Are they, really? Pick any site from Alexa's top 100 and see that the vast majority of those pages aren't customized. In fact most of the sites make an extreme effort to avoid customization (no usernames, generic links, etc.).

>caching won't be so great a help.

So what happens when we're serving 1000 req/s and we cache half of a page in memory, for even as little as a single second? What did we just do to our execution time (a.k.a. server load)? ;-)
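
To make that concrete, here's the one-second cache as a Python decorator (a sketch; render_sidebar is a hypothetical stand-in for the expensive half of the page):

import time

def microcache(ttl=1.0):
    def wrap(fn):
        cache = {}  # maps args -> (expiry, value)
        def cached(*args):
            hit = cache.get(args)
            if hit is not None and hit[0] > time.time():
                return hit[1]  # still fresh; skip the real work
            value = fn(*args)
            cache[args] = (time.time() + ttl, value)
            return value
        return cached
    return wrap

@microcache(ttl=1.0)
def render_sidebar():
    time.sleep(0.05)  # stand-in for an expensive render
    return '<div>...</div>'

At 1000 req/s, render_sidebar() now runs once a second instead of a thousand times.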

21 Name: dmpk2k!hinhT6kz2E : 2008-02-03 19:54 ID:Heaven

> I have that exact setup here, what are you asking?

Use top to determine the difference between the HTTPD and the webapp. Nginx is probably hovering somewhere around 0%. I don't understand why many Ruby webapp developers care so much about their reverse proxy [Apache/Lighttpd/Nginx/whatever it is now] when it won't make a difference.

> Pick any site from Alexa's top 100 and see that the vast majority of those pages aren't customized.

I can't argue with that. At that scale everything changes -- they usually have their own custom frameworks (probably Jaaaavaaaa J2EE enterprisey), the most heavily hit pages are static, they use a CDN, and so on. Ignoring bandwidth, one reason they have to do that is because execution time matters, although it's probably their databases they're worried about.

> So what happens when we're serving 1000 Req/s and we cache 1/2 of a page in memory

A good idea, as is using memcache and the whole spiel. Your execution time will still mean something though -- the load on app servers comes from somewhere. Also, with a slower language some problems become less feasible. I saw a presentation where a person was using divide-and-conquer in a Google Maps mashup, and had to rewrite that part in C. Wouldn't it be nice if that wasn't necessary?

Actually, my biggest beef with MRI (and presumably YARV) isn't that it's slow -- the real bottlenecks in webapps lie elsewhere, particularly the problem of scaling the database. No, I take issue with the GC marking objects directly, thus preventing the OS from using copy-on-write to any real effect. So you're stuck with ten or so mongrels per box, because few pages are shared across processes.

22 Name: #!/usr/bin/anonymous : 2008-02-03 20:20 ID:VBqqpniK

The whole complaint about <table>s and unicode is stupid. The programmer doesn't deal with either of those things in Arc. The fact that the programmer has to deal with those things in other languages points out what's stupid about those languages, and isn't in fact exposing anything about Arc.

23 Name: #!/usr/bin/anonymous : 2008-02-04 00:47 ID:j/yqEPJq

>>22

You... you aren't making enough sense for me to even start responding to that.

Could you try to explain that one more time?

24 Name: #!/usr/bin/anonymous : 2008-02-04 02:05 ID:Heaven

>>21

>I don't understand why many Ruby webapp developers care so much about their reverse-proxy [Apache/Lighttpd/Nginx/whatever it is now] when it won't make a difference?

Again, try not to stumble on the Nginx thing. I use it for content expiration, url rewriting, and output compression, nothing more. And for me at least, those three things make a huge difference..

25 Name: #!/usr/bin/anonymous : 2008-02-04 21:33 ID:Heaven

>>23 The programmer of an application doesn't write HTML in Lisp languages. They don't use templates either. They generally don't deal with unicode transformations, and they don't worry so much about such low-level things.

Did you understand that?

A Python programmer has to be aware that unicode text transformation occurs transparently and automatically, and has to remember to normalize and sanitize inputs. A PHP programmer has to be aware of the various ini-file magicks and has to be diligent about making sure the HTML tags balance.

A lisp programmer doesn't do these things. The people complaining about <table> and unicode think that, because they deal with these things every day, they need to deal with them. They don't; their programming language is just stupid. Lisp is better because you can make it less stupid.

Take tables, for example: you emit effectless semantics with div and class and then "pretty those things up". That prettying up takes time away from building your app. Using tables gives you your app faster, and you can still "pretty those things up".

Or take unicode: When someone posts a unicode string, why do you care that it's unicode? You're either going to save it, display it, or ignore it. You only care if it's going to cause special effects so simply escape it and move on. You need to elipsize it? Make a utf8_elipse function. You don't need the language to automatically transform unicode for you in order to deal with unicode strings.
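
For concreteness, the whole of that utf8_elipse is only a few lines. A sketch in Python: truncate a UTF-8 bytestring to a byte budget without splitting a multibyte character:

def utf8_elipse(s, n):
    # s is a UTF-8 bytestring; keep at most n bytes, then append '...'
    if len(s) <= n:
        return s
    cut = n
    while cut > 0 and ord(s[cut]) & 0xC0 == 0x80:
        cut -= 1  # back over UTF-8 continuation bytes (10xxxxxx)
    return s[:cut] + '...'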

A lisp programmer writes these things if they need to. If it comes up twice, they macro it. If it comes up zero times, they don't bother.

26 Name: #!/usr/bin/anonymous : 2008-02-04 22:57 ID:Heaven

>>25

>Using tables gives you your app faster, and you can still "pretty those things up".

And that's why we can't have nice things.......

27 Name: #!/usr/bin/anonymous : 2008-02-05 03:52 ID:Heaven

>>25
"What" to the 50th power.

28 Name: #!/usr/bin/anonymous : 2008-02-05 04:12 ID:Heaven

>>27 "What" will have to be qualified. I'm not stuttering. If you don't understand something, you will have to say what in particular.

29 Name: dmpk2k!hinhT6kz2E : 2008-02-05 06:28 ID:Heaven

> When someone posts a unicode string, why do you care that it's unicode?

I've found that sorting and shortening strings is useful. I like my regex working too.

30 Name: #!/usr/bin/anonymous : 2008-02-05 12:25 ID:vwnO7WC3

>>25

> Or take unicode: When someone posts a unicode string, why do you care that it's unicode? You're either going to save it, display it, or ignore it.

Or, you know, process it. Like you do in a programming language. You take input, and you process it, and you produce output. And you can't process a string without understanding its character encoding.

Look, you can ignore the character set in every other language too. You don't have to know it. This is not a feature. Arc or Lisp is not better than everybody else because you can only deal with the equivalent of char pointers. It just means you're on the same level as C code.

> Take tables for example: You emit effectless semantics with div and class and then "pretty those things up". That prettying up takes time away from buildin your app. Using tables gives you your app faster, and you can still "pretty those things up".

In a similar vein, you can output shitty HTML in every other language, too. But that's not really a feature either, is it?

31 Name: #!/usr/bin/anonymous : 2008-02-05 12:51 ID:A7jyxFQb

What? Using tables doesn't give you the app any faster than leaving it in unstyled divs. Additionally it makes it harder later on when you do want to reorder the blocks, and it slows down rendering time.

And indeed, without knowing the encoding there isn't much you can do with a string. You can't even find the first character in it without knowing the encoding. You can't even uppercase it or lowercase it. You can't reverse it (but then again who does, honestly...)

If a language is going to remove the need to think about encodings, the best way for it to go about it is to expose everything as unicode characters in the first place. Java is halfway there, but even a String in Java may have a length() different from the count of code points in the string. :-/
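
(A Python 2 build that stores strings as UTF-16 internally -- a "narrow" build; whether yours is narrow or wide depends on how it was compiled -- has the same wart:

>>> s = u'\U0001d11e'  # MUSICAL SYMBOL G CLEF, outside the BMP
>>> len(s)             # counts UTF-16 code units, not code points
2

so its len() carries the same caveat as Java's length().)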

32 Name: #!/usr/bin/anonymous : 2008-02-05 17:26 ID:p5s4fl90

>>31
JavaScript is already there. Having to write code from scratch to convert strings to Shift_JIS for http://hotaru.thinkindifferent.net/trip.html, because no one had ever had to do it in JavaScript before, was a little annoying...

33 Name: #!/usr/bin/anonymous : 2008-02-05 17:40 ID:Heaven

>>31 Wrong.

(tab (map row data))

is shorter than:

print "<div class=\"data\">"
for i in data:
print "<div class=\"row\">"
for j in i:
print "<div class=\"col\">",j,"</div>"
print "</div>"
print "</div>"

And no, it doesn't slow rendering time. All browsers render tables faster than float or grid views made from divs.

> without knowing the encoding there isn't much you can do with a string.

You're chasing phantoms here. The client sends data using ISO-8859-1 or UTF-8. Your <form> has a hidden input field called "charset_detect" whose value has a different byte representation in ISO-8859-1 than in UTF-8 (like &nbsp;). You then use this information to upcode ISO-8859-1 into UTF-8. That sounds like library code to me.
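
A sketch of that library code in Python (names are made up; the point is that &nbsp; arrives as the byte 0xA0 in ISO-8859-1 but as 0xC2 0xA0 in UTF-8):

def upcoder_for(probe):
    # probe is the raw submitted bytes of the hidden charset_detect field
    if probe == '\xc2\xa0':
        return lambda s: s  # the browser already sent UTF-8
    if probe == '\xa0':
        return lambda s: s.decode('iso-8859-1').encode('utf-8')
    raise ValueError('unrecognized charset probe: %r' % probe)

upcode = upcoder_for('\xa0')   # this form came back as ISO-8859-1
print repr(upcode('caf\xe9'))  # -> 'caf\xc3\xa9', i.e. UTF-8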

Once there, you can normalize it if you like. You can also compose the characters. This requires a database of codepoints so it too sounds like library work.

You can compare the string as bytes, and you can compare a substring. This is easier with bytes because you don't have surprise recodings!

Collating? Sorting? Upcasing? Downcasing? These are always library routines because they involve a database. They don't need to be built into the language.

Look at Java's mistake as a prime example of how not to do unicode. By assuming unicode would always fit into 16 bits, they made it look like it's going to work most of the time, but fail in subtle, hard-to-test ways.

Character sets were a bad hack that interoperability layers need to deal with. Those interoperability layers belong in library code, so they can be improved separately without requiring secret knowledge of the object internals -- and without painting yourself into a corner like Java and Win32 did.

34 Name: #!/usr/bin/anonymous : 2008-02-05 17:46 ID:i+ITJfDJ

>>29

> I've found that sorting and shortening strings is useful. I like my regex working too.

Good. Let me know when there's a language with unicode support that does that.

Here's a hint: Where does ÞORN get sorted? After Z? Between T and U? After TH but before TI? Mix with TH? Sorted as Y? Mixed with P? Transliterated as TH? Transliterated as T?

Because you think your language supports unicode, you write code that doesn't handle these cases. Your program will suddenly generate an error when faced with this data, and your user will be unhappy.

On the other hand, by simply treating everything as bytes you know exactly how involved you are and need to be. You can avoid algorithms that depend on sorting characters (which is locale-specific) and you can avoid algorithms that change case (which is also locale-specific). That's because you're supposed to be avoiding these things anyway. Your language has made you lazy and stupid, and the way out ISN'T to just be more careful, to just try harder. It's to stop worrying about this crap altogether.

If someone ever figures out how to do unicode right, or if this were an easy thing, I could possibly agree, but it isn't. Unicode is really fucking hard, and nobody has gotten it right.

35 Name: #!/usr/bin/anonymous : 2008-02-05 18:22 ID:Heaven

> Your program will suddenly generate an error when faced with this data and your user will be unhappy.

Hmm. Let's see here...

>>> a = [u'ÞORN', 'PORN', 'YARN', 'ZEBRA', 'TOUHOU', 'TANK', 'PRALINE', 'PAGAN', 'THEME', 'TITMOUSE']
>>> a.sort()
>>> for i in a: print i

...
PAGAN
PORN
PRALINE
TANK
THEME
TITMOUSE
TOUHOU
YARN
ZEBRA
ÞORN

Oh look, no error.

And suppose you wanted to make an alphabetical index:

>>> for i in sorted(set(i[0] for i in a)): print i

...
P
T
Y
Z
Þ

Still no error, and it works fine. Now if you were just blindly manipulating strings, as you suggest, you would have a problem, because you'd be dumping the first byte of a multi-byte character. But you're right! Why would people need to see the entire letter anyway? They can just guess.

36 Name: #!/usr/bin/anonymous : 2008-02-05 18:28 ID:6za5PNDF

>>34
At least in Python and Ruby, you can redefine the comparison operator for any object, strings included. Whatever you put in there affects all sorting operations. So you can put thorn wherever you want.
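
For example, a key function implementing one of >>34's choices (thorn mixed in with TH) -- a sketch, and the collation rule itself is just one arbitrary pick:

def thorn_as_th(s):
    # fold thorn into 'TH'/'th' before comparing
    return s.replace(u'\u00de', u'TH').replace(u'\u00fe', u'th')

print sorted([u'\u00deORN', u'TANK', u'TITMOUSE'], key=thorn_as_th)
# [u'TANK', u'\xdeORN', u'TITMOUSE']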

And in Perl, I've had to get UTF-8 straight in order to process database output. I had to convert certain columns to uppercase. And we have another app where the character-length (as opposed to byte-length) of strings is very important.

And I just dabble in internationalization, really. For all I know, there are more elegant solutions than the one I used. And you might ultimately be right; in some cases, you don't care about the encoding, you're just shunting bits to and fro. I, for one, want to make sure my app always knows what kind of data it's dealing with.

37 Name: #!/usr/bin/anonymous : 2008-02-05 19:25 ID:i+ITJfDJ

>>35 Þ sorts differently in different languages.

> Now if you were just blindly manipulating strings, as you suggest, you would have a problem,

Read my post again. I didn't say anything about blindly doing anything: I actually said the exact opposite.

>>> print sorted(file("test.txt","r").readlines())
['PAGAN\n', 'PORN\n', 'PRALINE\n', 'TANK\n', 'THEME\n', 'TITMOUSE\n', 'YARN\n', 'YOUHOU\n', 'ZEBRA\n', '\xc3\x9eORN\n']

What's the encoding of this file again? I know, let's assume utf-8!

>>> for i in file("test.txt","r").readlines(): print i.decode('utf-8')

Well that seems to work. Let's just hope users never actually control the contents of test.txt:

>>> for i in file("test.txt","r").readlines(): print i.decode('utf-8')
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

It happens all the time. People do a lot of work, then some user posts actual unicode where it isn't expected and through the magic of transcoding the entire database is hosed.

38 Name: #!/usr/bin/anonymous : 2008-02-05 19:35 ID:i+ITJfDJ

>>36

> At least in Python and Ruby, you can redefine the comparison operator for any object, strings included. Whatever you put in there affects all sorting operations. So you can put thorn wherever you want.

Holy crap! What does redefining the comparison operator have to do with what I'm saying?

Sorting is used for lots of things, not just presentation. For example, it's often used for uniqueness verification(!). There it doesn't matter what order is actually produced, just so long as it's stable. Having the built-in sort() operator depend on the locale settings means that you can't safely share sorted data between user sessions.

That sounds like an almost invisible bug that nobody would ever notice.

And none of you have even suggested a single reason why it needs to be in the language. I've given plenty of counterexamples, but the best you've got is "oh, you might want to know what the first character is". That's stupid. You don't need all these bugs and problems and danger areas just to get that.

> And I just dabble in internationalization, really. For all I know, there are more elegant solutions than the one I used. And you might ultimately be right; in some cases, you don't care about the encoding, you're just shunting bits to and fro. I, for one, want to make sure my app always knows what kind of data it's dealing with.

Really, I'm only arguing that this is complicated. There isn't anything easy about "adding unicode", and the expectation is ridiculous. Everybody is perfectly fine using languages with no unicode support, and the people who do use unicode-enabled languages frequently have subtle bugs that they don't notice until much later. Getting it right is hard -- and I say that because nobody has gotten it right yet; there isn't a language whose unicode support isn't filled with implementation-specific gotchas.

It's not just languages either -- protocols do it too. HTTP insists the default charset is iso-8859-1, even though MIME says text/plain defaults to us-ascii. That means you can't save an HTML document unless your filesystem is character-set aware. How stupid is that!?

Seriously. Really smart people fuck this up totally. Blaming Arc for not getting it right in a version-0, when it's hard and not altogether important to begin with, is just missing the point.

I swear, the first language to support unicode correctly will xor all of the code points with 0x1234 just to make sure values that aren't byte-packed will actually get tested...

39 Name: dmpk2k!hinhT6kz2E : 2008-02-05 20:00 ID:Heaven

>>34

> Good. Let me know when a language supports unicode that does that.

Because almost nothing out there supports Unicode 5.0 100%, we should stick to octets? Is this the Better is Better philosophy, how not to win big?

> That's becuase you're supposed to be avoiding these things anyway.

The things you've just so casually dismissed are things that I've had to implement the last project I worked on. I really can't imagine "we're supposed to be avoiding these things" will go over well with management when it's part of the spec; the site is supposed to be international, after all.

> To blame arc for not getting it right on a version-0 when it's hard, and not altogether important to begin with, it's just missing out.

Arc is built on MzScheme. MzScheme supports Unicode, but Arc does not. What is wrong with this picture?

40 Name: #!/usr/bin/anonymous : 2008-02-05 20:06 ID:Heaven

>>37
How about learning a few things about the language before you start bashing it?

And since in this scenario we're taking user input, let's be a bit lenient with broken input data, too. Because, you know, Python allows you to do that.

>>> import codecs
>>> f = codecs.open('test.txt', encoding='utf-8')
>>> print sorted(f.readlines())

[u'PAGAN\n', u'PORN\n', u'PRALINE\n', u'TANK\n', u'THEME\n', u'TITMOUSE\n', u'TOUHOU\n', u'YARN\n', u'ZEBRA\n', u'\xdeORN\n']

Oh wow, imagine that. I got Unicode data out of it, without having to screw around with .decode() on every damn string.

Now supposing the file has a couple of broken characters in it, I could add errors='replace' to the open() call, and I'll get back Unicode data with the (standard) Unicode replacement character instead of garbled crap. Not the ideal solution, but the ideal solution would be for nobody to have invalid characters in the first place. Ignoring broken characters doesn't make them go away, but handling them properly will, and as an added bonus, if you want to let your users know that their data might be corrupt, you can do that. Not so if you're just shoveling raw bit strings around.

And if you really have no idea what encoding a file is using, try this: http://chardet.feedparser.org/

> Read my post again. I didn't say anything about blindly doing anything: I actually said the exact opposite.

How is your statement -- "the way out ISN'T to just be more careful- to just try harder. It's to stop worrying about this crap altogether." -- not equivalent to "don't bother to handle character encodings"?

41 Name: #!/usr/bin/anonymous : 2008-02-05 20:26 ID:i+ITJfDJ

> The things you've just so casually dismissed are things that I've had to implement the last project I worked on. I really can't imagine "we're supposed to be avoiding these things" will go over well with management when it's part of the spec; the site is supposed to be international, after all.

What are you saying? Either your language has a locale-sensitive sort() or it doesn't. Both are wrong for something; at least when you don't pretend bytes are bignums, sort() is simple and fast.

As soon as you need a sort() that can actually handle multiple languages (a sort for presentation) you need one that accepts locales as well, and frankly your language doesn't have such a beast so you have to write it yourself anyway. You simply can't use the builtin sort() operator for this, so why have a builtin sort() that is slower than it has to be?

Case-folding is another one. Why bother having a str.islower() that gives the wrong answer for unicode? Why bother having a str.lower() that is wrong for unicode?

> Arc is built on MzScheme. MzScheme supports Unicode, but Arc does not. What is wrong with this picture?

That MzScheme doesn't support unicode correctly? What's your problem?

You want arc to support something poorly just because mzscheme does?

What's the point of striving for the most powerful language if you're just going to shit on it with something so obviously important and so obviously complicated that nobody seems to be able to get it right?

42 Name: #!/usr/bin/anonymous : 2008-02-05 20:38 ID:i+ITJfDJ

>>40

> Oh wow, imagine that. I got Unicode data out of it, without having to screw around with .decode() on every damn string.
>>> import codecs
>>> f = codecs.open('test.txt', encoding='utf-8')
>>> print sorted(f.readlines())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/codecs.py", line 626, in readlines
return self.reader.readlines(sizehint)
File "/usr/lib/python2.5/codecs.py", line 535, in readlines
data = self.read()
File "/usr/lib/python2.5/codecs.py", line 424, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-58: invalid data

What was your point exactly?

How about you learn a few things about the language before you support it?

> Now supposing the character has a couple of broken characters in it, I could add errors='replace' to the open() call

So you're saying that you'd rather destroy user's data than do the right thing?

Seriously, what if I had put an 0xC2 0xA8 in there instead? That's valid in both utf-8 and iso-8859-1 and won't raise an error. You'll just get gibberish.

> and I'll get back Unicode data with the (standard) Unicode replacement character, instead of garbled crap.

I.e. You garble the entire string.

> Not the ideal solution, but the ideal solution would be for nobody to have invalid characters in the first place.

No, the ideal solution depends on why you're transcoding it in the first place.

> Ignoring broken characters doesn't make them go away, but handling them properly will,

There are no broken characters. The example I gave was simply in an unknown encoding. You would rather destroy the user's data, and I find that utterly sophomoric.

> and as an added bonus, if you want to let your users know that their data might be corrupt, you can do that. Not so if you're just shoveling raw bit strings around.

Yeah. Shoveling arrays of bignums around that you can't do anything with is so much better.

43 Name: ☆ゆたか☆ : 2008-02-05 20:54 ID:6EmpZiO7

Good morning!!

44 Name: dmpk2k!hinhT6kz2E : 2008-02-05 21:12 ID:Heaven

> That MzScheme doesn't support unicode correctly? What's your problem?

I would rather have some support than no support. That's where we differ. You're welcome to perfection, but you're using a dynamically-typed language and I doubt you're formally proving your code.

Having said that, it looks like Arc does provide for Unicode, at least so long as it's sitting on MzScheme. Good enough.

> Either your language has a locale-sensitive sort() or it doesn't.

Or maybe it works most of the time for the expected data. I'll make do with a few customers sometimes seeing partially unsorted data rather than all customers seeing completely unsorted data or throwing out a useful site feature. Almost all software development is about good enough.

> What's the point of striving for the most powerful language

How old is Ikarus or Termite Scheme? Clojure? Factor or Cat? They're more interesting than Arc and all are younger.

45 Name: #!/usr/bin/anonymous : 2008-02-05 22:01 ID:Heaven

> How about you learn a few things about the language before you support it?

First you randomly picked an encoding (utf8), and then claimed that, because you picked the wrong encoding, Python's unicode implementation is broken? In a perfect world there would only be one representation of the data on disk, but we have to deal with a lot of different encodings. That's when you make a decent effort to guess, and give the user a prompt in the cases where you can't identify the data.

You seem to be assuming that users never want to read the data, and that all you're doing is tunneling it from one place to another, which is perfectly fine if all you're doing is writing a proxy script or the like. However, at some point the user is going to want to look at the data, and there's no way at all to present it if you don't know what format it's in. After all, if you don't know the format, you can't manipulate anything, and unless your goal is to reimplement 'dd', what use is your program?

> So you're saying that you'd rather destroy user's data than do the right thing?

If it's supposed to be utf8 but it's got broken characters, it's already been destroyed. Unless your definition of the right thing is a file-copy command that writes out exactly what it reads in, you're making assumptions about the input data that you can't verify without handling the encoding to some degree. What if the input is actually utf-16? Then suddenly all your functions that iterate line by line will destroy the data, because the EOL is two bytes, and depending on how well you're handling the rest of the data, you might end up with nothing at all -- since utf-16 contains null bytes within the string itself.

You still have yet to supply anything to support your argument. All you're doing is hand-waving.

46 Name: #!/usr/bin/anonymous : 2008-02-05 23:32 ID:Heaven

>>45

> You still have yet to supply anything to support your argument. All you're doing is hand-waving.

I was going to say the same thing to you.

You haven't demonstrated what having unicode in the language is good for.

I've demonstrated that it's a good way to hide bugs.

>>44

> I'll make do with a few customers sometimes seeing partially unsorted data rather than all customers seeing completely unsorted data or throwing out a useful site feature.

You don't get it. You're making do with that because your language's sort operator has to work on its strings, and its sort operator cannot be useful to humans. You need a human-friendly sort, and you need to be aware of locales.

This is complicated stuff. Saying "mzscheme has it, so Arc has it" is naive. mzscheme's unicode support isn't better than anyone else's, and it causes problems.

47 Name: dmpk2k!hinhT6kz2E : 2008-02-05 23:41 ID:Heaven

> You don't get it.

Probably not. My understanding so far is that you're arguing that I should make no attempt to sort at all until it's perfect. Yes/no?

> You're making do with that because your language's sort operator has to work on its strings and it's sort operator cannot be useful to humans.

Could you rephrase this?

48 Name: #!/usr/bin/anonymous : 2008-02-06 01:58 ID:Heaven

>>47

> My understanding so far is that you're arguing that I should make no attempt to sort at all until it's perfect. Yes/no?

You shouldn't use the language builtin sort-operator when you want a locale-sensitive sort. In POSIX C, you can use qsort() with strcoll() and setlocale() to get a locale-sensitive sort. In python this (usually) means mapping locale.strxfrm first. Perl has a POSIX::strxfrm().
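
For example (a sketch; it assumes the named locale is installed and that the byte strings are in that locale's encoding):

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, 'en_US.UTF-8')
'en_US.UTF-8'
>>> sorted(['zebra', '\xc3\xa9clair', 'apple'])  # plain byte sort: 0xC3 lands after 'z'
['apple', 'zebra', '\xc3\xa9clair']
>>> sorted(['zebra', '\xc3\xa9clair', 'apple'], key=locale.strxfrm)
['apple', '\xc3\xa9clair', 'zebra']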

I'm saying you shouldn't use sorted() if you're printing the result for human beings. And if you insist on doing so anyway, you can't turn around and argue that it's useful to bury unicode in the language semantics when you're not even using it.

> > You're making do with that because your language's sort operator has to work on its strings and it's sort operator cannot be useful to humans.
> Could you rephrase this?

Sure. strxfrm() works on bytes, not unicode strings, in most environments, and yet it is almost certainly what you want when you're printing a sorted list. Unfortunately, the obvious and simple sorted() isn't what you want -- though it is what you want when you need a stable (locale-ignorant) sort for uniqueness detection and other algorithms. That kind of sort, however, doesn't care whether your units are 8 bits or 24 bits; the unicode support in your language is, in this case, a waste of code.

49 Name: dmpk2k!hinhT6kz2E : 2008-02-06 06:25 ID:Heaven

> You shouldn't use the language builtin sort-operator when you want a locale-sensitive sort.

Why shouldn't the language's built-in sort support this?

50 Name: #!/usr/bin/anonymous : 2008-02-06 14:51 ID:Heaven

> Why shouldn't the language's built-in sort support this?

Because a locale-sensitive sort isn't necessarily stable.

More to the point, it's the comparison function that is more dangerous (most sort algorithms simply require a comparator that returns left or right, -1 or 1, etc). A locale-sensitive sort would change its output as soon as your locale changed.

51 Name: #!/usr/bin/anonymous : 2008-02-06 15:41 ID:vwnO7WC3

> The whole complaint about <table>s and unicode is stupid. The programmer doesn't deal with either of those things in Arc. The fact that the programmer has to deal with those things in other languages points out what's stupid about those languages, and isn't in fact exposing anything about Arc.

This was the original claim made. I still haven't seen anything even remotely resembling a justification for this.

All I've seen is some kind of claim that because implementing Unicode is hard, we should all just use byte arrays like we were coding in C. That doesn't really seem like "the programmer doesn't deal with these things", it seems like "the programmer has to implement these things from scratch again and again" or "the programmer has to use clumsy libraries to deal with these things".

That hardly seems like a good choice for a language meant for "exploratory programming".

52 Name: dmpk2k!hinhT6kz2E : 2008-02-06 18:34 ID:Heaven

> Because a locale-sensitive sort isn't necessarily stable.

Many implementations of quicksort aren't stable either -- or even deterministic -- but that hasn't stopped anyone. You can add an index field to force stability if that's desired.
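
(E.g. pair each element with its original index and sort on (key, index); ties then break deterministically:

>>> words = ['banana', 'apple', 'banana']
>>> [w for i, w in sorted(enumerate(words), key=lambda (i, w): (w, i))]
['apple', 'banana', 'banana']

The same trick works with a locale-sensitive key in the first slot.)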

> A locale-sensitive sort would change the output as soon as your locale changed.

Often that's the point. If I know a user speaks Norwegian, either through their preferences or Content-Language, I'd like what they see to be sorted in an order they're familiar with.

If I don't know, well, better an ordering than no ordering.

53 Name: #!/usr/bin/anonymous : 2008-02-06 20:27 ID:Heaven

>>52 So you accept that building unicode into the language didn't help with sorting?

>>51 The programmer has to use clumsy libraries anyway. They just think they don't, because they think having unicode built into the language creates fewer problems when it in fact creates more.

> I still haven't seen anything even remotely resembling a justification for this.

The onus isn't on me to demonstrate the non-usefulness of a non-feature; it's on you to demonstrate the usefulness of a feature. I said Arc lacking unicode doesn't matter, but you seem to think it does.

You think it's easy
You think it lets you do something you couldn't otherwise
You think it's saving you time
You think it's more elegant

Not only have I provided a demonstration that none of these are true, I've gone further and demonstrated that it's a good place to hide bugs and that the usage is often unexpected.

54 Name: #!/usr/bin/anonymous : 2008-02-06 21:24 ID:j/yqEPJq

>>53

None of that has anything to do with "The programmer doesn't deal with either of those things in Arc."

That is a completely different, and completely insane, statement. If it were actually true, it would mean that pretty much the exact opposite of what you're arguing is true.

55 Name: #!/usr/bin/anonymous : 2008-02-07 03:18 ID:Heaven

>>54 No, it has to do with the negative justification for unicode in the language.

Really: if I found myself needing some unicode transformation, I'd write a macro and be done with it. Instead, in a language like Python, you have to find out about the problem, go edit all your uses of sorted(), and apologize to the user for not understanding the full scope of the problem when you decided to use a language like Python.

If you think not having unicode built into Arc causes problems, you're going to have to prove it.

56 Name: #!/usr/bin/anonymous : 2008-02-07 16:09 ID:Heaven

> If I found myself needing some unicode transformation I'd write a macro and I'd be done with it.

In other words, you'd have to deal with it. Manually.

And so "The programmer doesn't deal with either of those things in arc" is still untrue.

57 Name: #!/usr/bin/anonymous : 2008-02-07 17:57 ID:Heaven

>>56 That's your problem? Seriously?

The programmer in lisp sees things like this:

(elipse s)
(prn s)

he doesn't see:

elipse(s.decode('utf8'))
"<table><tr>"+"</tr><tr>".join(map....)."</tr></table>"

That's what I meant by not dealing with it.

Some more examples. I can use:

(utf8-chars s)

and:

(len s)

instead of:

s.decode('utf8').length
do { use bytes; length($s); };

and:

s.encode('utf8').length
do { use utf8; length($s); };

Yeah. Building it into the language seems like a big win.

Meanwhile, pg says he's going to make a case for unicode. It's not important for any of the programs Arc has been used for so far, and it's not a showstopper for any real application anyone is writing. Let's see it done right, rather than repeating the crap that Python, Perl, Java, and, well, everyone else did.

58 Name: #!/usr/bin/anonymous : 2008-02-07 22:01 ID:Heaven

> s.decode('utf8').length
> s.encode('utf8').length

You mean len(s.decode('utf8')) and len(s.encode('utf8')). And if you're actually using unicode in an app, you really ought to be using it end-to-end anyway, which means you're doing your encoding/decoding at input and output -- and that's something Python already gives you with codecs.open().

How often do you need to know how many bytes something is, except when writing something to disk? (Which, incidentally, falls clearly under the category of "output".) Even then, it's not generally necessary.

59 Name: #!/usr/bin/anonymous : 2008-02-08 01:58 ID:Heaven

No! You shouldn't be using it end-to-end! You should be avoiding transformations, because they take time and introduce more places for bugs to hide. Saying len() needs to mean characters instead of storage size because "How often do you need to know how many bytes something is, except when writing something to disk?" is about the stupidest rationalization I've ever seen!

Web apps receive data in byte-escaped utf8 or iso-8859-1/15. Besides the strings you're receiving, why would you bother recoding every other string just to handle these?

Unicode strings just plain aren't useful. They may be convenient for the language designer, but they aren't useful to the programmer, who normally needs only two kinds of strings: byte strings and locale-aware presentation strings. The fact that you can build both with unicode strings isn't a feature of unicode strings.

60 Name: #!/usr/bin/anonymous : 2008-02-08 02:49 ID:Heaven

>>57

You can do that in almost any language ever, if you just use libraries that hide all the hard work from you. That's not a strength or weakness of any language.

61 Name: #!/usr/bin/anonymous : 2008-02-10 12:17 ID:Heaven

The problem with byte strings is that, for any given byte string declared somewhere in the application, nobody knows its encoding, so nobody can convert it into usable characters. So you end up making some class which ties the encoding together with the byte array. Or standardising on some encoding. But if your application is going to be used by more than a single country, what choice of encodings do you have if you want to support them all?

Even wchar_t has an encoding of course, it's just that it's always UTF-16. And UTF-16 still requires two units to encode a single character in some situations; the way it does it happens to be very similar to UTF-8.

I've yet to see a system which plans ahead and uses UTF-32 for strings. It would take more memory, sure... but memory isn't that expensive.

62 Name: #!/usr/bin/anonymous : 2008-02-13 21:46 ID:i+ITJfDJ

>>61 Justify converting it into usable characters.

printf() certainly accepts bytestrings. So does xterm and your-favorite-web-browser-here, so what exactly does one need to convert a bytestring into "usable characters"?

The operations people have mentioned -- sorting, elipsing, and substringing -- vary more with the locale than with the supplied encoding. For cases where you want sorting, or substrings for non-presentation uses (machine algorithms like unique-detection and counting), bytestrings are satisfactory.

Btw, Erlang uses 32-bit integers for strings, and factor uses 24-bit. Even those aren't "big enough" for the eventual case when someone decides a bignum is needed...

63 Name: #!/usr/bin/anonymous : 2008-02-14 12:23 ID:A7jyxFQb

>what exactly does one need to convert a bytestring into "usable characters"?

This is the dumbest question I have heard here yet. Any application which needs to do actual processing on text will need to know where one character stops and another begins. If you don't know what character set is inside your blob of byte[], how do you even find out where the spaces are to break at word boundaries?

64 Name: #!/usr/bin/anonymous : 2008-02-14 14:44 ID:i+ITJfDJ

>>63 That's as stupid as insisting you can't use a void* to refer to an int because you don't know that it is in fact an int. Look at this another way: your text files on your hard disk don't have character-set tagging, and yet you can read from them just fine.

If you're reading utf8 files into strings, you know the bytestring contains utf8. If you're reading shift-jis files into strings, you know the bytestring contains shift-jis. You generally isolate all of your charset and locale awareness into a specific part of your program. You don't need to pepper it all over the fucking place just to call wordwrap(s,72).

wchar_t was bad engineering. It convinced a lot of people that you needed another set of string APIs, and another kind of string. You don't. Your filesystem doesn't support those kinds of strings anyway, so it doesn't really add any features or give you any new expressiveness (or conciseness), but it does introduce strange new places to hide bugs.

The fact remains: unicode support in the language doesn't buy you anything, and costs you a lot. You still need to be aware (as the programmer) of charset conversion at input and output, because that information isn't available from the environment. Two trivial examples that don't exist in reality don't change that. Your isspace() example could simply be called utf8_isspace(), because you still need to know that the input was utf8 anyway.

Maybe this'd be different if the filesystem encoded charset and locale information reliably. It doesn't, though, so you're still tasked (as the programmer) with working primarily in bytestrings, and transcoding explicitly when directed.

65 Name: #!/usr/bin/anonymous : 2008-02-14 16:47 ID:Heaven

>>64

> Your isspace() example could simply be called utf8_isspace() because you still need to know what was inputted was utf8 anyway.

Oh yes, let's hardcode everything to use utf8, and force everyone to use it. That's much better than supporting a wide range of character sets, being flexible, and allowing people to load their existing files without running their entire hard drive through recode, remastering all their CDs and DVDs, and proxying every website they look at.

Not to mention perhaps people want to read an existing SJIS formatted text file when the system they're using defaults to UTF-8. So now you have two encodings to worry about. Are you going to keep your calls to utf8_isspace and sjis_isspace straight? Or will you go insane encoding and decoding manually at every step instead of using plain and SIMPLE Unicode in the backend, and setting input_encoding and output_encoding flags for the I/O subsystem?

66 Name: #!/usr/bin/anonymous : 2008-02-14 19:11 ID:i+ITJfDJ

>>65

> That's much better than supporting a wide range of character sets,

It is.

> and allowing people to load their existing files without running their entire hard drive through recode,

How does that follow? Your unicode-aware language is recoding their files every time you load them, but you don't know what the original coding is. My plan only recodes them when we think it's worth the bother, and hopefully we can be aware that figuring out the charset is part of that bother.

> remastering all their CDs and DVDs,

How does this follow? Your unicode-aware language needs to be aware of all character sets in order to read any of them. Mine doesn't even bother most of the time and reads CDs and DVDs just fine.

> and proxying every website they look at.

You have to do this anyway -- you cannot save an HTML file from the web to your hd without transcoding all the entities to us-ascii or having a charset-preserving filesystem. HTML has a different "default" charset on disk (us-ascii) than it does on the wire (iso-8859-1).

> Not to mention perhaps people want to read an existing SJIS formatted text file when the system they're using ... So now you have two encodings to worry about.

You already have two encodings to worry about. You don't know the charset of the file you're loading because your filesystem doesn't preserve that information reliably.

> Are you going to keep your calls to utf8_isspace and sjis_isspace straight?

Why do you think I have to? Only a very small class of programs doing a very small class of things will ever have to deal with SJIS, let alone any other character set. For a web app I'll only ever see %-encoded utf-8 or iso-8859-1. If I'm writing a text editor, presumably I translate the code points to glyphs that I'll use during rendering.

> setting input_encoding and output_encoding flags for the I/O subsystem

The I/O system only deals with bytes. TCP only deals with bytes. Disks only hold files containing: bytes. If your internal representation is utf8, and your external representation is utf8, why the hell would you transcode at all?

67 Name: #!/usr/bin/anonymous : 2008-02-14 23:14 ID:Heaven

>>64

Look, what you're talking about is how programming languages used to work in the past. Perl 4 was like that. C is still like that.

Turns out, it fucking sucks to do things that way. That is why all languages add unicode support these days. They don't do it because they've been brainwashed by some Unicode conspiracy to design their languages to be horrible. They do it because it makes everything much easier, and everybody knows it, because the alternatives have been tried.

68 Name: #!/usr/bin/anonymous : 2008-02-15 00:50 ID:Heaven

> How does this follow? Your unicode-aware language needs to be aware of all character sets in order to read any of them. Mine doesn't even bother most of the time and reads CDs and DVDs just fine.

What?! By that logic, your brain needs to be aware of all languages in order to understand any of them.

I'm not even going to bother replying to the rest of that nonsense because you're obviously a fucking troll.

69 Name: #!/usr/bin/anonymous : 2008-02-15 13:19 ID:A7jyxFQb

>>68 is quite right.

To cite a concrete example, Java uses UTF-16 as its native string storage, and the extra charsets are an optional part of the install. If you don't install them, everything still works fine unless you happen to run into one of those character sets.

>>64 seems to think that decoding a string in every single function is an efficient way to write code. You do understand that decoding has overhead, right?

I don't really object to UTF-8, though. At the very least it's neither better nor worse than UTF-16, as they both work in more or less the same fashion for characters outside the range that can be represented by a single code unit.

Oh yes, and DVDs work fine because the subtitles are stored as GRAPHICS. I think you will find that real subtitles require knowing the encoding of the subtitle file, whether it's standardised, stored in the file or somewhere else.

70 Name: #!/usr/bin/anonymous : 2008-02-15 15:37 ID:Heaven

>>69 WRONG. Java doesn't use UTF-16. Check again!

It uses a character set that ONLY Java uses or supports. It's based on UTF-16, but some code points have the wrong byte value for UTF-16. If you read a Java binary file as UTF-16 you will destroy data and never know it.

Yet another reason why transparent transcoding is stupid.

> >>64 seems to think that decoding a string at every single function is an efficient way to write code.

64 thinks that decoding and recoding strings all the time is a stupid and mind-blowingly retarded way to write code. Why do you think I'm saying otherwise? Because I'm saying that having transcoding built into the language, transparently, is stupid and worse than useless? Because I'm saying (and demonstrating) that transcoding bugs are hard to locate because they are entirely data-related? Or because having every string be a transcoding unicode string and every I/O operation be a transcoding unicode operation means that EVERY STRING operation and EVERY IO operation is a place where you could have a data-related bug that you might never find?

You don't need to transcode very often. Having it built into the language makes it easier to transcode- when you don't have to, and when you're doing it wrong. It hides bugs, and it doesn't solve problems.

>>68 has failed to indicate a single reason why having the language transcode for you transparently and invisibly is a good thing, and YOU, >>69, are defending him while at the same time saying you're worried about performance. What is wrong with you?

71 Name: dmpk2k!hinhT6kz2E : 2008-02-15 17:12 ID:Heaven

> EVERY STRING operation and EVERY IO operation is a place where you could have a data-related bug that you might never find.

This sounds like the manually- versus automatically-managed memory argument again.

72 Name: #!/usr/bin/anonymous : 2008-02-16 12:24 ID:Heaven

> 64 thinks that decoding and recoding strings all the time is a stupid and mind-blowingly retarded way to write code. Why do you think I'm saying otherwise?

Because you suggested having utf8_* functions to do every single string operation. What do you think UTF-8 is, a chicken? It's an encoding!

73 Name: #!/usr/bin/anonymous : 2008-02-16 12:26 ID:Heaven

> Yet another reason why transparent transcoding is stupid.

Um, Java doesn't do it transparently; you have to specify the charset in almost all situations, unless you specify nothing, in which case it uses the platform default encoding (dangerous in its own way, but not the topic of the conversation).

74 Name: #!/usr/bin/anonymous : 2008-02-16 13:36 ID:ONvOLVru

>>72 I most certainly did not! I said you can use utf8_* functions if you know the content is utf8 and it matters. If it doesn't matter, don't transcode it. Don't even look at it! Most of the things you want to do with a string are the same when treating it as a bytestring. The special cases are locale-sensitive comparison and character-sensitive elipsing/wordwrapping. If you're writing routines to do this over and over again, then yes, you should have it in your language. But if you're not, why are you translating to bignum arrays all the time? Why is substr so slow?

If you think there are other special cases, I'd love to hear about them. Nobody seems to post any of them here.

>>73 "Almost all" situations? I was specifically talking about serialization, but platform-default encoding is a better example.

What exactly is the platform default encoding, anyway? When you save HTML files on a Windows PC, do you convert them to us-ascii? Or do you violate MIME and at least avoid destroying data by converting to the current codepage -- and god forbid the user change it?

On a Linux PC, what exactly is the default coding? Or a Mac?

If there were a meaningful and lossless default coding, it might be useful to operate this way, but as it is, the "default" I/O often simply destroys data, and nobody ever notices, which I think makes it exactly the topic of conversation: unicode hides bugs.

Unicode doesn't solve anything in programming languages, because the messy nonsense is in locale-specific things and system-specific things, and history demonstrates that programming languages can't really solve either of those. Because of that, I contend that unicode in the language is simply a place to hide bugs and unexpected gotchas, for no real benefit.

http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

brought up this exact topic, although it comes to very different conclusions. The author suggests you pepper your code with NumberFormatInfo.Invariant and StringComparison.CurrentCultureIgnoreCase. Using strcmp for asciiz strings and strcoll when you're comparing user input seems fine to me. If the environment is unicode, the input had to fit into a bytestring anyway for it to get into argv. The cool thing is that you get better i18n support without putting unicode in your language, because thinking of characters as the primitive unit of I/O is exactly what's wrong.
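
The strcmp/strcoll split even exists in Python 2's stdlib, for the record. A sketch:

import locale
locale.setlocale(locale.LC_COLLATE, '')  # adopt the user's locale

print 'B' < 'a'                     # byte comparison, like strcmp: always True
print locale.strcoll('B', 'a') < 0  # locale-aware, like strcoll: depends on the user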

75 Name: #!/usr/bin/anonymous : 2008-02-16 15:30 ID:Heaven

Oh god, someone put this thread out of its misery.

76 Name: #!/usr/bin/anonymous : 2008-02-17 13:42 ID:Heaven

>>74
Default encoding depends on the locale. The locale depends on the user. Personally on my Linux machines I have it set to UTF-8.
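
You can ask Python what it thinks the default is (the output here obviously reflects my locale):

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'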

77 Name: #!/usr/bin/anonymous : 2008-02-17 16:43 ID:Heaven

>>76

Right, but if another user doesn't know what encoding it is, that user might not be able to read it.

Seriously: You wouldn't expect cat to refuse to type out a file because $LANG was set wrong, would you?

Unicode is hard. It's not surprising there aren't any languages that have gotten it right yet, and the arguments about "how do you ellipsize text without a unicode string" demonstrate just how insulated most programmers are from these problems.

What is surprising is how much ignorance there is on the subject (the underlying assumption being that unicode is basically a solved problem), and how much vehement badmouthing goes toward a language that hasn't implemented anything about it yet.

78 Name: dmpk2k!hinhT6kz2E : 2008-02-17 18:26 ID:Heaven

> "how do you elipse text without a unicode string" demonstrate just how insulated most programmers are from these problems.

Hay, I thought that was the idea!

It's why we leave memory to a GC and don't build our own stack frames. If it's hard we should leave it to experts to solve, and hopefully they'll be nice enough to provide us a simple API that hides the hairiness.

79 Name: #!/usr/bin/anonymous : 2008-02-17 22:05 ID:Heaven

> Right, but if another user doesn't know what encoding it is, that user might not be able to read it.

That's not an argument against unicode. That's just common sense and it holds true no matter how you do the string processing in your program.

80 Name: #!/usr/bin/anonymous : 2008-02-17 23:10 ID:ONvOLVru

>>78 I didn't say the expectation was wrong, just the implementation. All these so-called experts are still figuring it out themselves, and have been for over two decades. Ignoring that fact is extremely dangerous.

Until it's a solved problem, all programmers need to be at least vaguely aware of just how bad this is. Right now: files and network-streams use bytes. Operating systems use bytes. Bytes are very well understood. Bignum strings are not.

>>79 No, you're absolutely right. It's not an argument against unicode; it's an argument against building unicode into the language, precisely because programmers have to be aware of what's wrong and why it's wrong. So long as they don't know, and don't think about this sort of thing, they're making mistakes, and some of them big ones.

81 Name: #!/usr/bin/anonymous : 2008-02-18 00:43 ID:Heaven

82 Name: #!/usr/bin/anonymous : 2008-02-18 00:47 ID:Heaven

>>81 I described python's behavior. Other languages aren't dissimilar.

83 Name: #!/usr/bin/anonymous : 2008-02-18 12:50 ID:Heaven

>>82

I am talking about your stuff like utf8_isspace().

84 Name: #!/usr/bin/anonymous : 2008-02-18 13:22 ID:A7jyxFQb

> Seriously: You wouldn't expect cat to refuse to type out a file because $LANG was set wrong, would you?

It will result in bogus output, but it won't be cat's fault: cat types out bytes, and it's the terminal that decodes them incorrectly into garbage. To the end user the result is more or less the same whether it's cat's fault or not, though.

85 Name: #!/usr/bin/anonymous : 2008-02-18 18:10 ID:Heaven

>>83 Why do you think utf8_isspace() is worse than .decode('utf8').isspace()? I think utf8_isspace() is better, because in the latter case the programmer is likely to just use isspace() instead and be wrong.
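
To be clear about what's being compared: a utf8_* helper is nothing more than a thin wrapper that makes the coding explicit at the call site. In Python 2 (hypothetical function name):

def utf8_isspace(bytestring):
    # Decode here, and only here, because whitespace is a character-level question.
    return bytestring.decode('utf-8').isspace()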

>>84 Are you saying you would prefer that cat refuse to type out a file because $LANG was set wrong?

And to be clear: the user isn't always aware of what's wrong. Many users insist that utf-8 is wrong on IRC, because many people outside the US don't use utf-8 on IRC (for whatever legacy reason). IRC doesn't transcode for people, and the protocol doesn't specify the character set. It's a mess, but it's a mess that programmers ultimately have to deal with.

86 Name: #!/usr/bin/anonymous : 2008-02-19 12:08 ID:Heaven

>>85
utf8_isspace() by itself is fine, until you have utf8_chomp(), utf8_substring(), utf8_indexof(), utf8_charat(), utf8_reverse(), utf8_split() and a few dozen other functions to do all the other string manipulations you need. Then you realise that instead of decoding it into characters once, you're decoding it every time you call a utility method. It makes more sense to get everything into characters up-front, and do the decode/encode only once for the entire journey.
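
Spelled out, the decode-once pattern looks like this (Python 2; the filenames are made up):

raw = open('in.txt').read()       # bytes at the boundary
text = raw.decode('utf-8')        # decode exactly once, on the way in
words = text.split()              # everything in between operates on characters
result = u' '.join(w.capitalize() for w in words)
open('out.txt', 'w').write(result.encode('utf-8'))  # encode exactly once, on the way out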

If you're worried about developers using isspace() on a raw byte, then don't put that method on the Byte class, only put it on the Character class. What is a "space byte" anyway?

And yeah, IRC as a protocol is evil (most people on there suck too, but that's another story). It turned out not to be very good for anything beyond ASCII, and someone had to perform some kind of hack to make it work. Since there is no way to specify the character set (although this could have been added trivially via a new command), UTF-8 would probably be the best choice out of all the bad choices.

At least XMPP standardises on UTF-8. Even though it's based on XML, the standard clearly specifies that use of other character sets is non-standard.

87 Name: #!/usr/bin/anonymous : 2008-02-19 14:21 ID:Heaven

>>85

Why do I think that?

You should be asking, why does pretty much every language designer out there think it is worse. That was my point.

Once again: We tried that, and it was horrible. Now we don't do that any more.

88 Name: #!/usr/bin/anonymous : 2008-02-19 17:58 ID:i+ITJfDJ

>>87

> You should be asking, why does pretty much every language designer out there think it is worse. That was my point.

Oh, okay, so because every language designer thinks cat should complain about poorly coded text files, you think so as well?

I think it's much more likely that every language designer was wrong. Heck, Guido knows he's wrong, which is why python2 doesn't work like python1, and python3 will be deliberately incompatible on this point.

Larry went through similar growing pains; 5.0, 5.6, 5.8 and 5.10 all work differently. perl6 may be different still. This obviously isn't solved.

Common Lisp has strict isolation, and yet the default external coding for files is :default, which is simply some undefined, implementation-specific coding system.

Javascript implementations frequently disagree on what coding their source is in and what coding their data is in. us-ascii is the only thing that works portably.

Java has the worst of both worlds: slow character access, a lying length, and a horde of supporters who think it has "pretty good unicode support".

Scheme provides no information on which character code maps to which code point, which means it is IMPOSSIBLE to implement case folding or locale-aware collating routines in portable Scheme.

How can you possibly believe that every other language has unicode? How can you possibly believe that the state of unicode is such that we can expect a certain level of functionality out of new protocols and new languages?

> Once again: We tried that, and it was horrible. Now we don't do that any more.

GNOME is one of the finest, most i18n-y application suites I've ever seen. It correctly handles bidi, currencies, datetime formats, and so on. And it's written in C. Writing _("text") fragments is easy, and it encourages the programmer to use the formatting macros, so everyone does it. Searching for strings is a small thing; searching for string manipulation functions is much harder.

i18n and localization is quite a bit better in C than it is in all these "new languages" being made by these language designers you keep talking about.
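
And the _("text") pattern isn't even C-specific: Python ships a gettext module in the stdlib that does the same thing, which rather proves you don't need unicode strings baked into the language to get decent i18n. A sketch (the domain name is made up):

import gettext
gettext.install('myapp')       # binds _() into builtins
print _("File not found")      # looked up in the locale's message catalog, when one is installed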

89 Name: #!/usr/bin/anonymous : 2008-02-19 18:27 ID:Heaven

> I think it's much more likely that every language designer was wrong. Heck, Guido knows he's wrong which is why python2 doesn't work like python1, and python3 will be deliberately incompatible on this point.

Yes, everybody was wrong. You're right, now guide us down the path to sanity and design us the perfect language that's completely free of any problems whatsoever. You seem to know everything about internationalization ever, so quit with the small talk and show us your amazing solution.

90 Name: #!/usr/bin/anonymous : 2008-02-19 19:57 ID:Heaven

> Oh, okay, so because every language designer thinks cat should complain about poorly coded text files, you think so as well?

Nobody thinks that outside of your head. Please don't insult our intelligence with such utterly ridiculous strawman arguments. Seriously.

> This obviously isn't solved.

It may not be solved, but you're saying we should stop even trying to solve it, and just suffer on with the solutions we used to have, which were much, much worse.

All you're doing is picking on problems other languages have, but not offering anything that isn't many times worse. Pretty much everybody would much rather have those problems than suffer under the horrible inflexibility you seem to think is preferable.

> GNOME

Is an application, not a programming language. Also, nobody has said that you can't write internationalized code in C. We are saying that it is much more work than in a language designed to handle this properly. And even so GNOME only needs to use a single character set, instead of a multitude of them.

> i18n and localization is quite a bit better in C than it is in all these "new languages" being made by these language designers you keep talking about.

Find a single real programmer who agrees with you on that, and perhaps I will take you seriously.

91 Name: #!/usr/bin/anonymous : 2008-02-19 19:59 ID:i+ITJfDJ

>>89 I have seen many problems, and I can share what I consider obvious. But I do not have such hubris that I think I have all the answers.

Nevertheless, here are some things I think might help:

  1. Unicharacter strings whose character codes align with ascii make bugs hard to catch in testing, at least for people unfamiliar with the subtleties of unicode. Possible solution: make it harder to stay unfamiliar by XORing all the unicharacter codes with some magic number.
  2. Case transformation and collating are locale-sensitive. All unicharacter library routines should take a locale-identifier where appropriate. The default locale should be Russian everywhere but Russia, and Chinese there. Word wrapping, collating, ellipsizing, and case transformation should require this locale setting. Bonus: make the locale a dynamic variable instead of an argument.
  3. Streams are always binary. They might have many coded things, and they might even change coding in the middle. Don't ask for coding at open-time, but still make a per-stream default character set so that read() and write() can use a "default" character set sensibly.
  4. read() should accept a character set as an argument. If being given the "default" character set, it will produce a unistring. Without any character-set argument (or null, or None, etc) it should produce a bytestring.
  5. read() should be aware of a per-stream flag that indicates what byte-sequence (when read) stops automatic decoding.
  6. write() should accept a character set as an argument. It should generate an error if it cannot encode. It must not attempt to coerce a string into a bytestring if a character set argument is omitted.
  7. Message strings should be allocated specially, to ensure they're being setup by the i18n allocator. printf-like functions should not process %-masks on strings that didn't come from this allocator (or that aren't constant). This would eliminate a large number of security bugs that come from poor i18n practices as a side-effect.

That covers most of the ones I see on a daily basis. They're all easy to implement as libraries, because that's how I use them (points 3 and 4 are sketched below). The unicode support in most languages is simply too buggy, with too many subtle problems, for it to be easy to do the right thing. Obviously, language-level support for these things would work differently from these library-centric ideas, but hopefully it would solve the same kinds of problems.
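
Here is roughly what points 3 and 4 look like as a library. Python 2, and all the names are made up:

class BinaryStream(object):
    def __init__(self, fileobj, default_charset=None):
        self.f = fileobj
        self.default_charset = default_charset  # per-stream default (point 3)

    def read(self, size=-1, charset=None):
        data = self.f.read(size)            # the stream itself is always bytes
        if charset == 'default':
            charset = self.default_charset  # resolve the per-stream default
        if charset is None:
            return data                     # no charset given: bytestring (point 4)
        return data.decode(charset)         # named charset: unistring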

92 Name: #!/usr/bin/anonymous : 2008-02-19 20:09 ID:Heaven

>>90

> Nobody thinks that outside of your head. Please don't insult our intelligence with such utterly ridiculous strawman arguments. Seriously.

You're being naive. People do think this. Seriously. What do you expect this to do?

f = file("in")
shutil.copyfile(f, out)
> We are saying that it is much more work than in a language designed to handle this properly.

And I am pointing out that you're wrong on both counts, and to make matters worse no language exists that handles it properly.

> Find a single real programmer who agrees with you on that, and perhaps I will take you seriously.

No you won't. Perhaps you live in a fantasy world where unicode is something everyone does the same way (they don't). You reject a language that doesn't have explicit support for unicode because "everyone else does", but then you admit that unicode is hard and that nobody is doing it right yet.

> It may not be solved, but you're saying we should stop even trying to solve it, and just suffer on with the solutions we used to have, which were much, much worse.

No, I'm not. I started by saying don't reject Arc for not having unicode support, because it doesn't need unicode support. Unicode support is completely broken WRT the web anyway. I'm saying nobody's got it right, so quit acting like the expectation is normal. Your expectation for unicode support is broken and brain-damaged, because it isn't being fulfilled by any language you or anyone else here has brought up.

FWIW, I think languages could have useful unicode support if they bothered to look at how people fuck up unicode support and explicitly target that. I'm convinced that the reason nobody has decent unicode support is that it's hard, and I'm objecting to the idea that it's easy and that what every language has is a good thing.

> Pretty much everybody would much rather have those problems than suffer under the horrible inflexibility you seem to think is preferable.

Those people are idiots then, or they've never given it any serious thought. I suspect more of the latter, looking at this thread: "oh python has unicode support xxx yyy" and that's the end of it.

93 Name: #!/usr/bin/anonymous : 2008-02-19 22:30 ID:Heaven

>>92

Are you even aware that most languages do make a distinction between byte streams and character streams? I've never seen a language where you had to decode byte streams from a file yourself. You seem to be living in some insane fantasy land where people actually have to do this. If that were actually the case, you'd be right; it would be stupid.

Thing is, that is not the case.

If you're not going to argue about the real world, why should we bother listening to you?

94 Name: #!/usr/bin/anonymous : 2008-02-19 23:38 ID:Heaven

>>93

> Are you even aware that most language do make a difference between byte streams and character streams?

Are you even aware there's no such thing as a character stream in most languages? The stream itself is in bytes. In order to operate on characters you need to decode, which produces charstrings from bytestrings.

Very few languages differentiate between bytestrings and charstrings:

>>> 'foo' == u'foo'
True

Python doesn't.

perl -e 'print do { use bytes; "foo"; } eq do { use utf8; "foo"; };'

Perl doesn't either.

(Common Lisp, however, does.)

Since you don't know the coding of a file (because the filesystem doesn't know), or perhaps since you're writing an IRC client and the coding can change mid-stream (and change back with a terminator character), you're almost certainly writing buggy code.

> I've never seen a language where you had to decode byte streams from a file.

That's a problem. You must decode bytestrings into charstrings if you're going to operate on them as characters. Because character operations also work on bytestrings, you can end up working on something you think is decoded when it isn't.

I recommend people either avoid character operations or explicitly type their strings as charstrings or bytestrings, even though that means keeping track of the coding throughout a potentially long path. Other programmers recommend people simply use unicode for everything, but that causes people to code/decode needlessly, and it introduces new places where exceptions can be raised by surprise.
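
The surprise isn't hypothetical, either. The classic Python 2 example:

print u'caf\xe9'  # fine on a UTF-8 terminal; pipe the output somewhere and
                  # Python falls back to ascii: UnicodeEncodeError at runtime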

> If you're not going to argue about the real world, why should we bother listening to you?

You shouldn't listen to me. You should think about this yourself. Programmers are supposed to be thinking about this sort of thing. You are not thinking about it. You're arguing that other programmers are doing the same thing when they're clearly not, and using that as justification to be an asshat.

95 Name: #!/usr/bin/anonymous : 2008-02-20 03:17 ID:Heaven

>>94
How about using the correct comparison operators to answer the question asked, instead of carefully constructing misleading code to support your argument? Python most certainly differentiates between the two; that's why there are DIFFERENT CLASSES for unicode and str. You can set a coding for a Python source file, which defaults to ASCII, and that is what you're testing: that the unicode representation of a string is equal to the ASCII representation.

>>> 'foo' is 'foo'
True
>>> 'foo' is u'foo'
False

> You shouldn't listen to me.

Done. I shall now fully ignore this thread, for it gets stupider with every post.

96 Name: #!/usr/bin/anonymous : 2008-02-20 17:41 ID:i+ITJfDJ

>>95

I assume you're accepting all the other points in >>94, which, according to your statement in >>93, means you agree that almost all languages (including python) have a stupid implementation of unicode.

I'll only cover the specific point brought up in >>95.

The is operator is irrelevant. Python will automatically promote a bytestring to a charstring. Consider the following:

>>> 'foo'+u'foo' 
u'foofoo'

This causes problems if the bytestring contains non-ascii characters:

>>> '\xa0'+u'foo'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

This is clearly a bug, but one that will simply never come up for someone who is primarily testing with us-ascii data.

If python refused to convert bytestrings to charstrings this bug wouldn't exist.

> You can set a coding for a Python file, which defaults to ASCII, and that is what you're testing - that the Unicode representation of a string is equal to the ASCII representation.

No, you're not. You're testing whether a charstring contains the same characters as a bytestring. The problem is that bytestrings don't really contain characters:

>>> u'\xa0' == '\xa0'
False

Python makes this compromise for legacy (pre-unicode) python code, and that's why Guido says he intends to fix it for Py3K. But this compromise creates bugs that programmers won't notice, and that's bad. If the language didn't treat bytestrings as us-ascii-encoded charstrings, 'foo' == u'foo' would fail and the programmer would notice immediately that something was being left non-unicode and go fix it.

97 Name: #!/usr/bin/anonymous : 2008-02-21 03:00 ID:Heaven

>>95

> Done. I shall now fully ignore this thread, for it gets stupider with every post.

I bailed in the 50s.

This thread has been closed. You cannot post in this thread any longer.