kareila: (Default)
kareila ([personal profile] kareila) wrote in [site community profile] changelog2010-06-29 07:37 pm

[dw-free] For XMLRPC, clear UTF8 flag from all strings, not just subject and event

[commit: http://hg.dwscoalition.org/dw-free/rev/c1e5d7213dfc]

http://bugs.dwscoalition.org/show_bug.cgi?id=2310

Before, we were only checking utf8 encoding for subject and event. It's
better to handle all utf8 strings from the request properly, using the
Encode module.

Codemerge from LiveJournal; prepared for Dreamwidth by [staff profile] denise.

Files modified:
  • cgi-bin/Apache/LiveJournal.pm
--------------------------------------------------------------------------------
diff -r 99ed26f60316 -r c1e5d7213dfc cgi-bin/Apache/LiveJournal.pm
--- a/cgi-bin/Apache/LiveJournal.pm	Wed Jun 30 02:12:57 2010 +0800
+++ b/cgi-bin/Apache/LiveJournal.pm	Tue Jun 29 14:37:20 2010 -0500
@@ -1804,6 +1804,7 @@ sub anti_squatter
 }
 
 package LJ::Protocol;
+use Encode();
 
 sub xmlrpc_method {
     my $method = shift;
@@ -1818,10 +1819,9 @@ sub xmlrpc_method {
     }
     my $error = 0;
     if (ref $req eq "HASH") {
-        foreach my $key ('subject', 'event') {
-            # get rid of the UTF8 flag in scalars
-            $req->{$key} = LJ::no_utf8_flag ( $req->{$key} )
-                if $req->{$key};
+        # get rid of the UTF8 flag in scalars
+        while ( my ($k, $v) = each %$req ) {
+            $req->{$k} = Encode::encode_utf8($v) if Encode::is_utf8($v);
         }
     }
     my $res = LJ::Protocol::do_request($method, $req, \$error);
--------------------------------------------------------------------------------
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2010-06-29 08:35 pm (UTC)(link)
This scares me a lot.

Are you sure that this does the same thing? I'm really not -- I don't think that encoding something with Encode is the same as saying 'Perl, I'm managing this myself, so please leave it alone'.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2010-06-29 08:42 pm (UTC)(link)
a) That doesn't give me much confidence.
b) This DOES give me confidence, but still, I would expect you to understand and be able to say 'yes, this works' since you committed it.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2010-06-29 08:53 pm (UTC)(link)
Nice!

I did read the comments to the bug, and they are pretty unclear. Sophie said "technically not the right way to do this" and then later you said "thumbs up on IRC". So, from my reading, there's no useful information in the comments to that bug. If there were, then I wouldn't have commented to you.

I did my due diligence, and then when I couldn't figure it out, I asked you. In a very nice, professional fashion. I would ask that you provide the same courtesy.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2010-06-29 09:26 pm (UTC)(link)
Actually, it was both nice and professional.

I have worked on this codebase for nearly a decade and I know which of the areas are most likely to cause problems and be poorly understood. Encoding is a top three on that list.

If you get the encoding wrong, or do something that Perl or the databases don't expect, the cleanup process is a giant mess. (Been there, done that.) It's difficult, time consuming, and things are just broken while we sort through it. It's also vastly misunderstood or not understood at all by the majority of devs. Very few people understand it, which isn't a bad thing, but it does mean changing this is difficult and -- yes -- quite a bit scary.

I expect the committers to understand what it is they are committing. You are signing off on that patch, saying, "If something breaks, you can call me too". If that's not the case, you shouldn't be putting your name to it. I know that some things are particularly difficult and hard to work with, encoding, clustering, user moves, translation, dversions, etc, that's fine -- not everybody can be an effective committer on everything. I'd rather have people stay away from patches they don't understand than just commit stuff and hope for the best.

At the end of the day, though, my #1 job here on Dreamwidth is to make sure that we have a stable, secure, available site. This means I will be questioning commits from time to time, and if that is going to bother you, we're going to have to address that and figure something out. I am never going to be able to stop questioning things that I think need to be examined.
sophie: A cartoon-like representation of a girl standing on a hill, with brown hair, blue eyes, a flowery top, and blue skirt. ☀ (Default)

[personal profile] sophie 2010-06-30 12:45 pm (UTC)(link)
For the record:

Perl's internal encoding that it tries to use is called "utf8" (no hyphen), and when the utf8 flag of a scalar is on, Perl treats the contents of that scalar as being in this encoding.

Calling Encode::encode_utf8($v) does the same as Encode::encode('utf8', $v);. What encode() actually does is it takes the encoding given, and returns a version of $v encoded as a series of bytes in that encoding, with the utf8 flag off. Because Perl treats utf8-flagged scalars as this encoding anyway, it's basically just flipping the flag off, but it's doing so in a way that's guaranteed to work with future versions of Perl.

The reason I said that it's technically not the right way to do this is because Perl's internal utf8 encoding is more lax than proper 'UTF-8' (hyphen), and as such it's possible that there would be invalid UTF-8 characters in a string encoded as 'utf8'. When outputting for display, strings should be encoded as 'UTF-8' (or its canonical name in Perl, 'utf-8-strict'), and not 'utf8'. As such, encode_utf8 is technically wrong, but it does what we need it to do in this case, which is to flip the flag without making any other changes, and it does so in a way that we know that any encode_utf8() calls in the code need to be changed as part of bug 443.

Hopefully that makes it clear why I okayed it.
alierak: (Default)

[personal profile] alierak 2010-06-30 03:35 am (UTC)(link)
In 5.10 IIRC we were no longer able to say 'please leave it alone', we had to say 'please temporarily treat it as a string of bytes, and then make me a copy that doesn't have the utf8 flag'. The semantics of encode_utf8 are something closer to 'make me a copy in Perl's utf8 encoding and return it as a sequence of bytes without the utf8 flag', which has the same result without being a kludge. I guess you could test it for a bunch of random byte sequences if you had to. The only differences I see are that we won't be making copies anymore when the original string doesn't have the utf8 flag, and we dropped compatibility for Perl < 5.8. Shrug.

That said, I think you missed the part where [personal profile] sophie wants us to stop managing it ourselves and actually use characters.

mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2010-06-30 08:40 am (UTC)(link)
Thank you for the explanation!

The thing I was missing is that encode_utf8 chains to encode which munges the SVf_UTF8 flag to be off. It was also throwing me off the scent to see that we added a is_utf8 check -- confusing me, because that changes the semantics of the loop. Before it was 'unconditionally turn this off' now it's 'turn this off if the data is already valid utf8'.

If the user submits non-utf8 data, it seems like the behavior of this loop is going to change. This may not actually affect us, though, because we might be doing checks for valid input earlier on or later... but still, it just looks like we're changing how it operates.

I didn't miss that part -- and I totally agree. But this seems like a fairly random swap from one method of accomplishing the utf8 flag removal to another method, with an added bonus of being odd code ('each' instead of iterating over 'keys'?), not following the style guide, and apparently being semantically different (with the is_utf8 check thrown in).

I fully admit that I only spent about 20 minutes looking at it. But that's all the time I had, at which point I threw an exception and asked for help to try to understand what this is doing.
alierak: (Default)

[personal profile] alierak 2010-06-30 11:55 am (UTC)(link)
is_utf8 doesn't check validity per se, only the utf8 flag, so no that's not a big change. I think [personal profile] kareila said most of the weirdness came from the fact that it's an LJ patch (and I still don't understand why they needed a new way of stripping the utf8 flag instead of calling the existing function), but she has been encouraging people to use 'each' in DW.