Mark Smith ([staff profile] mark) wrote in [site community profile] changelog, 2009-03-12 07:44 am

[dw-free] Allow importing of your journal from another LiveJournal-based site.

[commit: http://hg.dwscoalition.org/dw-free/rev/0f9dbaa3f125]

http://bugs.dwscoalition.org/show_bug.cgi?id=114

This relatively small patch represents about a day's worth of work, sadly.
It turns out there's a pretty gnarly gotcha in the way the LJ codebase does
'syncitems' that can cause you to miss data. The fix is to step the
lastsync time backwards by a second.

Yes, this causes you to potentially get some extra data you don't want to
see. This also won't solve the problem in the case where more than 500
items share the same lastsync time. (In which case there is no fix.)
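
A minimal sketch of the gotcha, under stated assumptions rather than the real
protocol: the remote is modelled as returning only items strictly newer than
lastsync, oldest first, capped at a fixed page size (3 here, standing in for
500). The names fake_syncitems and sync_all are hypothetical, and step_time
mirrors the patch's helper but uses core modules instead of LJ::mysql_time and
LJ::mysqldate_to_time.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Step a MySQL-formatted (UTC) timestamp by $offset seconds.
sub step_time {
    my ( $mysql_time, $offset ) = @_;
    my ( $y, $mo, $d, $h, $mi, $s ) =
        $mysql_time =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)$/
        or die "bad time: $mysql_time";
    my $epoch = timegm( $s, $mi, $h, $d, $mo - 1, $y );
    return strftime( '%Y-%m-%d %H:%M:%S', gmtime( $epoch + $offset ) );
}

# Hypothetical stand-in for the remote end (an assumption, not the real API):
# return at most $page_size items strictly newer than $lastsync, oldest first.
sub fake_syncitems {
    my ( $items, $lastsync, $page_size ) = @_;
    my @newer = grep { !defined $lastsync || $_->{time} gt $lastsync }
                sort { $a->{time} cmp $b->{time} || $a->{id} <=> $b->{id} } @$items;
    my $last = $#newer < $page_size - 1 ? $#newer : $page_size - 1;
    return @newer[ 0 .. $last ];
}

# Page through the fake remote, stepping each item time back by $backstep
# seconds before using it as the next lastsync. Returns the ids that were seen.
sub sync_all {
    my ( $items, $page_size, $backstep ) = @_;
    my ( $lastsync, %seen );
    while (1) {
        my @page = fake_syncitems( $items, $lastsync, $page_size );
        last unless @page;
        my $new = 0;
        for my $item (@page) {
            $new++ unless $seen{ $item->{id} }++;
            my $t = step_time( $item->{time}, -$backstep );
            $lastsync = $t if !defined $lastsync || $t gt $lastsync;
        }
        last unless $new;    # only repeats came back, so we are done
    }
    return sort { $a <=> $b } keys %seen;
}

# Items 3 and 4 share a timestamp that straddles the page boundary.
my @items = (
    { id => 1, time => '2008-01-01 12:00:00' },
    { id => 2, time => '2008-01-01 12:00:01' },
    { id => 3, time => '2008-01-01 12:00:05' },
    { id => 4, time => '2008-01-01 12:00:05' },
    { id => 5, time => '2008-01-01 12:00:09' },
);

printf "without back-step: %s\n", join ',', sync_all( \@items, 3, 0 );  # 1,2,3,5
printf "with back-step:    %s\n", join ',', sync_all( \@items, 3, 1 );  # 1,2,3,4,5
```

Without the back-step, item 4 is lost because the next request asks for items
newer than the time it shares with item 3. With the back-step, items 3 and 5
are fetched a second time, which is the extra data mentioned above. In this
sketch, a same-time group that fills a whole page makes the back-stepped query
return the same page again, which corresponds to the no-fix case.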

Also, we've decided to import deleted comments with no children, just to
make imported comment counts match up. Much easier than trying to explain
"well, some of your comments are not imported if they match these criteria."
(Of course, note that deleted comments don't have content - we merely insert
a placeholder.)

Patch by [staff profile] mark.

Files modified:
  • cgi-bin/DW/Worker/ContentImporter/LiveJournal/Comments.pm
  • cgi-bin/DW/Worker/ContentImporter/LiveJournal/Entries.pm
--------------------------------------------------------------------------------
diff -r 6115e6b86caf -r 0f9dbaa3f125 cgi-bin/DW/Worker/ContentImporter/LiveJournal/Comments.pm
--- a/cgi-bin/DW/Worker/ContentImporter/LiveJournal/Comments.pm	Wed Mar 11 05:36:29 2009 +0000
+++ b/cgi-bin/DW/Worker/ContentImporter/LiveJournal/Comments.pm	Thu Mar 12 07:44:36 2009 +0000
@@ -283,7 +283,6 @@ sub try_work {
             # rules we might skip a content with
             next if $comment->{done}; # Skip this comment if it was already imported this round
             next if $jtalkid_map->{$comment->{orig_id}}; # Or on a previous import round
-            next if $comment->{state} eq 'D' && !$comment->{has_children}; # Or if the comment is deleted, and child-less
 
             # now we know this one is going in the database
             $ct++;
diff -r 6115e6b86caf -r 0f9dbaa3f125 cgi-bin/DW/Worker/ContentImporter/LiveJournal/Entries.pm
--- a/cgi-bin/DW/Worker/ContentImporter/LiveJournal/Entries.pm	Wed Mar 11 05:36:29 2009 +0000
+++ b/cgi-bin/DW/Worker/ContentImporter/LiveJournal/Entries.pm	Thu Mar 12 07:44:36 2009 +0000
@@ -79,8 +79,17 @@ sub try_work {
     my $entry_map = DW::Worker::ContentImporter::Local::Entries->get_entry_map( $u ) || {};
     $log->( 'Loaded entry map with %d entries.', scalar( keys %$entry_map ) );
 
+    # this is a helper sub that steps a MySQL formatted time by some offset
+    # arguments: '2008-01-01 12:03:53', -1 ... returns '2008-01-01 12:03:52'
+    my $step_time = sub {
+        return LJ::mysql_time( LJ::mysqldate_to_time( $_[0] ) + $_[1] );
+    };
+
     # load the syncitems list; but never try to load the same lastsync time twice, just
-    # in case 
+    # in case.  also, we have to do some pretty annoying back-steps and not actually trust
+    # the last synced time because it's possible in some rare cases to lose entries by
+    # just trusting what the remote end is telling you.  (FIXME: link to a writeup of this
+    # somewhere...)
     my ( $lastsync, %tried_syncs, %sync );
     while ( $tried_syncs{$lastsync} < 2 ) {
         $log->( 'Calling syncitems; lastsync = %s.', ( $lastsync || 'undef' ) );
@@ -88,13 +97,22 @@ sub try_work {
         return $temp_fail->( 'XMLRPC failure: ' . $hash->{faultString} )
             if ! $hash || $hash->{fault};
 
+        open FILE, ">>/tmp/hashdump";
+        print FILE LJ::D( $hash );
+        close FILE;
+
         foreach my $item ( @{$hash->{syncitems} || []} ) {
             next unless $item->{item} =~ /^L-(\d+)$/;
-            $sync{$1} = [ $item->{action}, $item->{time} ];
-            $lastsync = $item->{time}
-                if !defined $lastsync || $item->{time} gt $lastsync;
-            $tried_syncs{$lastsync}++;
+
+            my $synctime = $step_time->( $item->{time}, -1 );
+
+            $sync{$1} = [ $item->{action}, $synctime ];
+            $lastsync = $synctime
+                if !defined $lastsync || $synctime gt $lastsync;
         }
+
+        # now we can mark this, as we have officially syncd this time
+        $tried_syncs{$lastsync}++;
 
         $log->( '    retrieved %d items and %d left to sync', $hash->{count}, $hash->{total} );
         last if $hash->{count} == $hash->{total};
@@ -132,7 +150,7 @@ sub try_work {
             # $tries, so we can break the 'broken client' logic (note: we assert that we are
             # not broken.)
             my @keys = sort { $sync{$a}->[1] cmp $sync{$b}->[1] } keys %sync;
-            $lastgrab = LJ::mysql_time( LJ::mysqldate_to_time( $sync{$keys[0]}->[1] ) - $tries );
+            $lastgrab = $step_time->( $sync{$keys[0]}->[1], -$tries );
 
             $log->( 'Loading entries; lastsync = %s.', $lastgrab );
             $hash = $class->call_xmlrpc( $data, 'getevents',
--------------------------------------------------------------------------------