changelog | [dreamwidth/dreamwidth] 39a949: SearchCopier: stream from cluster DBs, match Sphin...

Branch: refs/heads/main
Home: https://github.com/dreamwidth/dreamwidth
Commit: 39a9497745cdab9f36c8d8cd669f7457a3595fd6
https://github.com/dreamwidth/dreamwidth/commit/39a9497745cdab9f36c8d8cd669f7457a3595fd6
Author: Mark Smith mark@dreamwidth.org
Date: 2026-04-23 (Thu, 23 Apr 2026)

Changed paths: M cgi-bin/DW/Task/SearchCopier.pm

Log Message:

SearchCopier: stream from cluster DBs, match SphinxCopier logging

importfull was doing selectall_arrayref on both log2+logtext2 and talk2+talktext2 for the journal, which loads every row into Perl memory before processing any of them. Workers were OOMing on real-world accounts.

Switched both loops to prepare + execute + fetchrow_hashref with mysql_use_result=1 so DBD::mysql actually streams rather than buffering the full result set client-side. Also merged the old "fetch metadata, then fetch text in batches of 1000" comment path into a single talk2 + talktext2 join, since we're streaming now. Working memory is bounded at one row at a time plus the %entry_bits map (jitemid -> bits arrayref) kept around for comment security inheritance.
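A minimal sketch of the streaming pattern described above, not the actual SearchCopier code. The helper names (stream_rows, copy_journal), the callbacks ($bits_for_entry, $emit), and the exact column lists and join conditions are assumptions made for illustration; only the prepare + execute + fetchrow_hashref shape, the mysql_use_result attribute, and the %entry_bits map come from the commit message.

```perl
use strict;
use warnings;

# Hypothetical helper: run one query and hand each row to a callback as it
# arrives, instead of collecting everything with selectall_arrayref.
sub stream_rows {
    my ( $dbh, $sql, $binds, $on_row ) = @_;

    # mysql_use_result => 1 asks DBD::mysql to read rows from the server as
    # they are fetched rather than buffering the full result set client-side.
    my $sth = $dbh->prepare( $sql, { mysql_use_result => 1 } )
        or die "prepare failed: " . $dbh->errstr;
    $sth->execute(@$binds)
        or die "execute failed: " . $sth->errstr;

    my $count = 0;
    while ( my $row = $sth->fetchrow_hashref ) {
        $on_row->($row);    # only this one row is held in memory
        $count++;
    }
    die "fetch failed: " . $sth->errstr if $sth->err;
    return $count;
}

# Hypothetical driver: entries first, then comments joined to their text.
# Column names and join conditions below are assumed for illustration.
sub copy_journal {
    my ( $dbh, $journalid, $bits_for_entry, $emit ) = @_;
    my %entry_bits;    # jitemid -> bits arrayref, kept for the comment pass

    stream_rows(
        $dbh,
        'SELECT l.jitemid, l.security, l.allowmask, t.subject, t.event '
            . 'FROM log2 l JOIN logtext2 t '
            . 'ON t.journalid = l.journalid AND t.jitemid = l.jitemid '
            . 'WHERE l.journalid = ?',
        [$journalid],
        sub {
            my ($row) = @_;
            my $bits = $entry_bits{ $row->{jitemid} } = $bits_for_entry->($row);
            $emit->( entry => $row, $bits );
        },
    );

    stream_rows(
        $dbh,
        'SELECT c.jtalkid, c.nodeid, t.subject, t.body '
            . 'FROM talk2 c JOIN talktext2 t '
            . 'ON t.journalid = c.journalid AND t.jtalkid = c.jtalkid '
            . "WHERE c.journalid = ? AND c.nodetype = 'L'",
        [$journalid],
        sub {
            my ($row) = @_;
            # Comments inherit the security bits of the entry they hang off.
            $emit->( comment => $row, $entry_bits{ $row->{nodeid} } );
        },
    );
    return;
}
```

One thing to keep in mind with this approach: while an unbuffered result is being read, the same connection cannot run another statement until every row has been fetched, so the per-row callbacks in a sketch like this should not issue queries on the same handle.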

Also upgraded work()'s logging to match SphinxCopier's verbosity so it's actually possible to tell who a job is for from the logs:

  "Search copier started for [Unknown site tag](), source ."  INFO
  "Requested copy of only entry ."  INFO
  "Requested copy of only comment ."  INFO
  "Requested complete recopy of user."  INFO
  "Copied less than a day ago. Skipping."  INFO

Start/branch lines emit at INFO; the end-of-run summary stays at DEBUG on clean success and is raised to WARN only when there were errors.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
