github wrote in changelog, 2026-04-23 01:36 am

[dreamwidth/dreamwidth] b8e8de: SearchCopier: rewrite as direct port of SphinxCopi...

Branch: refs/heads/main
Home: https://github.com/dreamwidth/dreamwidth
Commit: b8e8ded3b31d3871f41a68cf0dc160db5ce18d94
https://github.com/dreamwidth/dreamwidth/commit/b8e8ded3b31d3871f41a68cf0dc160db5ce18d94
Author: Mark Smith <mark@dreamwidth.org>
Date: 2026-04-23 (Thu, 23 Apr 2026)

Changed paths:
  M bin/search-tool
  M cgi-bin/DW/Task/SearchCopier.pm

Log Message:


SearchCopier: rewrite as direct port of SphinxCopier patterns

The prior SearchCopier took its own shape — bulk selectall_arrayref, ad-hoc chunking, per-doc log lines, wholesale DELETE-then-rebuild per journal — and missed practices SphinxCopier has been using in prod for years. Rewrite it as a near-mechanical port of SphinxCopier, with Manticore-specific deviations only where Manticore's semantics require them.

What's now matched with SphinxCopier:

  • work() dispatch structure, arg shape (jitemid, jtalkid, jitemids, jtalkids, full recopy), and log messages at INFO
  • sphinx_db()/manticore_db() opens the connection with a SET NAMES 'utf8' and errstr check
  • logcroak() after every query against the cluster DB and the search DB, so failures fail the task loudly and the queue retries
  • Full-recopy entry pass diffs dw1 vs log2 and batch-deletes missing jitemids; does NOT wipe the whole journal up front. Search stays available for the journal during the recopy.
  • Full-recopy comment pass has the "short path" for fewer than CHUNK_SIZE comments (inline) and the "mass-copy" path that dispatches chunks of CHUNK_SIZE jtalkids per sub-task (keeps any single SQS task well under message_timeout_secs)
  • Comment chunk processor: categorize into delete ('D') vs live (with force_private for 'S'/unknown), sub-batch text fetches, batch DELETE the 'D' set at the end
  • After-entries comment discovery (union of jtalkids in dw1 and in talk2 for each touched jitemid), dispatching single-comment copies
  • next-unless-row safety checks after per-entry selectrow_hashref
  • 24h memcache throttle on full recopies

Manticore-specific deviations (called out in a comment at the top):

  • Writes to dw1 via SphinxQL on @LJ::MANTICORE, not items_raw on sphinx_search. No %DBINFO entry; opens raw DBI.
  • No stable doc IDs: Manticore auto-assigns. Upsert is DELETE+INSERT on (journalid, jitemid, jtalkid) instead of REPLACE with preserved id. (Checked the schema: jitemid is MEDIUMINT, jtalkid is INT, so a deterministic-docid packing would require stealing bits from journalid; not safe, not worth the round-trip savings.)
  • Body text stored uncompressed (no COMPRESS()); Manticore tokenizes raw UTF-8.
  • security_bits is an rt_attr_multi literal (1,2,3), not a CSV string column.
  • No touchtime attribute (unused by the read path in RT mode).
  • All integer filter values are interpolated into SphinxQL via sprintf %d; '?' placeholders are reserved for rt_field text columns. Manticore's SphinxQL binds '?' as a string and refuses string filters on uint attributes.

Logging volume:

Per-doc "Inserted post #N" / "Inserting comment #N" messages have been collapsed to one summary per copy_entry/copy_comment call: "Inserted N posts (#min-#max) for user(id)." Same for comments. Deletes keep their existing "Actually deleted N posts." summary.

Force flag:

work() accepts force => 1 in its args to bypass the 24h recopy throttle. search-tool passes this on import-user (CLI invocations are always explicit) and accepts a --force flag on import-all for operator-initiated rebuilds that should ignore recent-copy state. Routine queue-triggered full recopies leave force unset so the throttle continues to protect against stampedes.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
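
Illustrative sketch (not part of the commit): the DELETE+INSERT upsert and the sprintf %d rule from the Manticore-specific deviations could look roughly like the following. This is Python rather than the module's Perl, and the function name build_upsert and the title/body column names are assumptions made for the example. Only dw1, journalid, jitemid, jtalkid and security_bits come from the log message.

```python
def build_upsert(index, journalid, jitemid, jtalkid, title, body, security_bits):
    """Return (delete_sql, insert_sql, insert_params) for one document.

    Integer values are formatted with %d so they reach the server as numeric
    literals; only the text fields are left as '?' placeholders.
    """
    # int() raises ValueError on anything non-numeric, so nothing but digits
    # can be interpolated into the statement.
    jid, iid, tid = int(journalid), int(jitemid), int(jtalkid)

    delete_sql = (
        "DELETE FROM %s WHERE journalid = %d AND jitemid = %d AND jtalkid = %d"
        % (index, jid, iid, tid)
    )

    # Multi-value attribute written as a literal such as (1,2,3).
    mva = "(%s)" % ",".join("%d" % int(b) for b in security_bits)

    # 'title' and 'body' are hypothetical text column names.
    insert_sql = (
        "INSERT INTO %s (journalid, jitemid, jtalkid, security_bits, title, body) "
        "VALUES (%d, %d, %d, %s, ?, ?)" % (index, jid, iid, tid, mva)
    )
    return delete_sql, insert_sql, (title, body)
```

A caller would run the DELETE first and then the INSERT with the two text parameters bound, which is the two-round-trip cost the log message weighs against deterministic doc IDs.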
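
Illustrative sketch (not part of the commit): the "short path" versus "mass-copy" decision for the full-recopy comment pass can be pictured as below. The function name plan_comment_copy and the return shape are hypothetical, and the real CHUNK_SIZE value is not stated in the log message, so it is passed in as a parameter.

```python
def plan_comment_copy(jtalkids, chunk_size):
    """Decide how a full-recopy comment pass would be split up.

    Returns ("inline", [ids]) when there are fewer than chunk_size comments,
    otherwise ("dispatch", [chunk, chunk, ...]) with at most chunk_size
    jtalkids per chunk, one chunk per sub-task.
    """
    ids = sorted(jtalkids)
    if len(ids) < chunk_size:
        # Short path: small enough to copy within the current task.
        return ("inline", [ids] if ids else [])
    # Mass-copy path: bounded chunks keep each sub-task short.
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    return ("dispatch", chunks)
```

Bounding each chunk is what keeps a single queue task well under the message timeout, whatever the size of the journal.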
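
Illustrative sketch (not part of the commit): the full-recopy entry pass diffs the index against log2 and batch-deletes only the missing jitemids. A minimal version of that idea follows. The helper names, the batch size and the exact DELETE shape are assumptions, and the log message does not say whether comment rows for those entries are removed by the same statement.

```python
def stale_jitemids(indexed_ids, live_ids):
    """Ids present in the search index but no longer present in the source table."""
    return sorted(set(indexed_ids) - set(live_ids))


def delete_batches(index, journalid, jitemids, batch_size=100):
    """Build one DELETE statement per batch of stale ids (hypothetical shape)."""
    statements = []
    for start in range(0, len(jitemids), batch_size):
        batch = jitemids[start:start + batch_size]
        id_list = ",".join("%d" % int(j) for j in batch)
        statements.append(
            "DELETE FROM %s WHERE journalid = %d AND jitemid IN (%s)"
            % (index, int(journalid), id_list)
        )
    return statements
```

Because only the difference is deleted, documents that still exist stay searchable while the recopy rewrites them.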
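
Illustrative sketch (not part of the commit): the 24h throttle with a force override can be modelled as below. A plain dict stands in for memcache, and the key format and function name are hypothetical. The sketch only shows the intended behaviour: routine recopies are skipped inside the window, forced ones always run.

```python
import time

THROTTLE_SECS = 24 * 60 * 60  # 24h window, per the log message


def should_run_full_recopy(cache, userid, force=False, now=None):
    """Return True if a full recopy should proceed, and record the attempt.

    `cache` is any dict-like store standing in for memcache; it maps a
    per-user key to the time at which the throttle expires.
    """
    now = time.time() if now is None else now
    key = "search_recopy:%d" % userid  # hypothetical key format
    expires = cache.get(key)
    if not force and expires is not None and expires > now:
        return False  # copied recently; skip to avoid a stampede
    cache[key] = now + THROTTLE_SECS
    return True
```

Operator-initiated runs would pass force=True, while queue-triggered recopies would leave it unset so the window still applies.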