<p>Branch: refs/heads/main
Home: https://github.com/dreamwidth/dreamwidth
Commit: b8e8ded3b31d3871f41a68cf0dc160db5ce18d94
https://github.com/dreamwidth/dreamwidth/commit/b8e8ded3b31d3871f41a68cf0dc160db5ce18d94
Author: Mark Smith <a href="mailto:mark@dreamwidth.org">mark@dreamwidth.org</a>
Date: 2026-04-23 (Thu, 23 Apr 2026)</p>
<p>Changed paths:
M bin/search-tool
M cgi-bin/DW/Task/SearchCopier.pm</p>
<p>Log Message:</p>
<hr />
<p>SearchCopier: rewrite as direct port of SphinxCopier patterns</p>
<p>The prior SearchCopier took its own shape — bulk selectall_arrayref,
ad-hoc chunking, per-doc log lines, wholesale DELETE-then-rebuild per
journal — and missed practices SphinxCopier has been using in prod for
years. Rewrite it as a near-mechanical port of SphinxCopier, with
Manticore-specific deviations only where Manticore's semantics require
them.</p>
<p>What's now matched with SphinxCopier:</p>
<ul>
<li>work() dispatch structure, arg shape (jitemid, jtalkid, jitemids,
jtalkids, full recopy), and log messages at INFO</li>
<li>manticore_db(), like sphinx_db(), opens the connection, issues SET NAMES
'utf8', and checks errstr</li>
<li>logcroak() after every query against the cluster DB and the search
DB, so failures fail the task loudly and the queue retries</li>
<li>Full-recopy entry pass diffs dw1 vs log2 and batch-deletes missing
jitemids; does NOT wipe the whole journal up front. Search stays
available for the journal during the recopy.</li>
<li>Full-recopy comment pass has the "short path" for journals with
fewer than CHUNK_SIZE comments (copied inline) and the "mass-copy"
path that dispatches chunks of CHUNK_SIZE jtalkids per sub-task
(keeps any single SQS task well under message_timeout_secs)</li>
<li>Comment chunk processor: categorize into delete ('D') vs live (with
force_private for 'S'/unknown), sub-batch text fetches, batch
DELETE the 'D' set at the end</li>
<li>After-entries comment discovery (union of jtalkids in dw1 and in
talk2 for each touched jitemid), dispatching single-comment copies</li>
<li>next-unless-row safety checks after per-entry selectrow_hashref</li>
<li>24h memcache throttle on full recopies</li>
</ul>
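<p>As a rough illustration of two items in the list above (the entry diff and the short-path versus mass-copy split), here is a minimal Python sketch. It is not the Perl in SearchCopier.pm: the function names, the CHUNK_SIZE value, and the return shapes are all assumptions made for the example.</p>

```python
CHUNK_SIZE = 1000  # assumed value; the commit message does not state the real constant


def stale_jitemids(indexed, live):
    """Return jitemids that are in the search index but no longer in the journal.

    These are the ids a full recopy would batch-delete, instead of wiping
    the whole journal before rebuilding it.
    """
    return sorted(set(indexed) - set(live))


def plan_comment_copy(jtalkids, chunk_size=CHUNK_SIZE):
    """Choose between the inline short path and chunked dispatch.

    Returns ("inline", ids) when there are fewer than chunk_size comments,
    otherwise ("dispatch", chunks) with at most chunk_size ids per chunk.
    """
    ids = sorted(jtalkids)
    if len(ids) < chunk_size:
        return ("inline", ids)
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    return ("dispatch", chunks)
```

<p>In this sketch each chunk would become one sub-task, which is what bounds the work done by any single queue message.</p>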
<p>Manticore-specific deviations (called out in a comment at the top):</p>
<ul>
<li>Writes to dw1 via SphinxQL on @LJ::MANTICORE, not items_raw on
sphinx_search. No %DBINFO entry — opens raw DBI.</li>
<li>No stable doc IDs: Manticore auto-assigns. Upsert is DELETE+INSERT
on (journalid, jitemid, jtalkid) instead of REPLACE with preserved
id. (Checked the schema: jitemid is MEDIUMINT, jtalkid is INT, so
a deterministic-docid packing would require stealing bits from
journalid — not safe, not worth the round-trip savings.)</li>
<li>Body text stored uncompressed (no COMPRESS()); Manticore tokenizes
raw UTF-8.</li>
<li>security_bits is an rt_attr_multi literal (1,2,3), not a CSV
string column.</li>
<li>No touchtime attribute (unused by the read path in RT mode).</li>
<li>All integer filter values are interpolated into SphinxQL via
sprintf %d; '?' placeholders are reserved for rt_field text
columns. Manticore's SphinxQL binds '?' as a string and refuses
string filters on uint attributes.</li>
</ul>
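<p>The DELETE+INSERT upsert and the integer-interpolation rule can be sketched roughly as follows. This is an illustrative Python version, not the module's Perl. The index name dw1 and the attributes journalid, jitemid, jtalkid, and security_bits come from the notes above; the text column names <code>title</code> and <code>body</code> and the helper name are hypothetical.</p>

```python
def upsert_statements(journalid, jitemid, jtalkid, security_bits):
    """Build the DELETE and INSERT statements for one document.

    Integer values are formatted with %d, so a non-numeric value raises
    instead of reaching the query. Only the text fields keep '?'
    placeholders. The `title` and `body` column names are hypothetical.
    """
    delete_sql = (
        "DELETE FROM dw1 WHERE journalid=%d AND jitemid=%d AND jtalkid=%d"
        % (journalid, jitemid, jtalkid)
    )
    # Multi-value attribute written as a literal such as (1,2,3).
    mva = "(%s)" % ",".join("%d" % bit for bit in security_bits)
    insert_sql = (
        "INSERT INTO dw1 (journalid, jitemid, jtalkid, security_bits, title, body) "
        "VALUES (%d, %d, %d, %s, ?, ?)" % (journalid, jitemid, jtalkid, mva)
    )
    return delete_sql, insert_sql
```

<p>The caller would run the DELETE first and then execute the INSERT with the two text values bound to the placeholders. Whether an entry row carries jtalkid 0 is an assumption of the example, not something the commit message states.</p>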
<p>Logging volume:</p>
<p>Per-doc "Inserted post #N" / "Inserting comment #N" messages have
been collapsed to one summary per copy_entry/copy_comment call:
"Inserted N posts (#min-#max) for user(id)." Same for comments.
Deletes keep their existing "Actually deleted N posts." summary.</p>
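<p>A minimal Python sketch of the collapsed summary line, assuming the ids inserted in one call are collected in a list. The function and parameter names are hypothetical; only the message format is taken from the text above.</p>

```python
def insert_summary(kind, ids, user, userid):
    """Format one summary line for a batch of inserted ids.

    Returns None for an empty batch, so the caller can skip logging.
    """
    if not ids:
        return None
    ordered = sorted(ids)
    return "Inserted %d %s (#%d-#%d) for %s(%d)." % (
        len(ordered), kind, ordered[0], ordered[-1], user, userid
    )
```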
<p>Force flag:</p>
<p>work() accepts <code>force => 1</code> in its args to bypass the 24h recopy
throttle. search-tool passes this on <code>import-user</code> (CLI invocations
are always explicit) and accepts a <code>--force</code> flag on <code>import-all</code>
for operator-initiated rebuilds that should ignore recent-copy
state. Routine queue-triggered full recopies leave force unset so
the throttle continues to protect against stampedes.</p>
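<p>The interaction between the throttle and the force flag might look like the following Python sketch. A plain dict stands in for memcache, and the key name, the function name, and the choice to record a timestamp on forced runs are assumptions; a real memcache-backed version would more likely rely on key expiry than on stored timestamps.</p>

```python
import time

THROTTLE_SECS = 24 * 60 * 60  # the 24h window described above


def should_full_recopy(cache, userid, force=False, now=None):
    """Decide whether a full recopy may run, recording the time when it does.

    `cache` is any dict-like store; the key name is hypothetical.
    """
    now = time.time() if now is None else now
    key = "search_recopy:%d" % userid
    last = cache.get(key)
    if not force and last is not None and now - last < THROTTLE_SECS:
        return False  # a recent recopy exists and the caller did not force
    cache[key] = now
    return True
```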
<p>Co-Authored-By: Claude Opus 4.7 (1M context) <a href="mailto:noreply@anthropic.com">noreply@anthropic.com</a></p>