## Summary
- Problem: `normalizeDirectiveWhitespace` applied whitespace-collapsing regexes globally, including inside fenced code blocks (` ``` ` / `~~~`) and indented code blocks (4-space / tab), corrupting indentation in assistant replies that contain code snippets
- Why it matters: Any language where indentation is significant (Python, Go, YAML, etc.) or visually meaningful would render incorrectly after stripping inline directive tags
- What changed: Stash code blocks under a Unicode private-use sentinel (`\uE000`) before normalization, run the existing prose regexes on the masked text, then restore the original blocks verbatim
- What did NOT change: All prose normalization rules are retained as-is (`\r\n`, multi-space collapse, leading blank-line strip, trailing whitespace, 3+ newline fold)
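
The stash / normalize / restore flow described above can be sketched as follows. This is a minimal illustration, not the actual `normalizeDirectiveWhitespace` implementation: the function name, the fence regex, and the exact prose rules are assumptions, it only handles fenced blocks (not indented ones), and it assumes LF line endings around fences.

```typescript
// Hypothetical sketch: only the \uE000 sentinel comes from the PR description.
const SENTINEL = "\uE000";

// Matches a fenced block opened and closed by the same 3-char fence.
// Indented (4-space / tab) blocks are omitted here for brevity.
const FENCED_BLOCK = /^(`{3}|~{3})[^\n]*\n[\s\S]*?\n\1[ \t]*$/gm;

function normalizePreservingCode(text: string): string {
  const stash: string[] = [];

  // 1. Replace each fenced block with an indexed placeholder.
  const masked = text.replace(FENCED_BLOCK, (block) => {
    stash.push(block);
    return `${SENTINEL}${stash.length - 1}${SENTINEL}`;
  });

  // 2. Run prose-only whitespace rules on the masked text.
  const normalized = masked
    .replace(/\r\n/g, "\n") // normalize line endings
    .replace(/[ \t]{2,}/g, " ") // collapse runs of spaces/tabs
    .replace(/[ \t]+$/gm, "") // strip trailing whitespace
    .replace(/\n{3,}/g, "\n\n") // fold 3+ newlines
    .replace(/^\n+/, ""); // drop leading blank lines

  // 3. Put the original blocks back verbatim.
  return normalized.replace(
    /\uE000(\d+)\uE000/g,
    (_match, index: string) => stash[Number(index)] ?? "",
  );
}
```

Because the placeholder contains no spaces, tabs, or newlines, none of the prose rules in step 2 can alter it, which is what lets step 3 restore each block byte for byte.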
## Change Type
- [x] Bug fix
## Scope
- [ ] Gateway / orchestration
## Root Cause
- Root cause: Prose whitespace regexes were applied to the full text string with no awareness of Markdown code block boundaries
- Missing detection / guardrail: No tests covered indented content inside fenced blocks
- Contributing context: Directive tag stripping (`[[reply_to_current]]`, `[[audio_as_voice]]`) is applied before delivery, making the normalization step a silent corruption point for code-heavy replies
## Regression Test Plan
- Coverage level that should have caught this:
- [x] Unit test
- Target test or file: `src/utils/directive-tags.test.ts`
- Scenario the test should lock in: `parseInlineDirectives` with fenced/indented code blocks must preserve all leading whitespace inside those blocks
- Why this is the smallest reliable guardrail: Pure function with deterministic string in/out; no mocks needed
- If no new test is added, why not: N/A (7 new unit tests added)
## User-visible / Behavior Changes
Code blocks in assistant replies containing `[[reply_to_current]]` or `[[audio_as_voice]]` directives now retain correct indentation after the directive is stripped.
## Security Impact
- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No
## Compatibility / Migration
- Backward compatible? Yes
- Config/env changes? No
- Migration needed? No
Co-Authored-By: Codemax <codemax@binance.com>
- Two-pass line splitting: first slice at maxChars (unchanged for Latin),
then re-split only CJK-heavy segments at chunking.tokens. This preserves
the original ~800-char segments for ASCII lines while keeping CJK chunks
within the token budget.
- Narrow surrogate-pair adjustment to CJK Extension B+ range (D840–D87E)
only, so emoji surrogate pairs are not affected. Mixed CJK+emoji text
is now handled consistently regardless of composition.
- Add tests: emoji handling (2), Latin backward-compat long-line (1).
Addresses Codex P1 (oversized CJK segments) and P2s (Latin over-splitting,
emoji surrogate inconsistency).
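
The two-pass split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/memory/internal.ts`: the function names, the "more than half the code units are CJK" heuristic, and the BMP-only range check are all hypothetical, and the surrogate-pair boundary adjustment is left out.

```typescript
// Hypothetical heuristic: BMP ranges for kana, CJK Unified Ideographs, Hangul.
function isCjkUnit(code: number): boolean {
  return (
    (code >= 0x3040 && code <= 0x30ff) ||
    (code >= 0x4e00 && code <= 0x9fff) ||
    (code >= 0xac00 && code <= 0xd7af)
  );
}

// Assumption: a segment is "CJK-heavy" when over half its code units are CJK.
function isCjkHeavy(segment: string): boolean {
  let cjk = 0;
  for (let i = 0; i < segment.length; i++) {
    if (isCjkUnit(segment.charCodeAt(i))) cjk++;
  }
  return cjk > segment.length / 2;
}

function splitLongLine(line: string, maxChars: number, maxTokens: number): string[] {
  const out: string[] = [];
  // Pass 1: slice at maxChars, which leaves Latin segments unchanged.
  for (let i = 0; i < line.length; i += maxChars) {
    const segment = line.slice(i, i + maxChars);
    if (!isCjkHeavy(segment)) {
      out.push(segment);
      continue;
    }
    // Pass 2: re-split only CJK-heavy segments at the token budget.
    for (let j = 0; j < segment.length; j += maxTokens) {
      out.push(segment.slice(j, j + maxTokens));
    }
  }
  return out;
}
```

With `maxChars = 800` and `maxTokens = 200`, a 1000-character ASCII line yields segments of 800 and 200 characters, while a 1000-character CJK line yields five 200-character segments.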
- Use code-point length instead of UTF-16 length in estimateStringChars()
so that CJK Extension B+ surrogate pairs (U+20000+) are counted as 1
character, not 2 (fixes ~25% overestimate for rare characters).
- Change long-line split step from maxChars to chunking.tokens so that
CJK lines are sliced into token-budget-sized segments instead of
char-budget-sized segments that produce ~4x oversized chunks.
- Add tests for both fixes: surrogate-pair handling and long CJK line
splitting.
Addresses review feedback from Greptile and Codex bots.
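
To illustrate the code-point versus UTF-16 length difference mentioned above, here is a small sketch. The helper name `codePointLength` is hypothetical; the real `estimateStringChars()` may apply additional weighting.

```typescript
// Counts Unicode code points by treating a valid surrogate pair as one unit.
function codePointLength(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms a single code point.
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < text.length) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) i++;
    }
    count++;
  }
  return count;
}

// U+20000 (CJK Extension B) is stored as the surrogate pair D840 DC00.
const rareChar = "\uD840\uDC00";
const utf16Length = rareChar.length; // 2 code units
const codePoints = codePointLength(rareChar); // 1 code point
```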
The QMD memory system uses a fixed 4:1 chars-to-tokens ratio for chunk
sizing, which severely underestimates the token count of CJK
(Chinese/Japanese/Korean) text, where each character is roughly 1 token.
This produces oversized chunks for CJK users, degrading vector search
quality and wasting context window space.
Changes:
- Add shared src/utils/cjk-chars.ts module with CJK-aware character
counting (estimateStringChars) and token estimation helpers
- Update chunkMarkdown() in src/memory/internal.ts to use weighted
character lengths for chunk boundary decisions and overlap calculation
- Replace hardcoded estimateTokensFromChars in the context report
command with the shared utility
- Add 13 unit tests for the CJK estimation module and 5 new tests for
CJK-aware memory chunking behavior
Backward compatible: pure ASCII/Latin text behavior is unchanged.
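
As a rough illustration of the weighting idea (not the actual contents of `src/utils/cjk-chars.ts`), the sketch below counts CJK characters as about 1 token each and everything else at the existing 4:1 ratio. The function names and the BMP-only range check are assumptions.

```typescript
const CHARS_PER_TOKEN = 4; // the fixed ratio used for non-CJK text

// Hypothetical range check: kana, CJK Unified Ideographs, Hangul syllables.
function isCjkCodeUnit(code: number): boolean {
  return (
    (code >= 0x3040 && code <= 0x30ff) ||
    (code >= 0x4e00 && code <= 0x9fff) ||
    (code >= 0xac00 && code <= 0xd7af)
  );
}

function estimateTokens(text: string): number {
  let cjk = 0;
  let other = 0;
  for (let i = 0; i < text.length; i++) {
    if (isCjkCodeUnit(text.charCodeAt(i))) cjk++;
    else other++;
  }
  // CJK: roughly 1 token per character; other text: 4 characters per token.
  return cjk + Math.ceil(other / CHARS_PER_TOKEN);
}
```

Under this sketch, pure ASCII input gives the same estimate as the fixed 4:1 ratio, which matches the backward-compatibility claim above.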
Closes #39965
Related: #40216
* fix: preserve indentation when stripping reply directives
* fix: preserve word boundaries when stripping reply directives
* fix: drop separator space after leading reply directives
* fix: preserve indentation when stripping reply directives (#55960) (thanks @Nanako0129)
---------
Co-authored-by: Ayaan Zaidi <hi@obviy.us>