Claude/analyze refactoring changes ov o fq #97
Merged
bernardladenthin merged 10 commits into master on Apr 26, 2026
Delete the 96-line local copy of format_infill (utils.hpp:83-178). The upstream server-common.h:356 exports format_prompt_infill with an identical signature and byte-for-byte identical implementation. Call site in handleInfill (jllama.cpp) is a pure rename — no logic change. Future infill improvements in llama.cpp land automatically on upgrade. https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
Delete the 19-line local format_rerank from utils.hpp. Upstream server-common.h:368 exports format_prompt_rerank with the same [BOS]q[EOS][SEP]doc[EOS] fallback logic plus support for models that ship a dedicated rerank chat template (llama_model_chat_template). handleRerank simplified:
- Remove the tokenize_mixed + tokenize_input_prompts pre-processing step (upstream format_prompt_rerank calls tokenize_input_subprompt internally).
- Pass the raw query and document strings directly.
- task.tokens receives server_tokens from the upstream function; the manual server_tokens(..., false) wrap is gone.
Also prune two now-unused includes from utils.hpp (build-info.h and mtmd-helper.h were only needed by the deleted format_infill/format_rerank).
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…batch loops
Both handleRerank and handleEmbeddings build a vector of tasks with the same id/tokens/index/res_type assignment pattern. Extract it into a named helper and reduce each loop body to a single call. Makes the shared batch-dispatch shape visually identical across both handlers and keeps task construction in one testable place.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
Replace direct llama_vocab_fim_pre/suf/mid calls with the equivalent fields on server_context_meta (fim_pre_token, fim_suf_token, fim_mid_token). Upstream populates these from the exact same llama_vocab_fim_* calls in server-context.cpp:3120, so behaviour is unchanged. Brings the FIM check into the same idiom every other read of model state in handleInfill already uses (the meta.slot_n_ctx access right below). No size change; a consistency win.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…ACTORING_2.md
configureParallelInference was a documented validate-and-no-op stub. This commit makes it actually reconfigure the live llama_context for the two fields that already have a public upstream setter:
- n_threads → llama_set_n_threads(ctx, n, nb)
- n_threads_batch → same call, second argument
Single-field updates fill the missing value from the cached common_params captured at load time, and the cache is written back so follow-up partial updates read the just-applied value rather than the original.
slot_prompt_similarity remains validated (so the existing out-of-range exception contract still holds) but is not yet applied — the field is private inside server_context_impl and upstream b8913 exposes no setter. The future call site is reserved as a comment; once the proposed upstream patch (tracked in llama-cpp.patch.md) lands and the pin advances past b8913, a follow-up commit uncomments the block.
Also fixes a build regression introduced by commit e2c6d04 (which removed the build-info.h include from utils.hpp): jllama.cpp:668 calls llama_build_info() and now includes "build-info.h" directly instead of relying on the transitive include.
Adds REFACTORING_2.md documenting the five commits in this branch and the remaining gap.
C++ tests: 413 passing.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
Companion document to REFACTORING_2.md. Holds the proposed unified diff against ggerganov/llama.cpp tag b8913 that adds a public get_slot_prompt_similarity() / set_slot_prompt_similarity() pair on server_context, so embedders can mutate the slot-selection threshold at runtime instead of only at load_model() time. Tracked artefact, not a build input. Referenced from REFACTORING_2.md and from the reserved comment block in jllama.cpp::configureParallelInference. Once the PR merges and the pin advances past b8913, a tiny follow-up commit uncomments the block, completing the third configurable field.
Includes:
- rationale paragraph (PR description)
- two-hunk unified diff (server-context.h + server-context.cpp)
- manual-verification recipe with a self-contained C++ driver
- land sequence covering pin bump and follow-up wire-up
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…entinel
Fixes the CI failure on Ubuntu and macOS:
RerankingModelTest.testRerankScoreRange:129 Score must be >= 0: -0.0013...
The previous assertion required score in [0, 1] under the wrong mental model that rerank scores are probabilities. They are not — upstream returns the raw classification-head logit embd[0] (server-context.cpp send_rerank()) with no sigmoid applied. Negative logits are valid for weakly-matched (query, document) pairs.
The CI break was triggered by commit e2c6d04 switching handleRerank from a doubled-BOS/EOS token sequence (produced by the deleted format_rerank + tokenize_input_prompts(..., add_special=true) chain) to upstream's canonical [BOS?] q [EOS?] [SEP?] doc [EOS?] format. The new format is correct; the old test bound was overly strict.
Two changes to RerankingModelTest:
1. testRerankScoreRange — drop the [0, 1] bound; keep NaN/Inf checks; add a |score| < 10 magnitude bound that catches token corruption without making any probability claim.
2. testRerankSpreadAndSign_canonicalFormatSentinel (new) — a regression sentinel that asserts the canonical format produces a wide logit spread (max - min > 0.3) and that document sign tracks relevance (ML doc > 0; Paris doc < machine doc). A regression to the doubled-token format would compress scores into a narrow positive band and trip this test.
Also clarifies the "Behavioural delta" paragraph in REFACTORING_2.md's Commit-2 entry to explain the doubled-BOS/EOS history, and adds a Commit-6 entry summarising this change.
No production code modified.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
CI on Ubuntu and macOS reported:
Score spread implausibly small (0.19975334) — possible regression to doubled-token format
The 0.3 threshold was an estimate; empirically the Jina-Reranker-v1-tiny model produces a canonical-format spread of ~0.20 across the four test documents (measured 0.19975 on Ubuntu, 0.19972 on macOS — bit-identical modulo quantisation rounding).
Lower the threshold to 0.1. Comfortably below the observed ~0.20 (no sensitivity to per-platform rounding), well above any plausible noise-floor cluster, and still meaningful as "scores aren't all collapsed onto one point".
The two other sentinel assertions (mlScore > 0; parisScore < machineScore) already passed — sign and ordering are correct. Only the absolute spread bound was too optimistic.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
… threshold)
Stale claims fixed:
- "Done — five commits" → "seven commits" (Commit 6 added in 718fbd1, Commit 7 added in 3b6c47b).
- Commit 6 bullet: "max − min > 0.3" → describes the initial 0.3 guess plus the calibration to 0.1 in Commit 7.
- Tests section: "unchanged across all five commits" → "seven commits".
- Tests section: removed the wrong claim that Commit 2 produces "bit-identical output" for models without a rerank template — the canonical [BOS?] q [EOS?] [SEP?] doc [EOS?] sequence is deliberately NOT bit-identical to the doubled-BOS/EOS chain; that difference is exactly why testRerankScoreRange and the new sentinel were needed.
New section:
- Commit 7 — describes the empirical calibration of the sentinel spread threshold from 0.3 to 0.1, with the measured Ubuntu/macOS values (0.19975 / 0.19972).
No code change.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…ecord
The two-file split made sense while Pass 2 was in flight; now that all
seven Pass-2 commits have landed it reads more naturally as one
continuous record. Restructured REFACTORING.md:
- Header/Why/Baseline preserved.
- "Current state" updated to the post-Pass-2 numbers (1,725 lines
total; jllama.cpp 1,259; utils.hpp 74).
- "Phase log" renamed to "Pass 1 — server.hpp removal" so the existing
Phase 0-7 entries sit under a parent heading.
- New "Pass 2 — Second-pass deduplication at b8913" section absorbs
everything from REFACTORING_2.md: the why, the seven commits in a
single table, the Pass-2 reduction table, the
investigated-and-kept list, and the forward references to
llama-cpp.patch.md.
- "Code reduction achieved" replaced with a combined Pass-1 + Pass-2
table showing both intermediate and final line counts.
- "Known acceptable gap" updated to reflect the partial closure
(n_threads / n_threads_batch live; slot_prompt_similarity parked
behind the upstream patch).
Net change: REFACTORING_2.md (270 lines) deleted; REFACTORING.md grew
from 326 to 423 lines (the rest is dropped chrome / repeated
introductions).
No code change.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
No description provided.