Claude/analyze refactoring changes ov o fq #97
Merged
bernardladenthin merged 10 commits into master on Apr 26, 2026
Delete the 96-line local copy of format_infill (utils.hpp:83-178). The upstream server-common.h:356 exports format_prompt_infill with an identical signature and byte-for-byte identical implementation. Call site in handleInfill (jllama.cpp) is a pure rename — no logic change. Future infill improvements in llama.cpp land automatically on upgrade. https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
Delete the 19-line local format_rerank from utils.hpp. Upstream server-common.h:368 exports format_prompt_rerank with the same [BOS]q[EOS][SEP]doc[EOS] fallback logic plus support for models that ship a dedicated rerank chat template (llama_model_chat_template). handleRerank simplified:
- Remove the tokenize_mixed + tokenize_input_prompts pre-processing step (upstream format_prompt_rerank calls tokenize_input_subprompt internally).
- Pass the raw query and document strings directly.
- task.tokens receives server_tokens from the upstream function; the manual server_tokens(..., false) wrap is gone.
Also prune two now-unused includes from utils.hpp (build-info.h and mtmd-helper.h were only needed by the deleted format_infill/format_rerank).
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…batch loops
Both handleRerank and handleEmbeddings build a vector of tasks with the same id/tokens/index/res_type assignment pattern. Extract it into a named helper and reduce each loop body to a single call. Makes the shared batch-dispatch shape visually identical across both handlers and keeps task construction in one testable place.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
Replace direct llama_vocab_fim_pre/suf/mid calls with the equivalent fields on server_context_meta (fim_pre_token, fim_suf_token, fim_mid_token). Upstream populates these from the exact same llama_vocab_fim_* calls in server-context.cpp:3120, so behaviour is unchanged. Brings the FIM check into the same idiom every other read of model state in handleInfill already uses (the meta.slot_n_ctx access right below). No size change; a consistency win.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…ACTORING_2.md
configureParallelInference was a documented validate-and-no-op stub. This commit makes it actually reconfigure the live llama_context for the two fields that already have a public upstream setter:
- n_threads → llama_set_n_threads(ctx, n, nb)
- n_threads_batch → same call, second argument
Single-field updates fill the missing value from the cached common_params captured at load time, and the cache is written back so follow-up partial updates read the just-applied value rather than the original.
slot_prompt_similarity remains validated (so the existing out-of-range exception contract still holds) but is not yet applied — the field is private inside server_context_impl and upstream b8913 exposes no setter. The future call site is reserved as a comment; once the proposed upstream patch (tracked in llama-cpp.patch.md) lands and the pin advances past b8913, a follow-up commit uncomments the block.
Also fixes a build regression introduced by commit e2c6d04 (which removed the build-info.h include from utils.hpp): jllama.cpp:668 calls llama_build_info() and now includes "build-info.h" directly instead of relying on the transitive include.
Adds REFACTORING_2.md documenting the five commits in this branch and the remaining gap.
C++ tests: 413 passing.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
Companion document to REFACTORING_2.md. Holds the proposed unified diff against ggerganov/llama.cpp tag b8913 that adds a public get_slot_prompt_similarity() / set_slot_prompt_similarity() pair on server_context, so embedders can mutate the slot-selection threshold at runtime instead of only at load_model() time. Tracked artefact, not a build input. Referenced from REFACTORING_2.md and from the reserved comment block in jllama.cpp::configureParallelInference. Once the PR merges and the pin advances past b8913, a tiny follow-up commit uncomments the block, completing the third configurable field.
Includes:
- rationale paragraph (PR description)
- two-hunk unified diff (server-context.h + server-context.cpp)
- manual-verification recipe with a self-contained C++ driver
- land sequence covering pin bump and follow-up wire-up
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…entinel
Fixes the CI failure on Ubuntu and macOS:
RerankingModelTest.testRerankScoreRange:129 Score must be >= 0: -0.0013...
The previous assertion required score in [0, 1] under the wrong mental model that rerank scores are probabilities. They are not — upstream returns the raw classification-head logit embd[0] (server-context.cpp send_rerank()) with no sigmoid applied. Negative logits are valid for weakly-matched (query, document) pairs.
The CI break was triggered by commit e2c6d04 switching handleRerank from a doubled-BOS/EOS token sequence (produced by the deleted format_rerank + tokenize_input_prompts(..., add_special=true) chain) to upstream's canonical [BOS?] q [EOS?] [SEP?] doc [EOS?] format. The new format is correct; the old test bound was overly strict.
Two changes to RerankingModelTest:
1. testRerankScoreRange — drop the [0, 1] bound; keep NaN/Inf checks; add a |score| < 10 magnitude bound that catches token corruption without making any probability claim.
2. testRerankSpreadAndSign_canonicalFormatSentinel (new) — a regression sentinel that asserts the canonical format produces a wide logit spread (max - min > 0.3) and that document sign tracks relevance (ML doc > 0; Paris doc < machine doc). A regression to the doubled-token format would compress scores into a narrow positive band and trip this test.
Also clarifies the "Behavioural delta" paragraph in REFACTORING_2.md's Commit-2 entry to explain the doubled-BOS/EOS history, and adds a Commit-6 entry summarising this change.
No production code modified.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
CI on Ubuntu and macOS reported:
Score spread implausibly small (0.19975334) — possible regression to doubled-token format
The 0.3 threshold was an estimate; empirically the Jina-Reranker-v1-tiny model produces a canonical-format spread of ~0.20 across the four test documents (measured 0.19975 on Ubuntu, 0.19972 on macOS — bit-identical modulo quantisation rounding).
Lower the threshold to 0.1. Comfortably below the observed ~0.20 (no sensitivity to per-platform rounding), well above any plausible noise-floor cluster, and still meaningful as "scores aren't all collapsed onto one point".
The two other sentinel assertions (mlScore > 0; parisScore < machineScore) already passed — sign and ordering are correct. Only the absolute spread bound was too optimistic.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
… threshold)
Stale claims fixed:
- "Done — five commits" → "seven commits" (Commit 6 added in 718fbd1, Commit 7 added in 3b6c47b).
- Commit 6 bullet: "max − min > 0.3" → describes the initial 0.3 guess plus the calibration to 0.1 in Commit 7.
- Tests section: "unchanged across all five commits" → "seven commits".
- Tests section: removed the wrong claim that Commit 2 produces "bit-identical output" for models without a rerank template — the canonical [BOS?] q [EOS?] [SEP?] doc [EOS?] sequence is deliberately NOT bit-identical to the doubled-BOS/EOS chain; that difference is exactly why testRerankScoreRange and the new sentinel were needed.
New section:
- Commit 7 — describes the empirical calibration of the sentinel spread threshold from 0.3 to 0.1, with the measured Ubuntu/macOS values (0.19975 / 0.19972).
No code change.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
…ecord
The two-file split made sense while Pass 2 was in flight; now that all
seven Pass-2 commits have landed it reads more naturally as one
continuous record. Restructured REFACTORING.md:
- Header/Why/Baseline preserved.
- "Current state" updated to the post-Pass-2 numbers (1,725 lines
total; jllama.cpp 1,259; utils.hpp 74).
- "Phase log" renamed to "Pass 1 — server.hpp removal" so the existing
Phase 0-7 entries sit under a parent heading.
- New "Pass 2 — Second-pass deduplication at b8913" section absorbs
everything from REFACTORING_2.md: the why, the seven commits in a
single table, the Pass-2 reduction table, the
investigated-and-kept list, and the forward references to
llama-cpp.patch.md.
- "Code reduction achieved" replaced with a combined Pass-1 + Pass-2
table showing both intermediate and final line counts.
- "Known acceptable gap" updated to reflect the partial closure
(n_threads / n_threads_batch live; slot_prompt_similarity parked
behind the upstream patch).
Net change: REFACTORING_2.md (270 lines) deleted; REFACTORING.md grew
from 326 to 423 lines (the rest is dropped chrome / repeated
introductions).
No code change.
https://claude.ai/code/session_01PFjt82pRM3FDpPFpsNXEeR
No description provided.