
llmPredict built-in for LLM inference #2448

Open

kubraaksux wants to merge 8 commits into apache:main from kubraaksux:llm-api

Conversation


@kubraaksux kubraaksux commented Mar 11, 2026

Contains the llmPredict API implementation (Java pipeline + tests). The full benchmark framework is in #2431. The branch also carries an end-to-end native DML inference prototype under scripts/staging/llm-native/: an HF→DML weight converter, a GPT-2 pre-LN block (scripts/nn/layers/gpt2_layer.dml), a NumPy reference implementation, an inference driver, and a three-way correctness harness (tools/compare_logits.py). DML logits match HuggingFace within float64 round-off at T=5 and T=128. Previously tracked in #2430 (closed due to a branch-history issue).

Register llmPredict through the full SystemDS compilation pipeline
(Builtins, Opcodes, Types, DMLTranslator, HOP, LOP, CP instruction).
LlmPredictCPInstruction sends HTTP POST requests to OpenAI-compatible
servers with configurable concurrency. Includes 10 tests (7 mock, 3 live).
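The Java instruction itself is not shown in this description, but the request body it builds is the standard OpenAI chat-completions shape. A minimal Python sketch of that shape (the function name and default model here are illustrative, not taken from the PR):

```python
import json

# Hypothetical sketch of the request body an OpenAI-compatible server
# expects at /v1/chat/completions; the actual Java
# LlmPredictCPInstruction constructs the equivalent JSON in code.
def build_chat_request(prompt: str, model: str = "gpt-4o-mini") -> bytes:
    """Serialize a single chat-completions request body to UTF-8 JSON."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("Summarize this row.")
print(json.loads(body)["messages"][0]["role"])  # "user"
```

Concurrency in the real instruction amounts to issuing many such POSTs in parallel, one per input cell.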
Introduce forward_internal with optional causal masking: upper.tri builds
the future-position mask, and log(1-mask) turns masked entries into -inf
before the column-wise softmax. Expose a forward_causal wrapper and keep
the forward() signature unchanged for existing callers.

Add DML + JUnit test verifying first-token invariance under causal mode
when future value tokens change.
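In NumPy terms, the log(1-mask) trick works out as in the sketch below (the softmax axis and square score matrix are assumptions for illustration; the DML version applies softmax column-wise over its own layout). The final assertion mirrors the first-token-invariance property the test checks: position 0 can only attend to itself.

```python
import numpy as np

def causal_softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax over attention scores with a causal mask applied."""
    T = scores.shape[0]
    # mask[i, j] = 1 for strictly-future positions j > i
    # (upper triangle above the diagonal, as DML's upper.tri builds it).
    mask = np.triu(np.ones((T, T)), k=1)
    # log(1 - mask) is 0 where attention is allowed and -inf where it
    # is masked, so adding it zeroes out future positions after exp.
    with np.errstate(divide="ignore"):
        masked = scores + np.log(1.0 - mask)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = causal_softmax(np.random.default_rng(0).normal(size=(4, 4)))
# Position 0 attends only to itself, regardless of future scores.
assert np.allclose(p[0], [1.0, 0.0, 0.0, 0.0])
```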
Introduces scripts/staging/llm-native/ for running pretrained HF
transformer models in SystemDS. tools/convert_gpt2.py splits HF's
fused c_attn projection into W_Q/W_K/W_V, upcasts to float64, and
writes one CSV+MTD pair per matrix plus a manifest.json index.
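The c_attn split is a single slice along the output dimension. A sketch with toy shapes (real GPT-2 124M has n_embd = 768; HF stores the fused projection as one (n_embd, 3*n_embd) Conv1D weight):

```python
import numpy as np

# Toy stand-in for HF's fused c_attn Conv1D weight, float32 on disk.
n_embd = 8
rng = np.random.default_rng(1)
c_attn_w = rng.normal(size=(n_embd, 3 * n_embd)).astype(np.float32)

# Split the fused projection into W_Q / W_K / W_V and upcast to
# float64, mirroring what tools/convert_gpt2.py does before writing
# one CSV+MTD pair per matrix.
w_q, w_k, w_v = np.split(c_attn_w.astype(np.float64), 3, axis=1)
assert w_q.shape == (n_embd, n_embd)
```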
Adapted from bert_layer.dml: pre-LN ordering, causal attention via
multi_attention::forward_causal, and no inner final LN. Inference-only;
GELU is hardcoded to gelu_new.
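For reference, gelu_new is HF's tanh approximation of GELU (the variant GPT-2 was trained with); a NumPy sketch of the same formula the DML layer hardcodes:

```python
import numpy as np

def gelu_new(x: np.ndarray) -> np.ndarray:
    # tanh-approximation GELU, as used by GPT-2 ("gelu_new" in HF):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))
```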
A pure-NumPy float64 forward pass over the converter's CSVs, dumping
every intermediate hidden state. The --compare-hf flag cross-checks
against HF; on gpt2 (124M), every per-step max-abs-diff is below 1e-11.
The DML driver reads stacked CSV weights produced by
tools/convert_gpt2.py + tools/pack_weights.py and runs a single forward
pass, writing logits.csv plus per-block dumps. On gpt2 (124M), the worst
max-abs-diff vs the NumPy oracle is 4.5e-13, so DML logits match HF to
~1e-12.

pack_weights.py exists because DML's read() requires constant-string
filenames; stacking the per-layer matrices into one file lets the driver
row-slice inside the loop instead.
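The pack/row-slice idea can be sketched in NumPy (shapes here are toy values, not the real GPT-2 dimensions):

```python
import numpy as np

# Stack L per-layer (d, d) matrices into one (L*d, d) array so a loop
# can slice out layer l by row range, instead of needing a distinct
# filename per layer (DML's read() only accepts constant-string paths).
L, d = 4, 6
per_layer = [np.full((d, d), float(l)) for l in range(L)]
stacked = np.vstack(per_layer)           # written once as a single CSV

for l in range(L):
    w_l = stacked[l * d:(l + 1) * d, :]  # row-slice inside the loop
    assert np.array_equal(w_l, per_layer[l])
```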

Note: SystemDS/Hadoop's FileInputFormat silently skips files whose
names start with '_' or '.', so input files must not use those
prefixes.
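This matches Hadoop FileInputFormat's hidden-file filter, which treats leading '_' and '.' as hidden (e.g. _SUCCESS markers). A trivial pre-write sanity check (the helper name is illustrative, not from the PR):

```python
# Hadoop's FileInputFormat silently skips files whose base name starts
# with '_' or '.', so weight CSVs must avoid those prefixes.
def is_visible_to_hadoop(name: str) -> bool:
    return not name.startswith(("_", "."))

assert is_visible_to_hadoop("gpt2_wte.csv")
assert not is_visible_to_hadoop("_stacked.csv")
```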
compare_logits.py tokenizes a prompt, runs HF in float64, runs the
NumPy oracle, optionally invokes the DML driver via subprocess, and
prints a per-step max-abs-diff table across the three.
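The core of that table is a one-liner over (T, vocab) logit matrices; a sketch under assumed shapes (function name is illustrative):

```python
import numpy as np

def per_step_max_abs_diff(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Max absolute logit difference at each token position t."""
    return np.abs(a - b).max(axis=1)

# Toy stand-ins for two of the three logit sources being compared.
rng = np.random.default_rng(0)
hf_logits = rng.normal(size=(5, 16))
oracle_logits = hf_logits + 1e-12 * rng.normal(size=hf_logits.shape)

for t, d in enumerate(per_step_max_abs_diff(hf_logits, oracle_logits)):
    print(f"step {t}: max|d| = {d:.3e}")
```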

Validation on gpt2 (124M): worst max|d| = 4.09e-12 at T=5 and
7.73e-12 at T=128; the diff stays in the float64 round-off regime
and does not grow with sequence length.