
feat(graphile-llm): wire billing metering, inference logging, and global routes#1192

Merged: pyramation merged 21 commits into main from feat/llm-billing-metering on May 21, 2026

@pyramation pyramation (Contributor) commented May 18, 2026

Summary

Wire billing metering into the graphile-llm plugin system and LLM REST API endpoints. This PR adds:

  1. ModuleConfigCache<T> — generic LRU cache added to graphile-cache for module config lookups. Replaces 3 ad-hoc caching implementations (2 unbounded Maps, 1 raw LRUCache) with a single bounded, TTL-based cache class.

  2. Metering plugin (metering-plugin.ts) — wraps the embedder function with billing quota checks and usage recording. Pre-checks quota before embedding, records actual char-based usage after.

  3. Billing config cache (config-cache.ts) — discovers billing module function names (record_usage, check_billing_quota) at runtime per database, cached via ModuleConfigCache.

  4. Agent discovery (agent-discovery-plugin.ts) — discovers agent tables from agent_chat_module config table at runtime, cached via ModuleConfigCache.

  5. LLM REST API (llm-api.ts) — Express router with entity-scoped (/v1/orgs/:entity_id/threads) and global (/v1/threads, bills to actor_id from JWT) routes. SSE streaming chat with metering, token estimation, and inference logging.

  6. Environment config (env.ts) — single source of truth for LLM defaults (provider, model, baseUrl) via getLlmEnvOptions().

  7. Latency telemetry — every LLM operation (embed, chat, search) records latency_ms in console logs.

Cache standardization

All 3 caches in graphile-llm now use ModuleConfigCache<T> from graphile-cache:

| Cache | TTL | Max entries | Previous implementation |
| --- | --- | --- | --- |
| agent-discovery | 60s | 100 | Unbounded Map + manual expiry |
| billing-config | 5min | 50 | Raw LRUCache |
| inference-log | 60s | 100 | Unbounded Map + manual expiry |
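To make the "bounded, TTL-based" behaviour concrete, here is a minimal sketch of such a cache. It is a hypothetical stand-in named `BoundedTtlCache`, not the actual `ModuleConfigCache<T>` from graphile-cache (which, per the PR, builds on lru-cache); the option names and the injectable clock are assumptions made for illustration.

```typescript
// Hypothetical stand-in for illustration; not the graphile-cache implementation.
interface BoundedTtlCacheOptions {
  name: string; // label for logging/debugging
  max: number; // maximum number of entries kept
  ttlMs: number; // time-to-live per entry, in milliseconds
  now?: () => number; // injectable clock (assumption, useful in tests)
}

interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class BoundedTtlCache<T> {
  readonly name: string;
  private readonly max: number;
  private readonly ttlMs: number;
  private readonly now: () => number;
  // A Map iterates in insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, CacheEntry<T>>();

  constructor(options: BoundedTtlCacheOptions) {
    this.name = options.name;
    this.max = options.max;
    this.ttlMs = options.ttlMs;
    this.now = options.now ?? (() => Date.now());
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop and report a miss
      return undefined;
    }
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Evict least recently used entries until the bound holds again.
    while (this.entries.size > this.max) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Sized like the billing-config row in the table above (5 min TTL, 50 entries).
const billingConfigSketch = new BoundedTtlCache<{ recordUsageFn: string }>({
  name: "billing-config",
  max: 50,
  ttlMs: 5 * 60_000,
});
```

Unlike an unbounded `Map` with manual expiry, memory use here is capped by `max` regardless of how many databases are seen.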

Agent message schema

agent_message stores only model (text) for display. Full usage data (input_tokens, output_tokens, total_tokens, latency_ms, provider, status, etc.) is recorded in inference_log — the canonical source of truth for all LLM usage. See companion PR constructive-db#1275.

Review & Testing Checklist for Human

  • Verify ModuleConfigCache is correctly bounded (check that lru-cache eviction works when the max entry count is exceeded)
  • Verify metering plugin correctly returns null (quota exceeded signal) when check_billing_quota rejects
  • Test the global /v1/threads route bills to actor_id from JWT (no entity_id in URL)
  • Verify SSE streaming works end-to-end with a real Ollama instance

Notes

  • generateWithUsage from @agentic-kit/ollama is NOT used — pending team approval. Token counts use Math.ceil(text.length / 4) estimation.
  • apiKey / api_key_ref support was removed (tracked in constructive-planning#905)
  • Cache standardization tracked in constructive-planning#906

Link to Devin session: https://app.devin.ai/sessions/2b5a29d83d3f478e8d3d972653b4879c
Requested by: @pyramation

@devin-ai-integration (Contributor)

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@devin-ai-integration devin-ai-integration Bot force-pushed the feat/llm-billing-metering branch from 3c6b433 to 3a257e3 on May 20, 2026 03:09
@devin-ai-integration devin-ai-integration Bot changed the title from "feat(graphile-llm): wire billing metering into LLM plugins" to "feat(graphile-llm): billing metering + agent discovery + REST streaming endpoint" on May 20, 2026
@devin-ai-integration devin-ai-integration Bot changed the title from "feat(graphile-llm): billing metering + agent discovery + REST streaming endpoint" to "feat(graphile-llm): billing metering + agent discovery + REST streaming endpoint + env config" on May 20, 2026
pyramation added 12 commits May 20, 2026 22:29
Add per-database billing integration to the graphile-llm package:

- config-cache.ts: LRU cache (5-min TTL, 50 entries) for billing_module
  metadata and API key resolution from app_secrets per database_id
- metering.ts: billing-aware wrappers (meteredEmbed, meteredChat) that
  call check_billing_quota() before and record_usage() after LLM calls
- LlmModulePlugin: exposes metering options on the build context
- LlmTextSearchPlugin: metered embedding with graceful degradation —
  when quota exceeded, skips vector path (text-only search continues)
- LlmTextMutationPlugin: metered embedding that throws QuotaExceededError
  on mutations (can't silently skip writing a vector the user asked for)
- MeteringConfig on GraphileLlmOptions: configurable meter slugs,
  estimated tokens, and skip toggle. Auto-detects billing_module.
- Uses Graphile withPgClient pattern for all billing SQL calls

Billing functions (check_billing_quota, record_usage) are resolved from
the tenant database's billing_module metaschema. When billing is not
provisioned, all calls pass through unmetered.
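The check-before / record-after flow described in this commit could look roughly like the sketch below. The `BillingClient` interface and the function signatures are hypothetical; in the PR the two calls go to the tenant database's `check_billing_quota` and `record_usage` functions. The amounts follow the later commits in this PR (pre-check with `amount=1`, record the character count afterwards), and the sketch omits the "failed billing call falls back to passthrough" handling for brevity.

```typescript
type Embedder = (text: string) => Promise<number[]>;

// Hypothetical billing interface standing in for the SQL functions.
interface BillingClient {
  checkQuota(entityId: string, meter: string, amount: number): Promise<boolean>;
  recordUsage(entityId: string, meter: string, amount: number): Promise<void>;
}

class QuotaExceededError extends Error {
  constructor(meter: string) {
    super(`Quota exceeded for meter "${meter}"`);
    this.name = "QuotaExceededError";
  }
}

// Returns null when quota is exhausted so each caller can choose how to react.
async function meteredEmbed(
  embed: Embedder,
  billing: BillingClient | null,
  entityId: string,
  meter: string,
  text: string
): Promise<number[] | null> {
  // Billing not provisioned: pass through unmetered.
  if (billing === null) return embed(text);

  // Pre-check only asks "is any quota left?".
  const allowed = await billing.checkQuota(entityId, meter, 1);
  if (!allowed) return null;

  const vector = await embed(text);
  // Record actual usage after the call succeeds.
  await billing.recordUsage(entityId, meter, text.length);
  return vector;
}

// Mutations cannot silently skip the vector, so a null result becomes an error.
async function embedForMutation(
  embed: Embedder,
  billing: BillingClient | null,
  entityId: string,
  meter: string,
  text: string
): Promise<number[]> {
  const vector = await meteredEmbed(embed, billing, entityId, meter, text);
  if (vector === null) throw new QuotaExceededError(meter);
  return vector;
}
```

A search caller would instead treat `null` as "skip the vector path" and continue with text-only search.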
…Plugin

- Extract all billing/metering logic into metering-plugin.ts
- LlmModulePlugin, TextSearchPlugin, TextMutationPlugin are now pure
  (no billing imports, no metering context building)
- LlmMeteringPlugin uses AsyncLocalStorage to transparently wrap the
  embedder with quota checks — downstream plugins are unaware of billing
- Entity ID resolved via configurable callback (default: jwt.claims.user_id)
  instead of non-existent jwt.claims.membership_id
- Metering is opt-in: only loaded when metering option is truthy
- Add schema-existence guard in config-cache (checks metaschema_modules_public
  exists before querying billing_module table)
- Graceful degradation: missing schema, missing entity_id, or failed
  billing calls all result in unmetered passthrough
- LlmModulePlugin now exposes llmEmbeddingModel and llmChatModel on build
- LlmMeteringPlugin reads model names from build and uses them as default
  meter slugs (e.g. 'text-embedding-3-small' → billing meters table)
- Three-level waterfall: per-model → inference pool → universal credits
  (handled by billing module's category_meter field)
- Remove hardcoded 'embedding_tokens'/'chat_tokens' defaults
- Add docs/spec/llm-metering.md — full architecture reference for
  two-tier billing, model=meter slug convention, and waterfall
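As a rough illustration of how `AsyncLocalStorage` lets a wrapper stay invisible to downstream plugins, the sketch below stores a per-request entity ID and reads it back inside a wrapped embedder. `AsyncLocalStorage` is the real Node.js API from `node:async_hooks`; everything else (`MeteringContext`, `wrapEmbedder`, `runWithMetering`, the shape of the claims object) is hypothetical and not taken from metering-plugin.ts.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type Embedder = (text: string) => Promise<number[]>;

interface MeteringContext {
  entityId: string | null;
}

const meteringStorage = new AsyncLocalStorage<MeteringContext>();

// Hypothetical default resolver, mirroring "default: jwt.claims.user_id".
function defaultResolveEntityId(claims: Record<string, unknown>): string | null {
  const id = claims["user_id"];
  return typeof id === "string" ? id : null;
}

// The wrapped embedder has the same signature as the original, so callers
// do not need to know that usage is being reported.
function wrapEmbedder(
  embed: Embedder,
  onUsage: (entityId: string, chars: number) => void
): Embedder {
  return async (text: string) => {
    const vector = await embed(text);
    const ctx = meteringStorage.getStore();
    // No context or no entity ID: unmetered passthrough.
    if (ctx !== undefined && ctx.entityId !== null) {
      onUsage(ctx.entityId, text.length);
    }
    return vector;
  };
}

// Runs fn with the metering context visible to everything it awaits.
function runWithMetering<T>(
  claims: Record<string, unknown>,
  fn: () => Promise<T>
): Promise<T> {
  return meteringStorage.run({ entityId: defaultResolveEntityId(claims) }, fn);
}
```

The design choice worth noting is that the context travels with the async call chain, so no billing argument has to be threaded through the search or mutation plugins.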
…user_module

The table was renamed from metaschema_modules_public.encrypted_secrets_module
to metaschema_modules_public.config_secrets_user_module. Also updated the
JOIN column from private_schema_id to schema_id to match the new schema.
…t length

Removed the configurable estimatedEmbeddingTokens option — token counts
are now estimated directly from the input text length (~4 chars/token).
No tokenizer needed since the billing system uses tokens as abstract
units and the credit_cost per model normalizes relative expense.
…point

- Add LlmAgentDiscoveryPlugin: discovers @agentThread/@agentMessage/@agentTask
  tables at PostGraphile schema build time via smart tags and caches results
  for REST middleware consumption
- Add LLM API router (llm-api.ts): Express router providing:
  - POST /orgs/:entity_id/threads — create agent thread (RLS enforced)
  - POST /orgs/:entity_id/threads/:thread_id/messages — send message + SSE
    stream assistant response via OllamaClient, persist both messages
- Wire into server.ts after authenticate middleware, before graphile
- Export plugin, getAgentDiscovery, types from graphile-llm
- Add enableAgentDiscovery option to GraphileLlmPreset (default: true)
- Add graphile-llm and @agentic-kit/ollama deps to graphql-server
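For readers unfamiliar with SSE, the sketch below shows the wire format a streaming endpoint like this one would typically emit: each frame is one or more `data:` lines followed by a blank line, sent with a `Content-Type: text/event-stream` response header. The function names, the `{ delta }` payload shape, and the closing `done` event are assumptions for illustration, not the format used by llm-api.ts.

```typescript
// Formats one Server-Sent Events frame: an optional "event:" line, one
// "data:" line per payload line, then a blank line that ends the frame.
function formatSseEvent(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}

// Forwards model output chunks as SSE frames and returns the full text,
// which a handler could then persist as the assistant message.
async function streamAsSse(
  chunks: AsyncIterable<string>,
  write: (frame: string) => void
): Promise<string> {
  let full = "";
  for await (const chunk of chunks) {
    full += chunk;
    // Hypothetical payload shape: one JSON object per chunk.
    write(formatSseEvent(JSON.stringify({ delta: chunk })));
  }
  // Hypothetical end-of-stream marker so clients know when to stop reading.
  write(formatSseEvent("{}", "done"));
  return full;
}
```

In an Express handler, `write` would be something like `(frame) => res.write(frame)`.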
Restores:
- Embedder resolution priority comments in llm-module-plugin.ts
- v4-style resolver wrapping explanation in text-mutation-plugin.ts
- Section markers (// Only intercept, // Find vector columns, etc.)
- graphile-postgis pattern reference in text-search-plugin.ts
- Array handling comment (AND/OR) in embedTextInWhere
The ~4 chars/token heuristic was unreliable for non-English text and code.
- Pre-check now passes amount=1 (just checks 'has any quota remaining?')
- Post-call records actual char count as the metered amount
- Token-accurate billing will come when providers return usage metadata
Add env.ts as the single source of truth for all LLM environment
variables and defaults (provider, model, baseUrl for both embedding
and chat). Remove all scattered process.env reads and null coalescing
from embedder.ts, chat.ts, llm-module-plugin.ts, and llm-api.ts.

- Remove @constructive-io/graphql-env dependency from graphile-llm
- Defaults: ollama provider, nomic-embed-text / llama3 models,
  http://localhost:11434 base URL
- Export getLlmEnvOptions from package index
- Update tests for new default behavior (env always resolves)
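A single resolver with built-in defaults might be sketched as follows. The default values come from this commit message; the function name `readLlmEnv`, the returned shape, and all environment variable names are hypothetical, since the actual names read by `getLlmEnvOptions()` are not shown here.

```typescript
interface LlmEndpointOptions {
  provider: string;
  model: string;
  baseUrl: string;
}

interface LlmEnvOptions {
  embedding: LlmEndpointOptions;
  chat: LlmEndpointOptions;
}

type Env = Record<string, string | undefined>;

// Hypothetical variable names; defaults are the ones listed in the commit.
function readLlmEnv(env: Env): LlmEnvOptions {
  const pick = (key: string, fallback: string): string => {
    const value = env[key];
    return value !== undefined && value.trim() !== "" ? value : fallback;
  };
  return {
    embedding: {
      provider: pick("LLM_EMBEDDING_PROVIDER", "ollama"),
      model: pick("LLM_EMBEDDING_MODEL", "nomic-embed-text"),
      baseUrl: pick("LLM_EMBEDDING_BASE_URL", "http://localhost:11434"),
    },
    chat: {
      provider: pick("LLM_CHAT_PROVIDER", "ollama"),
      model: pick("LLM_CHAT_MODEL", "llama3"),
      baseUrl: pick("LLM_CHAT_BASE_URL", "http://localhost:11434"),
    },
  };
}
```

Because every field has a fallback, callers such as `readLlmEnv(process.env)` always get a fully resolved object and need no null coalescing of their own.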
Agent table discovery now queries metaschema_modules_public.agent_chat_module
at runtime instead of scanning for @agentThread/@agentMessage/@agentTask
smart tags at PostGraphile schema build time.

Changes:
- agent-discovery-plugin: replaced GraphileConfig.Plugin build hook with
  async getAgentDiscovery(pool, dbname) that queries the module config table
  with a 60s TTL cache
- llm-api: router no longer takes getDiscovery param; calls getAgentDiscovery
  directly with the per-request pool
- server.ts: simplified createLlmApiRouter() call (no args)
- preset.ts: removed LlmAgentDiscoveryPlugin from plugin list
- types.ts: removed enableAgentDiscovery option
@devin-ai-integration devin-ai-integration Bot force-pushed the feat/llm-billing-metering branch from b5d6099 to 9c0265c on May 20, 2026 22:31
API key management will be handled as a separate, intentional feature.
Removed:
- apiKey from EmbedderConfig and ChatConfig types
- api_key_ref from LlmModuleData interface
- resolveApiKey() and SECRETS_MODULE_SCHEMA_SQL from config-cache
- apiKey field from LlmBillingCacheEntry
Major changes:
- Add global routes: POST /v1/threads, POST /v1/threads/:thread_id/messages
  These bill to actor_id from JWT (no entity_id in URL needed)
- Wire billing metering: check_billing_quota pre-check before LLM call,
  record_usage with actual token counts after response
- Store token_usage jsonb on assistant messages (input, output, totalTokens,
  model, latency_ms)
- Use callWithUsage() wrapper: uses generateWithUsage() when available
  (@agentic-kit/ollama >=1.3.0), falls back to generate() + token estimation
- Refactor: extract shared handleCreateThread/handleSendMessage handlers
  used by both entity-scoped and global routes
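One way a shared handler could decide whom to bill for both route families is sketched below: entity-scoped routes bill the `entity_id` from the URL, global routes fall back to the `actor_id` from the JWT, and a missing JWT is rejected in both cases. The type and function names and the claims shape are hypothetical.

```typescript
// Hypothetical shapes for the decoded JWT and the route parameters.
interface JwtClaims {
  actor_id?: string;
}

interface RouteParams {
  entity_id?: string;
  thread_id?: string;
}

interface BillingTarget {
  entityId: string; // who gets billed
  actorId: string; // who made the request
  scope: "entity" | "global";
}

class UnauthorizedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnauthorizedError";
  }
}

function resolveBillingTarget(
  params: RouteParams,
  claims: JwtClaims | undefined
): BillingTarget {
  const actorId = claims?.actor_id;
  // Both route families require an authenticated actor.
  if (actorId === undefined || actorId === "") {
    throw new UnauthorizedError("Missing or invalid JWT");
  }
  // /v1/orgs/:entity_id/... bills the entity named in the URL.
  if (params.entity_id !== undefined && params.entity_id !== "") {
    return { entityId: params.entity_id, actorId, scope: "entity" };
  }
  // /v1/threads... has no entity in the URL, so the actor is billed.
  return { entityId: actorId, actorId, scope: "global" };
}
```

With this in place, `handleCreateThread` and `handleSendMessage` style handlers can be registered on both route prefixes without branching on the URL themselves.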
Discovers inference_log_module config at runtime (60s TTL cache),
and INSERTs into the inference log table after each LLM call with:
entity_id, actor_id, model, provider, request_type, input/output/total
tokens, latency_ms, and status.

Both streaming and non-streaming paths log (fire-and-forget).
Gracefully skips if inference_log_module is not provisioned.
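The fire-and-forget pattern mentioned here can be sketched as below: the log write is started but never awaited, failures are routed to an error callback instead of the request, and the call is skipped when the module is not provisioned. The row fields mirror the ones listed in this commit; `InsertLog`, `logInference`, and the `onError` parameter are hypothetical.

```typescript
interface InferenceLogRow {
  entity_id: string;
  actor_id: string;
  model: string;
  provider: string;
  request_type: string;
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
  latency_ms: number;
  status: string;
}

// Hypothetical insert function; null means inference_log_module is not provisioned.
type InsertLog = (row: InferenceLogRow) => Promise<void>;

function logInference(
  insert: InsertLog | null,
  row: InferenceLogRow,
  onError: (err: unknown) => void = () => {}
): void {
  if (insert === null) return; // gracefully skip
  // Start the write without awaiting it. Wrapping in a promise chain also
  // captures synchronous throws, so logging can never fail the request.
  void Promise.resolve()
    .then(() => insert(row))
    .catch(onError);
}
```

The trade-off is that a lost log row is only visible through `onError`, which is acceptable when the response latency matters more than a guaranteed write.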
Tests:
- Route registration: verifies all 4 routes (entity-scoped + global)
- Auth: both global and entity-scoped routes reject missing JWT
- Token usage: non-streaming response includes usage object
  (prompt_tokens, completion_tokens, total_tokens)
@devin-ai-integration devin-ai-integration Bot changed the title from "feat(graphile-llm): billing metering + agent discovery + REST streaming endpoint + env config" to "feat(graphile-llm): wire billing metering, inference logging, and global routes" on May 20, 2026
…ime check

Per team discussion — use existing generate() API with token estimation
from text length (~4 chars/token). The callWithUsage wrapper remains as
a clean abstraction point for when a provider-native token counting API
is approved.
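A wrapper of this kind might look like the sketch below: it calls a plain `generate()` method, estimates tokens with `Math.ceil(text.length / 4)`, and measures latency. The `GenerateClient` interface, the function name `callWithEstimatedUsage`, and the result shape are hypothetical (the usage field names follow the test description earlier in this PR); the real `callWithUsage` signature is not shown here.

```typescript
// Hypothetical minimal client; only a generate() method is assumed.
interface GenerateClient {
  generate(prompt: string): Promise<string>;
}

interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface GenerationResult {
  text: string;
  usage: Usage;
  latency_ms: number;
}

// Rough heuristic from the PR notes: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

async function callWithEstimatedUsage(
  client: GenerateClient,
  prompt: string,
  now: () => number = () => Date.now()
): Promise<GenerationResult> {
  const start = now();
  const text = await client.generate(prompt);
  const latency_ms = now() - start;

  const prompt_tokens = estimateTokens(prompt);
  const completion_tokens = estimateTokens(text);
  return {
    text,
    usage: {
      prompt_tokens,
      completion_tokens,
      total_tokens: prompt_tokens + completion_tokens,
    },
    latency_ms,
  };
}
```

Keeping the estimation behind one function means a provider-native token count could later replace it without touching the callers.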
…lugin caches

- Add ModuleConfigCache<T> to graphile-cache — generic LRU cache with TTL,
  bounded memory, and named logging for module config lookups
- Migrate agent-discovery-plugin.ts from unbounded Map to ModuleConfigCache
- Migrate config-cache.ts (billing) from raw LRUCache to ModuleConfigCache
- Migrate llm-api.ts inference log cache from unbounded Map to ModuleConfigCache
- Replace direct lru-cache dependency in graphile-llm with graphile-cache
@socket-security socket-security Bot commented May 21, 2026

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | @types/babel__generator@7.27.0 | 100 | 100 | 72 | 80 | 100 |
| Added | @testing-library/react@11.2.5 | 100 | 100 | 100 | 87 | 100 |
| Added | @tanstack/react-query@5.90.21 | 99 | 100 | 88 | 99 | 100 |
| Added | @testing-library/jest-dom@5.11.10 | 100 | 100 | 100 | 89 | 100 |

…ge INSERT

inference_log is the source of truth for usage data. Agent messages
now store only the model name for display — full token counts, latency,
and provider info are recorded in the inference_log table.
@pyramation pyramation merged commit 3349822 into main May 21, 2026
37 checks passed
@pyramation pyramation deleted the feat/llm-billing-metering branch May 21, 2026 01:54