Add PersistentProgramCache (sqlite + filestream backends) #1912
Open
cpcloud wants to merge 2 commits into NVIDIA:main from
Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.
Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager and pickle-safety warning. Path-backed
ObjectCode is rejected at write time (would store only the path).
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction against an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the size cap bounds real on-disk usage. __contains__
is read-only -- it does not bump LRU. __len__ counts only entries
that survive validation and prunes corrupt rows. Schema-version
mismatch on open drops the tables and rebuilds; corrupt /
non-SQLite files are detected and the cache reinitialises empty.
Transient OperationalError ("database is locked") propagates
without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. On-disk
filenames are blake2b(32) hashes of the key so arbitrary-length
keys never overflow filesystem name limits. Reader pruning is
stat-guarded: only delete a corrupt-looking file if its inode/
size/mtime have not changed since the read, so a concurrent
os.replace by a writer is preserved. clear() and _enforce_size_cap
use the same stat guard. Stale temp files (older than 1 hour) are
swept on open and during eviction; live temp files count toward
the size cap. Windows ERROR_SHARING_VIOLATION (32) and
ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
backoff (~185ms) before being treated as a non-fatal cache miss;
other PermissionErrors and all POSIX failures propagate. __len__
matches __getitem__ semantics (rejects schema/key/value mismatch).
* make_program_cache_key -- stable 32-byte blake2b key over code,
code_type, ProgramOptions, target_type, name expressions, cuda
core/NVRTC versions, NVVM lib+IR version, linker backend+version
for PTX inputs (driver version included only on the cuLink path).
Backend-specific gates mirror Program/Linker:
  * code_type lower-cased to match Program_init.
  * code_type/target_type combination validated against Program's
    SUPPORTED_TARGETS matrix.
  * NVRTC side-effect options (create_pch, time, fdevice_time_trace)
    and external-content options (include_path, pre_include, pch,
    use_pch, pch_dir) require an extra_digest from the caller. The
    per-field set/unset predicate (_option_is_set) mirrors the
    compiler's emission gates; collections.abc.Sequence is the
    is_sequence check, matching _prepare_nvrtc_options_impl.
  * NVVM use_libdevice=True requires extra_digest because libdevice
    bitcode comes from the active toolkit. extra_sources is
    rejected for non-NVVM. Bytes-like ``code`` is rejected for
    non-NVVM (Program() requires str there).
  * PTX (Linker) input options are normalised through per-field
    gates that match _prepare_nvjitlink_options /
    _prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
    to a sentinel under the driver linker (it ignores them).
    ptxas_options canonicalises across str/list/tuple/empty shapes.
    The driver linker's hard rejections (time, ptxas_options,
    split_compile) raise at key time.
  * name_expressions are gated on backend == "nvrtc"; PTX/NVVM
    ignore them, matching Program.compile.
  * Failed environment probes mix the exception class name into a
    *_probe_failed label so broken environments never collide with
    working ones, while staying stable across processes and across
    repeated calls within a process.
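
To make the FileStreamProgramCache bullet above more concrete, here is a minimal sketch of the three ideas it names: hash-derived filenames, atomic publication via a temp file plus os.replace, and a stat-guarded prune. It is not the PR's implementation. The helper names (entry_path, atomic_write, snapshot, prune_if_unchanged) and the "tmp-" prefix are hypothetical, and the Windows retry logic and temp-file sweeping are left out.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def entry_path(root: Path, key: bytes) -> Path:
    # A 32-byte blake2b digest is 64 hex characters, so the filename length
    # does not depend on the length of the key.
    return root / hashlib.blake2b(key, digest_size=32).hexdigest()


def atomic_write(root: Path, key: bytes, payload: bytes) -> Path:
    # Write to a temp file in the same directory, then publish it with
    # os.replace so readers see either the old entry or the new one.
    target = entry_path(root, key)
    fd, tmp = tempfile.mkstemp(dir=root, prefix="tmp-")  # hypothetical prefix
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, target)
    except BaseException:
        # Best-effort cleanup of the temp file if anything went wrong.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    return target


def snapshot(path: Path) -> tuple:
    # Identity of the file as seen at read time.
    st = path.stat()
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def prune_if_unchanged(path: Path, seen: tuple) -> bool:
    # Only unlink when the file still matches what the reader saw; if a
    # writer replaced it in the meantime, leave the new entry alone.
    try:
        if snapshot(path) != seen:
            return False
        path.unlink()
    except FileNotFoundError:
        return False
    return True
```

In this sketch a small window remains between the second stat and the unlink, so the guard narrows the race rather than eliminating it.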
Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.
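
The lazy-export pattern described above can be sketched as follows. This is an illustration of module-level __getattr__ in general, not the code in cuda.core.utils: the lazy_pkg module, the _LAZY_EXPORTS table, and the choice of json.decoder as the "heavy" dependency are all stand-ins chosen so the example runs anywhere.

```python
import importlib
import types

# Hypothetical mapping for illustration: public name -> (module, attribute).
_LAZY_EXPORTS = {
    "JSONDecoder": ("json.decoder", "JSONDecoder"),
}

# Stand-in for a package's __init__ module. In a real package the function
# below would be a top-level ``def __getattr__(name)`` in __init__.py.
lazy_pkg = types.ModuleType("lazy_pkg")


def _lazy_getattr(name):
    try:
        module_name, attr = _LAZY_EXPORTS[name]
    except KeyError:
        raise AttributeError(
            f"module 'lazy_pkg' has no attribute {name!r}"
        ) from None
    # The backing module is imported only on first access to the name.
    value = getattr(importlib.import_module(module_name), attr)
    # Cache on the module so __getattr__ is not consulted again for it.
    setattr(lazy_pkg, name, value)
    return value


# Module attribute lookup falls back to __getattr__ found in the module dict.
lazy_pkg.__getattr__ = _lazy_getattr
```

Importing the package itself therefore touches none of the names in the table; only attribute access on a listed name triggers the underlying import.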
Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.
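
The parity test mentioned above reads a literal out of non-Python source without importing it. A minimal sketch of that technique, assuming the literal is bracketed and follows a plain `name =` assignment, might look like this. SAMPLE_PYX and its values are made up for illustration and do not reproduce the real matrix in _program.pyx; extract_literal is a hypothetical helper name.

```python
import ast
import io
import tokenize

# Hypothetical stand-in for a slice of Cython source (values are made up).
SAMPLE_PYX = '''
cdef dict SUPPORTED_TARGETS = {
    "c++": ("ptx", "cubin", "ltoir"),  # illustrative only
    "ptx": ("cubin",),
}
'''

_OPEN = {"(", "[", "{"}
_CLOSE = {")", "]", "}"}
_SKIP = {tokenize.NL, tokenize.NEWLINE, tokenize.COMMENT}


def extract_literal(source, name):
    # Tokenizing is purely lexical, so Cython keywords such as ``cdef``
    # come through as ordinary NAME tokens.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    for i in range(len(tokens) - 1):
        tok, nxt = tokens[i], tokens[i + 1]
        if tok.type != tokenize.NAME or tok.string != name or nxt.string != "=":
            continue
        depth, parts = 0, []
        for t in tokens[i + 2:]:
            if t.type in _SKIP:
                continue
            parts.append(t.string)
            if t.type == tokenize.OP and t.string in _OPEN:
                depth += 1
            elif t.type == tokenize.OP and t.string in _CLOSE:
                depth -= 1
            if depth == 0:
                # Brackets are balanced: evaluate the collected tokens as a
                # literal, never as executable code.
                return ast.literal_eval(" ".join(parts))
    raise LookupError(f"no assignment to {name!r} found")
```

A parity test can then compare the extracted value against the table kept on the cache side and fail when the two drift apart.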
Summary
* Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so ``from cuda.core.utils import StridedMemoryView`` stays lightweight.
* ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
* make_program_cache_key() -- blake2b(32) digest with backend-specific gates that mirror Program/Linker:
  * Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like ``code`` for non-NVVM and extra_sources for non-NVVM.
  * NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
  * PTX (Linker) options are normalised through per-field gates matching _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under driver linker.
  * Failed environment probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
* SQLiteProgramCache -- single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, wal_checkpoint(TRUNCATE) + VACUUM after evictions so the cap bounds real on-disk usage. __contains__ is read-only; __len__ validates and prunes corrupt rows. threading.RLock serialises connection use. Schema-mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; OperationalError (lock/busy) propagates without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- multi-process via tmp + os.replace. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded (snapshot (ino, size, mtime_ns), refuse unlink on mismatch) so a concurrent writer's os.replace is preserved. Stale temp files swept on open; live temps count toward the size cap. Windows ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionError and all POSIX failures propagate. __len__ also rejects stored_key/path mismatch.
* Program.compile(cache=...) integration is out of scope (tracked by #176/#179).
Test plan
* Single-process CRUD; LRU/size-cap (logical and on-disk, including stat-guarded race scenarios); corruption + __len__ pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError); lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity test that parses _program.pyx via tokenize + ast.literal_eval.
* End-to-end test that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens the cache, and calls get_kernel on the deserialised ObjectCode, parametrized over both backends.
Closes #178