WIP: pathfinder_compatibility_guard_rails#1977
Introduce CompatibilityGuardRails plus related errors and tests so callers can opt into CTK and driver compatibility checks while reusing the existing pathfinder lookup APIs. Made-with: Cursor
Expose process_wide_compatibility_guard_rails at import time so follow-up changes can route the default cuda.pathfinder APIs through a stable public instance. Document the singleton and pin its public availability with a small regression test. Made-with: Cursor
Make the process-wide CompatibilityGuardRails instance the default path for the public load/find/locate APIs so top-level calls share compatibility state. Factor the routing/fallback/cache-reset glue into a dedicated internal module to keep `cuda.pathfinder.__init__` focused on the public surface, and fall back to the existing raw resolvers when the only problem the v1 guard rails report is insufficient metadata. Made-with: Cursor
Allow CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS to select strict, best_effort, or off behavior so we can experiment with stricter compatibility checks without changing the public API shape. Made-with: Cursor
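A minimal sketch of how such an environment-variable switch could be read. Only the variable name and the three values come from the commit message above; the helper name `read_guard_rails_mode`, the `best_effort` default, and the exact error text are assumptions for illustration, not the PR's implementation.

```python
import os

ENV_VAR = "CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS"
# Accepted values as listed in the commit message.
VALID_MODES = ("strict", "best_effort", "off")


def read_guard_rails_mode(environ=None, default="best_effort"):
    """Return the selected mode (hypothetical helper).

    The default value is an assumption; the PR may choose a different one.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return default
    mode = raw.strip()
    if mode not in VALID_MODES:
        # Reject typos early instead of silently falling back to the default.
        raise RuntimeError(
            f"{ENV_VAR}={raw!r} is invalid; expected one of {', '.join(VALID_MODES)}"
        )
    return mode
```

Passing a plain dict as `environ` keeps the sketch easy to test without touching the real process environment.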
Treat driver-packaged libraries as compatibility-neutral so strict mode can load NVML and other driver libs without a raw fallback, while CTK-backed artifacts remain the only items that establish and enforce the process-wide CTK anchor. Made-with: Cursor
Infer the CUDA Toolkit line from both wildcard-pinned and range-based cuda-toolkit requirements so strict process-wide guard rails keep working for editable wheel installs used by nvrtc and nvJitLink. Made-with: Cursor
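A rough sketch of what inferring a CTK line from a `cuda-toolkit` requirement string might look like. It uses only the standard library and handles just two shapes (a wildcard pin such as `==13.*` and a simple `>=X.Y,<...` range); the function names and the returned `(major, minor_or_None)` shape are assumptions, and the PR itself may rely on a full requirement parser instead.

```python
import re

_NAME_RE = re.compile(r"\s*cuda-toolkit(?:\[[^\]]*\])?\s*(.*)$")
_WILDCARD_RE = re.compile(r"==\s*(\d+)(?:\.(\d+))?\.\*")


def _parse_bound(clause, op):
    """Parse '<op>X[.Y[.Z...]]' into (X, Y); a missing Y is treated as 0."""
    m = re.fullmatch(re.escape(op) + r"\s*(\d+)(?:\.(\d+))?(?:\.\d+)*", clause)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2) or 0)


def infer_ctk_line(requirement):
    """Return (major, minor or None) for a cuda-toolkit requirement, else None."""
    m = _NAME_RE.match(requirement)
    if m is None:
        return None
    spec = m.group(1).split(";")[0].strip()  # drop environment markers
    w = _WILDCARD_RE.fullmatch(spec)
    if w:
        # Wildcard pin: "==13.*" -> (13, None), "==12.9.*" -> (12, 9)
        return int(w.group(1)), (int(w.group(2)) if w.group(2) else None)
    lower = upper = None
    for clause in (c.strip() for c in spec.split(",")):
        lower = _parse_bound(clause, ">=") or lower
        upper = _parse_bound(clause, "<") or upper
    if lower is None or upper is None:
        return None
    if upper == (lower[0], lower[1] + 1):
        return lower  # range spans exactly one minor line
    if upper == (lower[0] + 1, 0):
        return lower[0], None  # range spans one whole major line
    return None  # anything else is treated as undetermined
```

Returning `None` for anything the sketch does not recognize mirrors the idea of treating unclear metadata as insufficient rather than guessing.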
|
/ok to test |
|
Analysis of CI failures for workflow run
Cursor GPT-5.4 Extra High Fast
Findings
Why
Proper fix
|
Introduce a small toolkit-info utility that reads the CUDA_VERSION macro from cuda.h so follow-up guard-rails changes can infer CTK major.minor from toolkit headers without depending on version.json. Made-with: Cursor
Centralize encoded CUDA version parsing and validation so toolkit and driver version helpers stay aligned and cuda.h parsing gets consistent string conversion and error reporting. Made-with: Cursor
Replace version.json-based CTK root metadata with cuda.h parsing so compatibility checks use a simpler, more universal toolkit source while preserving wheel-based metadata inference. Made-with: Cursor
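The `cuda.h` approach described in the three commits above can be sketched as follows. `CUDA_VERSION` is encoded as `1000 * major + 10 * minor` (the same encoding `cuDriverGetVersion()` reports), so one decoder can serve both the toolkit and the driver side. The function names and the specific validation rule are assumptions, not the PR's actual helpers.

```python
import re

# Matches a line such as "#define CUDA_VERSION 13020" in cuda.h.
_CUDA_VERSION_RE = re.compile(
    r"^\s*#\s*define\s+CUDA_VERSION\s+(\d+)\b", re.MULTILINE
)


def decode_cuda_version(encoded):
    """Decode 1000*major + 10*minor into (major, minor)."""
    if not isinstance(encoded, int) or encoded < 1000:
        # Hypothetical validation rule: reject values with no major part.
        raise ValueError(f"not an encoded CUDA version: {encoded!r}")
    major, rest = divmod(encoded, 1000)
    return major, rest // 10


def ctk_version_from_header_text(text):
    """Extract (major, minor) from the contents of a cuda.h file."""
    m = _CUDA_VERSION_RE.search(text)
    if m is None:
        raise ValueError("CUDA_VERSION macro not found")
    return decode_cuda_version(int(m.group(1)))
```

For example, `decode_cuda_version(13020)` yields `(13, 2)`, and the same call works for a driver version obtained elsewhere.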
|
/ok to test |
|
At commit c6c38e3, the CI has a single failure in
That failure does not look like a
Spot-checking sibling logs shows that the underlying
So the most important takeaway from the logs is: the single red test is a combination of two conditions happening in the same job:
That explains why this shows up as only one visible failure even though the broader
Issues to look into next:
|
This keeps the host-backed compatibility checks aligned with the resolved toolkit layout and skips cleanly when cudart headers or cuda.h are unavailable. Made-with: Cursor
This keeps the host-backed compatibility checks aligned with the actual driver state instead of relying on a fixed encoded version in the real-environment tests. Made-with: Cursor
|
/ok to test |
This keeps the real host-backed checks strict when an installed nvcc wheel actually ships nvcc, while avoiding a false failure in cu12 wheel environments that only provide lower-level compiler pieces such as ptxas. Made-with: Cursor
Prompt
I'm coming back to this question as posted under #1977 (comment)
Do we understand this now?
Response
Mostly yes. The
That is exactly what the latest logs show:
So for the current failures on PR 1977, yes, we understand the flip. One nuance remains: the older single anomalous local job from the earlier archive, where a local job still got |
|
/ok to test |
Keep the ad hoc conda environment helpers aligned with the current cuda_pathfinder test dependency groups so local verification matches the package's test matrix. Split the PowerShell install list from Linux-only dependencies to avoid pulling unsupported packages on Windows. Made-with: Cursor
Keep the test environment helpers working across CUDA 12 and 13 by matching conda-forge solver constraints, so CUDA 12 uses Python 3.12 while CUDA 13 stays on Python 3.14. Made-with: Cursor
Keep the helper scripts focused on the CUDA packages they need from conda while `pip install --group test` owns the pytest dependencies, so the ad hoc environment setup stays easier to keep in sync with pyproject markers. Made-with: Cursor
Look for cudadevrt under both Library/lib/x64 and Library/lib so CUDA 12 conda environments resolve the real static library instead of falling through to a misleading CUDA_PATH error. Made-with: Cursor
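The two-directory lookup described in the last commit above could look roughly like this. The sub-directory order is taken from the commit message; the helper name `find_static_lib` and the `cudadevrt.lib` default filename are illustrative assumptions.

```python
from pathlib import Path

# Candidate sub-directories of a Windows conda prefix, in the order
# named in the commit message.
_CONDA_LIB_SUBDIRS = ("Library/lib/x64", "Library/lib")


def find_static_lib(prefix, filename="cudadevrt.lib"):
    """Return the first existing candidate path, or None (hypothetical helper)."""
    for subdir in _CONDA_LIB_SUBDIRS:
        candidate = Path(prefix) / subdir / filename
        if candidate.is_file():
            return candidate
    # Let the caller decide how to report "not found" rather than
    # falling through to an unrelated CUDA_PATH error.
    return None
```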
|
Cursor GPT-5.4 Extra High Fast
PR 1977 conda testing results analysis
Analyzed logs:
Overall
Compatibility guard rails
Static-lib fix
Comparing the two logs
Bottom line
|
I wanted to confirm that this PR solves a real-world problem that previously triggered a long Slack thread. After a few days of false starts, we were able to distill the issue to the following minimal reproducer:
I set up my linux-64 workstation so that
With current released pathfinder
pip install cuda-pathfinder
pip install "cuda-toolkit[nvrtc]==13.*"
Here,
That mismatch can lead to extremely confusing NVRTC runtime failures.
This PR
After switching to the pathfinder code in this PR at commit 5a35125, with everything else unchanged, the mismatch is diagnosed and reported immediately:
This is exactly the kind of confusing mixed-environment situation the compatibility guard rails are meant to catch. With released pathfinder, the mismatch is silently accepted. With this PR, it is surfaced clearly and immediately, with a precise explanation of what resolved to CTK 13.2 and what resolved to CTK 13.0. |
Keep diagnostics and tests aligned with the configured default so future rollout changes only need one constant update. Made-with: Cursor
Align CompatibilityGuardRails with the PEP 440 version syntax users already know, and reuse packaging's parser instead of maintaining custom constraint logic. Made-with: Cursor
Separate item validation, pairwise CTK coherence, and driver checks so later component- and pipeline-aware rules can land without changing current guard-rails behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
Record graph-derived dynamic-link groupings and cross-surface companion tags so later guard-rails milestones can add component- and pipeline-aware policy without reworking the catalogs or resolution plumbing. Co-authored-by: Cursor <cursoragent@cursor.com>
Require exact CTK matching only for authored same-component or companion relationships, so independent artifacts can coexist across minors. Add a Linux-only driver-compatibility override for forward-compatibility deployments without relaxing CTK-coherence checks. Co-authored-by: Cursor <cursoragent@cursor.com>
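The relationship-scoped rule described in the commit above can be illustrated with a small sketch: two items must share an exact CTK major.minor only when they belong to the same component or one declares the other as a companion. The `Item` data model, the field names, and the placeholder library names are all hypothetical; the PR's catalogs and error types are not shown in this conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a resolved artifact."""
    name: str
    component: str
    ctk: tuple  # (major, minor)
    companions: frozenset = frozenset()


def are_related(a, b):
    """Same component, or either side declares the other as a companion."""
    return (
        a.component == b.component
        or b.name in a.companions
        or a.name in b.companions
    )


def check_pair(a, b):
    """Raise if two related items come from different CTK lines."""
    if are_related(a, b) and a.ctk != b.ctk:
        raise RuntimeError(
            f"{a.name} resolved to CTK {a.ctk[0]}.{a.ctk[1]} but "
            f"{b.name} resolved to CTK {b.ctk[0]}.{b.ctk[1]}"
        )


# Placeholder names: libA and libA_helper share a component, libB is independent.
lib_a = Item("libA", "compA", (13, 2))
lib_a_helper = Item("libA_helper", "compA", (13, 0))
lib_b = Item("libB", "compB", (13, 0))
check_pair(lib_a, lib_b)  # independent artifacts may differ in minor version
```

Calling `check_pair(lib_a, lib_a_helper)` would raise, because both belong to the same component but resolved to different CTK lines.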
Query NVML for display-driver release metadata and use it to distinguish backward compatibility from NVIDIA's same-major minor-version compatibility. This lets guard rails follow published driver-branch thresholds instead of treating cuDriverGetVersion() as the whole driver story. Co-authored-by: Cursor <cursoragent@cursor.com>
Track declared nvrtc/nvJitLink producer-consumer flows so guard rails can apply NVIDIA's stricter LTOIR rules without over-constraining PTX, ELF, and CUBIN cases. Keep explicit nvvm pipelines conservative until the model can represent NVVM IR version and dialect details. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Tracking progress: With commit 6b4d910 we have reached the final milestone as laid out here: These commits were entirely generated with Cursor GPT-5.4 Extra High Fast (I only glanced through): |
|
/ok to test |
Skip Linux-only driver-forward-compatibility tests on non-Linux hosts and stop treating nvcc discovery as mandatory in see_what_works real-host checks. This keeps platform-specific expectations from obscuring real guard-rails regressions when CI infrastructure and host layouts vary. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/ok to test |
Drop redundant mocked happy-path checks that now overlap with the real-host CI matrix, and add explicit ELF/CUBIN pipeline cases so the remaining mocks stay focused on platform, ordering, and version-corner behavior. This keeps the guard-rails suite easier to maintain without giving up the synthetic coverage that real installs still cannot exercise reliably. Co-authored-by: Cursor <cursoragent@cursor.com>
Move public/process-wide and real-host coverage into dedicated modules while centralizing shared fixtures. This keeps the core policy suite focused without changing guard-rails coverage. Co-authored-by: Cursor <cursoragent@cursor.com>
Share the guard-rails-off fixture and small CTK sandbox builders so the touched pathfinder tests stay easier to extend and less error-prone. Co-authored-by: Cursor <cursoragent@cursor.com>
Move static and bitcode caching to the shared locate layer so strict-mode public APIs reuse the same discovery boundary after process-wide guard-rails indirection. Add symmetric wrapper cache clears and a regression test that exercises the strict-mode path. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/ok to test |
* Add nccl_device to _BITCODE_LIBS_PACKAGED_WITH so the guard-rails resolver layer no longer raises KeyError for a name that is already in SUPPORTED_BITCODE_LIBS; lock the dispatch tables in place with parametrized tests that walk every supported bitcode/static/binary name through _resolve_*_item.
* Remove unreachable helpers _pipeline_compatibility_result, _dynamic_lib_pipeline_items, and CompatibilityGuardRails._enforce_declared_dynamic_lib_pipelines_for_pair. The pipeline check still fires from _enforce_declared_dynamic_lib_pipelines_for_item after _remember, which is the only code path that ever produced a result.
* Re-export DriverCtkCompatibilityError from cuda.pathfinder so the driver-vs-CTK case (already advertised by the env-var hint) can be caught by type instead of message text, and list it in api.rst.
Co-authored-by: Cursor <cursoragent@cursor.com>
…ion, and reset naming
* Defer the platform check in CUDA_PATHFINDER_DRIVER_COMPATIBILITY to after the CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS=off short-circuit so users who turn guard rails off entirely are no longer forced to also unset the override on non-Linux platforms. The value-validation RuntimeError still fires unconditionally so typos are caught early.
* Move the binary packaged_with mapping next to the binary registry as SUPPORTED_BINARIES_PACKAGED_WITH and reclassify nsys / nsight-sys / ncu / nsight-compute as packaged_with="other" so strict-mode lookups for separately packaged Nsight tools no longer raise misleading "missing CTK metadata" errors.
* Rename CompatibilityGuardRails._reset_for_testing to _reset_state and document that production cache_clear callers also drive it; configured driver overrides are intentionally re-applied while lazily-queried values are dropped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Low-severity polish on the v1 compatibility guard rails surface plus two new tests so the existing invariants are asserted instead of only code-read.
- _owned_distribution_candidates: note that symlinks are intentionally not chased on either side of the path comparison.
- _missing_ctk_metadata_message now appends the conflicting CTK set when wheel metadata for the same on-disk file matches more than one cuda-toolkit distribution, instead of silently collapsing to "could not determine the CTK version".
- _compatible_pair_message picks distinct wording for the same-CTK vs cross-CTK independent-pair cases so the message is no longer misleading when both items share a CTK.
- _declare_dynamic_lib_pipeline gains a docstring explaining why it stays single-underscored in v1 (taxonomy/policy still evolving).
- Block comment near _STATIC_LIBS_PACKAGED_WITH / _BITCODE_LIBS_PACKAGED_WITH calls out the lockstep requirement with SUPPORTED_*_LIBS and points at the parametrized resolver tests that enforce coverage.
- load_nvidia_dynamic_lib augments any CompatibilityCheckError raised during _register_and_check with a sentence explaining the underlying dlopen / LoadLibraryW already happened and the OS handle remains live. Mutates exc.args in place so subclass typing (DriverCtkCompatibilityError) and __cause__ are preserved.
- _try_process_wide_guard_rails_then_fallback documents why the forward-compat hint is appended only on Linux (cuda-compat-* is NVIDIA's Linux-only contract).
- New test_register_and_check_is_idempotent_for_repeated_items asserts duplicate ResolvedItem registrations collapse to one entry.
- New test_driver_ctk_compatibility_error_is_typed_catchable asserts a driver-too-old failure raises DriverCtkCompatibilityError as itself (not just by message), is still a CompatibilityCheckError, and carries the new "OS handle remains live" augmentation.
Co-authored-by: Cursor <cursoragent@cursor.com>
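The in-place `exc.args` mutation mentioned in the commit above can be illustrated with a small sketch. The two exception classes here are local stand-ins that only mirror the subclass relationship stated in the commit message (their `RuntimeError` base is an assumption), and `append_note` and `demo` are hypothetical helpers, not the PR's code.

```python
class CompatibilityCheckError(RuntimeError):
    """Local stand-in for the PR's base error type."""


class DriverCtkCompatibilityError(CompatibilityCheckError):
    """Local stand-in for the driver-vs-CTK subclass."""


def append_note(exc, note):
    """Extend the message of an existing exception object in place.

    Because the same object is re-used, its concrete type, __cause__,
    and traceback are all left untouched.
    """
    if exc.args and isinstance(exc.args[0], str):
        exc.args = (f"{exc.args[0]} {note}",) + exc.args[1:]
    else:
        exc.args = exc.args + (note,)
    return exc


def demo():
    try:
        try:
            raise OSError("low-level failure")
        except OSError as low:
            raise DriverCtkCompatibilityError("driver too old") from low
    except CompatibilityCheckError as exc:
        # Augment and hand back the very same exception object.
        return append_note(exc, "The OS handle remains live.")
```

Raising a new exception with a longer message would instead lose the subclass or require re-chaining the cause by hand, which is presumably why the commit mutates `args`.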
|
/ok to test |
Resolves #1038
Continuation of #1936
WIP — CI testing