Motivation
cuda.core is intended to be a high-level Pythonic wrapper around lower-level bindings in cuda.bindings. In idiomatic Python, errors should be communicated via exceptions rather than requiring callers to inspect return values. This audit looked at all public functions and methods in cuda.core for places where the C convention of returning error/status codes leaks through — or more broadly, anywhere the caller must inspect the returned object for correctness rather than relying on exception flow.
Summary
The codebase is largely well-designed. The HANDLE_RETURN() macro and handle_return() function consistently convert CUDA error codes into Python exceptions across the vast majority of the API. However, there are several notable deviations.
Findings
1. Event.is_done — boolean derived from CUDA error code
❌ _event.pyx: Converts CUDA_SUCCESS → True and CUDA_ERROR_NOT_READY → False. The caller must inspect the return value rather than relying on exception flow. This is a common idiom in async GPU APIs and is arguably reasonable for polling, but it is worth noting as a deliberate deviation from pure exception-based error handling.
@mdboom comment: cuEventQuery docs say:
Returns ::CUDA_SUCCESS if all captured work has been completed, or
::CUDA_ERROR_NOT_READY if any captured work is incomplete.
So I think this code is correct.
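The polling idiom described above can be modeled in plain Python. This is an illustrative sketch, not the actual _event.pyx code: the status constants match the documented driver-API values, but the function name and error type are hypothetical.

```python
# Driver-API status codes relevant to cuEventQuery (documented values).
CUDA_SUCCESS = 0
CUDA_ERROR_NOT_READY = 600

def is_done(status: int) -> bool:
    """Map a cuEventQuery status code onto a boolean, raising on real errors."""
    if status == CUDA_SUCCESS:
        return True          # all captured work has completed
    if status == CUDA_ERROR_NOT_READY:
        return False         # work is still in flight; not an error
    raise RuntimeError(f"CUDA error {status}")
```

Only the two documented polling outcomes are mapped to booleans; any other status still follows exception flow.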
2. Program.pch_status — string status code the caller must interpret
❓ _program.pyx: Returns "created", "not_attempted", "failed", or None. The "failed" case is notable — PCH creation failure is reported as a string value rather than raised as an exception. The caller must know to check for "failed" and handle it. Internally, the helper _read_pch_status() also uses None as a sentinel for "heap exhausted, retry needed" (a classic C-style error pattern, though internal-only).
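To make the burden on the caller concrete, here is a hedged sketch of the check a user must remember to write today, expressed as the exception the audit suggests. `PCHCreationError` and `check_pch_status` are hypothetical names, not part of cuda.core.

```python
class PCHCreationError(RuntimeError):
    """Hypothetical error for failed precompiled-header creation."""

def check_pch_status(pch_status):
    # The caller-side check the current API requires: compare magic strings.
    if pch_status == "failed":
        raise PCHCreationError("PCH creation failed during compile()")
    return pch_status  # "created", "not_attempted", or None
```

If compile() raised this itself, the string comparison would never leak into user code.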
3. Linker.get_error_log() / get_info_log() — unchecked CUDA calls
✔️ _linker.pyx: These return diagnostic strings, but the underlying CUDA calls to nvJitLinkGetErrorLogSize / nvJitLinkGetErrorLog are not checked via HANDLE_RETURN — the results are used directly without error checking. If these calls fail, the failure is silently ignored.
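A minimal sketch of the fix suggested later: route each nvJitLink status through a checker instead of dropping it. The constant and function names here are illustrative stand-ins, not the actual cuda.bindings API.

```python
NVJITLINK_SUCCESS = 0  # assumed success code for illustration

def handle_nvjitlink_return(status):
    """Raise instead of silently ignoring a nonzero nvJitLink status."""
    if status != NVJITLINK_SUCCESS:
        raise RuntimeError(f"nvJitLink call failed with status {status}")
```

The same check would wrap both the log-size query and the log-retrieval call, so a failed call surfaces immediately rather than producing a silently empty or garbage log.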
4. _MP_deallocate silently swallows CUDA_ERROR_INVALID_CONTEXT
❌ _memory_pool.pyx: The deallocation path explicitly suppresses CUDA_ERROR_INVALID_CONTEXT. The function is marked noexcept so it cannot raise, but this means a real error (e.g., deallocating after context destruction) is silently ignored. Callers have no way to know deallocation failed.
@mdboom commentary: This seems correct as-is. According to the docs, ::CUDA_ERROR_INVALID_CONTEXT here is just an indication that the default stream was specified with no current context.
5. DeviceProperties._get_attribute() returns a default on CUDA_ERROR_INVALID_VALUE
❌ _device.pyx: When querying device attributes, CUDA_ERROR_INVALID_VALUE (which often means "this attribute isn't supported on this GPU") is silently converted to a default value (typically 0) rather than raising. A caller reading device.properties.some_attribute could get 0 and not know whether the attribute is genuinely 0 or unsupported on their hardware.
@mdboom commentary: This seems correct -- this is basically creating dict.get()-like functionality (with a default value when the key doesn't exist) on top of cuDeviceGetAttribute, which seems totally fine.
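The dict.get() analogy can be made concrete with a small model. This is an illustrative sketch, not the real _device.pyx code; the status constants use the documented driver-API values, and `query` stands in for the raw attribute call.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_VALUE = 1  # documented driver-API value

def get_attribute(query, attr, default=0):
    """query(attr) -> (status, value); unsupported attributes get `default`."""
    status, value = query(attr)
    if status == CUDA_ERROR_INVALID_VALUE:
        return default          # like dict.get(key, default)
    if status != CUDA_SUCCESS:
        raise RuntimeError(f"CUDA error {status}")
    return value
```

The trade-off flagged above is visible here: a returned 0 is indistinguishable from `default`, which is exactly why a distinct sentinel is suggested below.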
6. Kernel._get_arguments_info() uses CUDA_ERROR_INVALID_VALUE as end-of-list sentinel
❌ _module.pyx: Loops calling cuKernelGetParamInfo until it gets CUDA_ERROR_INVALID_VALUE, which it interprets as "no more parameters" rather than an error. This mirrors the C API convention. Any genuinely invalid-value error would also be silently consumed.
@mdboom commentary: Given that there is no API to retrieve the number of parameters for a kernel, this seems like the correct way to iterate over all of them. This code is so core to everything, if this were an issue I'm pretty confident we would know about it.
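The end-of-list idiom can be modeled as follows. This is an illustrative sketch of the loop shape only; `query` stands in for the per-index cuKernelGetParamInfo call, and the real code lives in _module.pyx.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_VALUE = 1  # documented driver-API value

def collect_params(query):
    """Query parameter info by index until the API reports INVALID_VALUE."""
    params = []
    i = 0
    while True:
        status, info = query(i)
        if status == CUDA_ERROR_INVALID_VALUE:
            break  # interpreted as "no more parameters", not an error
        if status != CUDA_SUCCESS:
            raise RuntimeError(f"CUDA error {status}")
        params.append(info)
        i += 1
    return params
```

As the audit notes, a genuine INVALID_VALUE error at the boundary would be absorbed by the `break`; any other error code still raises.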
7. Device_resolve_device_id() returns 0 on CUDA_ERROR_INVALID_CONTEXT
❌ _device.pyx: When no context exists, instead of raising, it defaults to device 0 (mimicking cudart behavior). This is an internal function but affects public API behavior — Device(None) silently falls back to device 0 rather than informing the caller there is no active context.
@mdboom commentary: This all seems solidly in "designed this way on purpose".
8. DMR_mempool_get_access() — returns magic strings instead of a typed enum
❓ _device_memory_resource.pyx: Returns "rw", "r", or "". The empty string "" (meaning "no access") is a value the caller must check — attempting to use a buffer without access would only fail later at a less helpful point. A proper enum would make this more self-documenting and less error-prone.
@mdboom commentary: This seems fine as-is, as it's just telling the user what the permissions are. But this function is not called internally from anywhere or tested, so it's unclear to me what the expected usage pattern is.
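The enum suggested below could be as small as this. `PoolAccess` is a hypothetical name; the member values mirror the strings the current API returns, so either form could be accepted at the boundary.

```python
import enum

class PoolAccess(enum.Enum):
    """Typed alternative to the "rw"/"r"/"" magic strings."""
    NONE = ""          # no access; currently the easy-to-miss empty string
    READ = "r"
    READ_WRITE = "rw"
```

A caller comparing `access is PoolAccess.NONE` is harder to get wrong than testing for an empty string, and the "no access" case becomes self-documenting.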
Suggestions
Ranked roughly from most to least impactful:
Program.pch_status returning "failed" — consider raising an exception (or at least a warning) during compile() when PCH creation fails, rather than silently storing a status string the user must remember to check.
Linker.get_error_log() / get_info_log() — check the CUDA return values from the underlying log-retrieval calls via HANDLE_RETURN.
_MP_deallocate suppressing CUDA_ERROR_INVALID_CONTEXT — at minimum log a warning so failures are observable.
DeviceProperties returning 0 for unsupported attributes — consider raising AttributeError or returning a distinct sentinel so callers can distinguish "genuinely 0" from "not supported".
DMR_mempool_get_access — return a proper enum rather than magic strings.
Kernel._get_arguments_info() end-of-list sentinel — document or assert that CUDA_ERROR_INVALID_VALUE is only expected at the boundary, to avoid masking real errors.
Device_resolve_device_id() defaulting to device 0 — consider raising when there is no active context, rather than silently choosing a device.
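One way to realize the "distinct sentinel" suggestion from the list above, sketched with hypothetical names: a unique module-level object is never equal to a real attribute value, so callers can distinguish "genuinely 0" from "not supported".

```python
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_VALUE = 1  # documented driver-API value

UNSUPPORTED = object()  # unique sentinel; never compares equal to a real value

def attribute_or_unsupported(status, value):
    """Return the value, UNSUPPORTED for unsupported attributes, or raise."""
    if status == CUDA_ERROR_INVALID_VALUE:
        return UNSUPPORTED
    if status != CUDA_SUCCESS:
        raise RuntimeError(f"CUDA error {status}")
    return value
```

Checking `result is UNSUPPORTED` is explicit at the call site, at the cost of no longer returning a plain int in all cases.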
Not flagged (correct patterns)
For completeness, these were reviewed and found to handle errors properly:
Graph.update() — raises CUDAError with diagnostic info on GRAPH_EXEC_UPDATE_FAILURE
GraphBuilder.complete() / _instantiate_graph() — raises RuntimeError with error reason
Event.__sub__() — handles error codes inline but always raises exceptions with contextual messages
All close() methods — delegate to C++ RAII handles; idempotent no-op behavior is standard
All memory resource allocate() / deallocate() public methods — consistently use HANDLE_RETURN or raise_if_driver_error()
All Stream, Device, Context public methods — consistently raise via HANDLE_RETURN
All graph node factory methods — consistently raise via HANDLE_RETURN
system subpackage functions — consistently raise ValueError / RuntimeError on failure