
Dataset untag#323

Open
PGijsbers wants to merge 16 commits into main from dataset-untag

Conversation


@PGijsbers PGijsbers commented May 3, 2026

Adds the dataset untag functionality. Introduces two endpoints:

  • POST /datasets/untag, which mimics the I/O of the PHP API (modulo error handling)
  • DELETE /datasets/{id}/tag?tag=..., which is the more semantically correct form.

@PGijsbers added the Migration (Already exists in the old API), Datasets (Dataset endpoint), and POST (POST endpoints) labels on May 3, 2026

coderabbitai Bot commented May 3, 2026

Warning

Rate limit exceeded

@PGijsbers has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 2 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization, which allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ae6f2a44-257d-49e1-ae9e-bf60f985f8ca

📥 Commits

Reviewing files that changed from the base of the PR and between 6d0bb5e and be5bc74.

📒 Files selected for processing (51)
  • src/core/access.py
  • src/core/errors.py
  • src/core/formatting.py
  • src/core/logging.py
  • src/database/datasets.py
  • src/database/evaluations.py
  • src/database/flows.py
  • src/database/qualities.py
  • src/database/runs.py
  • src/database/setups.py
  • src/database/studies.py
  • src/database/tasks.py
  • src/database/users.py
  • src/routers/dependencies.py
  • src/routers/mldcat_ap/dataset.py
  • src/routers/openml/datasets.py
  • src/routers/openml/estimation_procedure.py
  • src/routers/openml/evaluations.py
  • src/routers/openml/flows.py
  • src/routers/openml/qualities.py
  • src/routers/openml/runs.py
  • src/routers/openml/setups.py
  • src/routers/openml/study.py
  • src/routers/openml/tasks.py
  • src/routers/openml/tasktype.py
  • tests/conftest.py
  • tests/database/flows_test.py
  • tests/dependencies/fetch_user_test.py
  • tests/routers/openml/dataset_tag_test.py
  • tests/routers/openml/dataset_untag_test.py
  • tests/routers/openml/datasets_features_test.py
  • tests/routers/openml/datasets_get_test.py
  • tests/routers/openml/datasets_list_datasets_test.py
  • tests/routers/openml/datasets_qualities_test.py
  • tests/routers/openml/datasets_status_test.py
  • tests/routers/openml/estimation_procedure_test.py
  • tests/routers/openml/evaluation_measure_test.py
  • tests/routers/openml/flows_exists_test.py
  • tests/routers/openml/flows_get_test.py
  • tests/routers/openml/qualities_list_test.py
  • tests/routers/openml/runs_trace_test.py
  • tests/routers/openml/setups_get_test.py
  • tests/routers/openml/setups_tag_test.py
  • tests/routers/openml/setups_untag_test.py
  • tests/routers/openml/study_attach_test.py
  • tests/routers/openml/study_get_test.py
  • tests/routers/openml/study_post_test.py
  • tests/routers/openml/task_get_test.py
  • tests/routers/openml/task_list_test.py
  • tests/routers/openml/task_type_get_test.py
  • tests/routers/openml/task_type_list_test.py

Walkthrough

This change adds dataset tag removal support plus related type and test updates. Two async DB helpers were added: get_tag(dataset_id, tag, connection) to fetch a dataset-tag row and delete_tag(dataset_id, tag, connection) to remove it. A deprecated POST /datasets/untag wrapper and a new DELETE /datasets/{identifier}/tag endpoint were introduced; the DELETE endpoint distinguishes missing dataset vs missing tag, enforces uploader-or-admin authorization, deletes the tag, and returns HTTP 204. Tests were added for success and multiple failure modes. Several router endpoints were updated to use the Identifier and constrained string types; pyproject Python requirement bumped to >=3.14.

Possibly related PRs

  • openml/server-api PR 322: Modifies dataset tagging logic and router/database code; closely related to the new tag/untag helpers and endpoints.
  • openml/server-api PR 243: Introduced repository-wide async database changes that the new dataset-tag helpers build on.
  • openml/server-api PR 238: Introduced problem-detail error types and error-handling adjustments referenced by the new tag-related error handling.

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 17.95%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Title check ✅ — The title 'Dataset untag' directly relates to the main feature added: dataset untagging functionality across database helpers, routers, and tests.
  • Linked Issues check ✅ — Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ — Skipped because no linked issues were found for this pull request.
  • Description check ✅ — The description accurately covers the changes: adding dataset untag functionality with two endpoints (POST /datasets/untag and DELETE /datasets/{id}/tag).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Review rate limit: 0/1 reviews remaining, refill in 4 minutes and 2 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In untag_dataset, tag is annotated only with SystemString64 but not Body(), so FastAPI will treat it as a query parameter; this conflicts with the tests sending it in the JSON body, so consider wrapping tag with Body() or using a request model to make both data_id and tag body fields.
  • The new get_tag helper does a SELECT * even though only uploader is used; consider selecting just the needed columns (e.g., uploader) to keep the query minimal and the intent of the helper clearer.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `untag_dataset`, `tag` is annotated only with `SystemString64` but not `Body()`, so FastAPI will treat it as a query parameter; this conflicts with the tests sending it in the JSON body, so consider wrapping `tag` with `Body()` or using a request model to make both `data_id` and `tag` body fields.
- The new `get_tag` helper does a `SELECT *` even though only `uploader` is used; consider selecting just the needed columns (e.g., `uploader`) to keep the query minimal and the intent of the helper clearer.

## Individual Comments

### Comment 1
<location path="tests/routers/openml/dataset_untag_test.py" line_range="48-61" />
<code_context>
+    assert str(dataset_id) in e.value.detail
+
+
+async def test_dataset_untag_tag_not_owned(expdb_test: AsyncConnection) -> None:
+    dataset_id = 1
+    tag = "foo"
+    await expdb_test.execute(
+        text("INSERT INTO dataset_tag(id, tag, uploader) VALUES (:dataset_id, :tag, 1)"),
+        parameters={"dataset_id": dataset_id, "tag": tag},
+    )
+
+    with pytest.raises(TagNotOwnedError) as e:
+        await untag_dataset(dataset_id, tag, SOME_USER, expdb_test)
+
+    assert e.value.status_code == HTTPStatus.FORBIDDEN
+    assert tag in e.value.detail
+    assert str(dataset_id) in e.value.detail
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add an assertion that the tag remains in the database when the user is not allowed to remove it

This test only checks that `TagNotOwnedError` and its message are correct. It should also query `dataset_tag` after the failure and assert the tag still exists, mirroring the success-case tests, to ensure the tag isn’t deleted before the permission check.

```suggestion
async def test_dataset_untag_tag_not_owned(expdb_test: AsyncConnection) -> None:
    dataset_id = 1
    tag = "foo"
    await expdb_test.execute(
        text("INSERT INTO dataset_tag(id, tag, uploader) VALUES (:dataset_id, :tag, 1)"),
        parameters={"dataset_id": dataset_id, "tag": tag},
    )

    with pytest.raises(TagNotOwnedError) as e:
        await untag_dataset(dataset_id, tag, SOME_USER, expdb_test)

    assert e.value.status_code == HTTPStatus.FORBIDDEN
    assert tag in e.value.detail
    assert str(dataset_id) in e.value.detail

    # Ensure the tag was not deleted when the user is not allowed to remove it
    result = await expdb_test.execute(
        text(
            "SELECT tag FROM dataset_tag "
            "WHERE id = :dataset_id AND tag = :tag"
        ),
        parameters={"dataset_id": dataset_id, "tag": tag},
    )
    row = result.fetchone()
    assert row is not None
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.90%. Comparing base (3e7c4fc) to head (be5bc74).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #323      +/-   ##
==========================================
+ Coverage   93.69%   93.90%   +0.21%     
==========================================
  Files          68       69       +1     
  Lines        3154     3249      +95     
  Branches      223      227       +4     
==========================================
+ Hits         2955     3051      +96     
  Misses        139      139              
+ Partials       60       59       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/routers/openml/datasets.py (1)

86-104: ⚡ Quick win

Missing logger.info after successful untag.

tag_dataset logs each tag addition (Line 77); the corresponding untag operation has no structured log, creating an observability gap for auditing who removed which tag.

🪵 Proposed addition
     await database.datasets.delete_tag(data_id, tag, expdb_db)
+    logger.info("Dataset {data_id} untagged '{tag}'.", data_id=data_id, tag=tag)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 86 - 104, Add a structured info
log after a successful untag in the untag_dataset handler: after the await
database.datasets.delete_tag(data_id, tag, expdb_db) call, call logger.info with
a clear message including the user identifier (user.user_id), data_id and tag,
and any context fields (e.g., {"user_id": user.user_id, "data_id": data_id,
"tag": tag}) so removals are auditable; ensure you import or use the same logger
instance used by tag_dataset so logs are consistent.
tests/routers/openml/dataset_untag_test.py (1)

18-20: Use SOME_USER.user_id instead of hardcoded 2 in the INSERT statement.

The test hardcodes uploader=2 while making a request as ApiKey.SOME_USER (whose fixture user_id is also 2). If the fixture's user_id is ever changed, the INSERT will silently cause the endpoint to return 403 instead of 204, breaking the test. Reference SOME_USER.user_id directly in the parameters to keep the test's intent explicit and resilient to fixture changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/dataset_untag_test.py` around lines 18 - 20, The INSERT
in the test uses a hardcoded uploader id 2 causing fragility; update the
expdb_test.execute call that runs the INSERT INTO dataset_tag(...) to pass
uploader equal to SOME_USER.user_id instead of the literal 2 (i.e., replace the
hardcoded uploader value in the SQL parameters with the fixture
SOME_USER.user_id) so the test uses the fixture's actual user_id when creating
the tag and remains correct if the fixture changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/routers/openml/datasets.py`:
- Around line 86-104: Add a structured info log after a successful untag in the
untag_dataset handler: after the await database.datasets.delete_tag(data_id,
tag, expdb_db) call, call logger.info with a clear message including the user
identifier (user.user_id), data_id and tag, and any context fields (e.g.,
{"user_id": user.user_id, "data_id": data_id, "tag": tag}) so removals are
auditable; ensure you import or use the same logger instance used by tag_dataset
so logs are consistent.

In `@tests/routers/openml/dataset_untag_test.py`:
- Around line 18-20: The INSERT in the test uses a hardcoded uploader id 2
causing fragility; update the expdb_test.execute call that runs the INSERT INTO
dataset_tag(...) to pass uploader equal to SOME_USER.user_id instead of the
literal 2 (i.e., replace the hardcoded uploader value in the SQL parameters with
the fixture SOME_USER.user_id) so the test uses the fixture's actual user_id
when creating the tag and remains correct if the fixture changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 59ef5e5a-935b-429c-b493-71f336def752

📥 Commits

Reviewing files that changed from the base of the PR and between 3e7c4fc and c7851ad.

📒 Files selected for processing (3)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • tests/routers/openml/dataset_untag_test.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/routers/openml/tasks.py (1)

419-421: ⚡ Quick win

get_task still uses plain int — consider aligning with the PR's Identifier pattern.

Every other task/dataset/run/flow/setup get_* endpoint in this PR now uses Identifier (positive-int validated). get_task was missed.

♻️ Suggested one-line fix
 @router.get("/{task_id}")
 async def get_task(
-    task_id: int,
+    task_id: Identifier,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/tasks.py` around lines 419 - 421, The get_task endpoint
currently types task_id as plain int; change its parameter type to the
Identifier validated type (e.g., async def get_task(task_id: Identifier)) and
add the corresponding import for Identifier at top of the module to align with
other endpoints; ensure any usages inside get_task (references to task_id)
remain valid with Identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 117-119: The except clause is using Python 2 comma syntax and will
raise a SyntaxError; change it to catch the two exceptions using a tuple:
replace the line "except DatasetNotFoundError, DatasetNoAccessError:" with an
"except (DatasetNotFoundError, DatasetNoAccessError):" clause in the same block
(the handler that builds msg and re-raises DatasetNotFoundError(msg, code=472)
from None), leaving the existing message and raise logic unchanged.
- Around line 100-103: After calling untag_dataset and fetching tags via
database.datasets.get_tags_for (see variables tags and data_id), guard against
tags being empty before indexing: if len(tags) == 0 set return_tags = [] (empty
list), if len(tags) == 1 set return_tags = tags[0] (unwrapped single tag),
otherwise keep return_tags = tags; then return {"data_untag": {"id":
str(data_id), "tag": return_tags}}. Ensure this logic replaces the current
unconditional tags[0] access to avoid IndexError in untag_dataset / get_tags_for
handling.

In `@tests/conftest.py`:
- Around line 186-192: Remove the unused "table" entry from the parameters dict
passed alongside the f-string SQL in tests/conftest.py: the query string already
interpolates table via f"INSERT INTO {table}(...)" and uses bind parameters
:identifier, :tag, :user_id, so delete the "table": table key from the dict
(leave identifier, tag, and user_id/OWNER_USER.user_id intact and keep the
existing # noqa: S608 comment).
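As an aside, the exception-tuple fix asked for in the first datasets.py comment is plain Python 3 syntax and can be checked in isolation (the error classes here are hypothetical stand-ins for the project's types):

```python
class DatasetNotFoundError(Exception):
    pass


class DatasetNoAccessError(Exception):
    pass


def handle(error: Exception) -> str:
    try:
        raise error
    # Python 3 requires a parenthesized tuple to catch several exception
    # types; the comma form ("except A, B:") is Python 2 syntax and is a
    # SyntaxError in any Python 3 interpreter.
    except (DatasetNotFoundError, DatasetNoAccessError) as e:
        return f"handled {type(e).__name__}"
```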

---

Nitpick comments:
In `@src/routers/openml/tasks.py`:
- Around line 419-421: The get_task endpoint currently types task_id as plain
int; change its parameter type to the Identifier validated type (e.g., async def
get_task(task_id: Identifier)) and add the corresponding import for Identifier
at top of the module to align with other endpoints; ensure any usages inside
get_task (references to task_id) remain valid with Identifier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ad51f9ae-3ecd-42ce-95bd-0d43d3ca0915

📥 Commits

Reviewing files that changed from the base of the PR and between c7851ad and 6d0bb5e.

📒 Files selected for processing (13)
  • pyproject.toml
  • src/routers/openml/datasets.py
  • src/routers/openml/flows.py
  • src/routers/openml/qualities.py
  • src/routers/openml/runs.py
  • src/routers/openml/setups.py
  • src/routers/openml/study.py
  • src/routers/openml/tasks.py
  • src/routers/types.py
  • tests/conftest.py
  • tests/routers/openml/dataset_untag_test.py
  • tests/routers/openml/datasets_get_test.py
  • tests/routers/openml/setups_untag_test.py
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml


Labels

Datasets (Dataset endpoint), Migration (Already exists in the old API), POST (POST endpoints)
