Refactor Dataset Tag Function by PGijsbers · Pull Request #322 · openml/server-api

PGijsbers · 2026-05-02T17:13:10Z

Rely on the database to identify if the dataset is absent or the tag already exists.

coderabbitai · 2026-05-02T17:13:20Z

Warning

Rate limit exceeded

@pre-commit-ci[bot] has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 50 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0ea2ff1f-60d6-49be-89a3-be7a7515abf7

📥 Commits

Reviewing files that changed from the base of the PR and between 196d486 and d9814e1.

📒 Files selected for processing (2)

.gitignore
src/routers/openml/datasets.py

Walkthrough

This PR enhances error handling in the dataset tagging workflow across multiple layers. It introduces a new database exceptions module defining constraint violation exceptions, modifies the database layer to catch and re-raise IntegrityError as domain-specific exceptions, updates the REST API endpoint to use an Identifier type for validation and handle database-layer exceptions, and adds corresponding test coverage for dataset-not-found scenarios. The workflow configuration is simplified by removing Docker container exit code error handling.

Possibly related PRs

Update tests to reduce amount of times py_api fixture is used #293: Both PRs modify dataset tagging test behavior and assertion logic for the tag_dataset endpoint, specifically around error handling during the tagging operation.
Migrate POST /setup/untag endpoint (#65) #246: Both PRs make changes to the OpenML datasets tagging endpoint in src/routers/openml/datasets.py, including parameter signature and dependency handling updates.
Improve consistency of test variable names and ordering #310: Both PRs modify the same test file tests/routers/openml/dataset_tag_test.py with updates to test structure and assertion patterns.

Suggested labels

tests

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Refactor Dataset Tag Function' clearly describes the main change—refactoring the dataset tagging logic to rely on database-level validation.
Description check	✅ Passed	The description directly relates to the changeset by explaining the core motivation: moving dataset validation and duplicate detection to the database layer rather than handling it elsewhere.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tags

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 0/1 reviews remaining, refill in 53 minutes and 50 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

In tag_dataset, the TypeError message for a non-AsyncConnection expdb_db uses {type(expdb_db)} inside a normal string; convert this to an f-string (or use %s) so the actual type is interpolated instead of the literal placeholder.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `tag_dataset`, the TypeError message for a non-`AsyncConnection` `expdb_db` uses `{type(expdb_db)}` inside a normal string; convert this to an f-string (or use `%s`) so the actual type is interpolated instead of the literal placeholder.

## Individual Comments

### Comment 1
<location path="src/routers/openml/datasets.py" line_range="65-66" />
<code_context>
-    assert expdb_db is not None  # noqa: S101
-    tags = await database.datasets.get_tags_for(data_id, expdb_db)
-    if tag.casefold() in [t.casefold() for t in tags]:
+    if not isinstance(expdb_db, AsyncConnection):
+        msg = "`expdb_db` should be an AsyncConnection, is {type(expdb_db)}"
+        raise TypeError(msg)
+
</code_context>
<issue_to_address>
**issue:** The TypeError message isn’t interpolating `type(expdb_db)` as intended.

`msg` is defined as a normal string, so `{type(expdb_db)}` will be rendered literally. Make this an f-string (or use `%s`/`.format`) so the actual type is shown, e.g. `f"`expdb_db` should be an AsyncConnection, is {type(expdb_db)!r}"`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

for more information, see https://pre-commit.ci

coderabbitai

🧹 Nitpick comments (2)

docs/contributing/tests.md (1)

31-37: ⚡ Quick win

Polish wording: use “Elasticsearch” (not “elastic search”).

The docs say: “Some tests require elastic search indices to be constructed.” Standard term is Elasticsearch; this also avoids style/grammar flags (e.g., hyphenation/compound-word suggestions).
Suggested change
-Some tests require elastic search indices to be constructed.
+Some tests require Elasticsearch indices to be constructed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/contributing/tests.md` around lines 31 - 37, Update wording in the docs
to use the proper product name "Elasticsearch" instead of "elastic search"
wherever it appears (e.g., the sentence that currently reads "Some tests require
elastic search indices to be constructed.") and ensure the variable name
INDEX_ES_DURING_STARTUP and reference to `docker/php/.env` remain unchanged;
simply replace "elastic search" with "Elasticsearch" throughout the paragraph to
correct casing and compound-word usage.

.github/workflows/tests.yml (1)

33-39: ⚡ Quick win

Make the .env flip more robust (avoid brittle exact-string sed).

Right now the workflow only replaces the exact substring INDEX_ES_DURING_STARTUP=false. If the .env file contains whitespace/quotes (e.g., INDEX_ES_DURING_STARTUP = false or 'false') or the value is already true, the replacement may silently not occur—leading to Elasticsearch index-dependent tests behaving inconsistently.

Suggested change

       if [ "${{ matrix.php_api }}" = "true" ]; then
-            sed -i 's/INDEX_ES_DURING_STARTUP=false/INDEX_ES_DURING_STARTUP=true/' docker/php/.env
+            # Flip INDEX_ES_DURING_STARTUP in a whitespace/quote-tolerant way; fail if the key isn't found.
+            if ! sed -i -E 's/^(INDEX_ES_DURING_STARTUP[[:space:]]*=[[:space:]]*)("|'\"''\"'\"'|)?false("|'\"''\"'\"'|)?[[:space:]]*$/\1true/Ig' docker/php/.env; then
+              echo "Failed to update INDEX_ES_DURING_STARTUP in docker/php/.env" >&2
+              exit 1
+            fi
             services="$services php-api"
       fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/tests.yml around lines 33 - 39, The current sed invocation
that targets the exact string INDEX_ES_DURING_STARTUP=false is brittle; update
the workflow step that builds services (the block setting services="python-api"
and the sed invocation) to use a more robust regex-based edit against
docker/php/.env so it matches the key regardless of surrounding whitespace or
quotes and sets its value to true (and if the key is missing, append
INDEX_ES_DURING_STARTUP=true to the file). Ensure the change updates lines like
INDEX_ES_DURING_STARTUP = 'false' or "false" and is idempotent if the value is
already true.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @.github/workflows/tests.yml:
- Around line 33-39: The current sed invocation that targets the exact string
INDEX_ES_DURING_STARTUP=false is brittle; update the workflow step that builds
services (the block setting services="python-api" and the sed invocation) to use
a more robust regex-based edit against docker/php/.env so it matches the key
regardless of surrounding whitespace or quotes and sets its value to true (and
if the key is missing, append INDEX_ES_DURING_STARTUP=true to the file). Ensure
the change updates lines like INDEX_ES_DURING_STARTUP = 'false' or "false" and
is idempotent if the value is already true.

In `@docs/contributing/tests.md`:
- Around line 31-37: Update wording in the docs to use the proper product name
"Elasticsearch" instead of "elastic search" wherever it appears (e.g., the
sentence that currently reads "Some tests require elastic search indices to be
constructed.") and ensure the variable name INDEX_ES_DURING_STARTUP and
reference to `docker/php/.env` remain unchanged; simply replace "elastic search"
with "Elasticsearch" throughout the paragraph to correct casing and
compound-word usage.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 15336861-a4d5-4cac-bb7b-c44320c7156f

📥 Commits

Reviewing files that changed from the base of the PR and between 6350bbf and 2bd8068.

📒 Files selected for processing (7)

.github/workflows/tests.yml
docs/contributing/tests.md
src/database/datasets.py
src/routers/openml/datasets.py
src/routers/types.py
tests/constants.py
tests/routers/openml/dataset_tag_test.py

for more information, see https://pre-commit.ci

codecov · 2026-05-02T18:01:34Z

Codecov Report

❌ Patch coverage is 96.29630% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.69%. Comparing base (6350bbf) to head (d9814e1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/database/datasets.py	81.81%	1 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (96.29%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #322      +/-   ##
==========================================
+ Coverage   93.66%   93.69%   +0.02%     
==========================================
  Files          67       68       +1     
  Lines        3109     3154      +45     
  Branches      221      223       +2     
==========================================
+ Hits         2912     2955      +43     
- Misses        138      139       +1     
- Partials       59       60       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Previously there was a database startup service, which would cause docker compose up to exit with error code 1 (see also the referenced issue). But since this is no longer the case, we can just rely on normal tool exit code.

There is no status that depends on it, and it's not required to make sure that the `docker compose` command exits with code 1 if startup fails.

for more information, see https://pre-commit.ci

sourcery-ai Bot reviewed May 2, 2026

View reviewed changes

Comment thread src/routers/openml/datasets.py Outdated

PGijsbers added 6 commits May 2, 2026 19:16

towards shared tagging logic

05de3a4

Continue with shared tagging logic

d302499

Add error type parameter

738d5dd

Go back to the direct method of creating the tag

43c1eca

Add test to verify behavior if dataset is not present

d32df38

Update migration test to include testing behavior for dataset 404

50d6d33

PGijsbers force-pushed the tags branch from 2bd8068 to 50d6d33 Compare May 2, 2026 17:16

[pre-commit.ci] auto fixes from pre-commit.com hooks

1e8337f

for more information, see https://pre-commit.ci

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

PGijsbers and others added 4 commits May 2, 2026 19:20

Remove typecheck; covered by mypy and pydantic

b272025

[pre-commit.ci] auto fixes from pre-commit.com hooks

bd71889

for more information, see https://pre-commit.ci

Add healthcheck to avoid silent failure (esp. in CI)

18f177a

Define the exceptions

bd4623b

PGijsbers added 2 commits May 2, 2026 20:07

Remove the crutch that was needed for startup container

3005335

Previously there was a database startup service, which would cause docker compose up to exit with error code 1 (see also the referenced issue). But since this is no longer the case, we can just rely on normal tool exit code.

Remove health check

ad61f21

There is no status that depends on it, and it's not required to make sure that the `docker compose` command exits with code 1 if startup fails.

PGijsbers force-pushed the tags branch from 1a396e2 to ad61f21 Compare May 2, 2026 18:16

[no ci] add .claude/ and .ignore/

a7e9e92

PGijsbers force-pushed the tags branch from 196d486 to a7e9e92 Compare May 2, 2026 18:23

[pre-commit.ci] auto fixes from pre-commit.com hooks

d9814e1

for more information, see https://pre-commit.ci

PGijsbers merged commit 3e7c4fc into main May 2, 2026
7 of 8 checks passed

PGijsbers deleted the tags branch May 2, 2026 18:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor Dataset Tag Function#322

Refactor Dataset Tag Function#322
PGijsbers merged 15 commits intomainfrom
tags

PGijsbers commented May 2, 2026

Uh oh!

coderabbitai Bot commented May 2, 2026 •

edited

Loading

Rate limit exceeded

❌ Failed checks (1 warning)

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented May 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

PGijsbers commented May 2, 2026

Uh oh!

coderabbitai Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 2, 2026 •

edited

Loading

codecov Bot commented May 2, 2026 •

edited

Loading