Move public skills to a directory to avoid downloading the whole repo #1519
Merged
timsaucer merged 2 commits into apache:main on Apr 29, 2026
Conversation
timsaucer added a commit to timsaucer/datafusion-python that referenced this pull request on Apr 29, 2026
Upstream apache#1519 moved the root `SKILL.md` to `skills/datafusion_python/SKILL.md`
so that consumers can install the skill without cloning the whole repo. Update all
repo-internal links and external GitHub URLs in the docs site, README, AGENTS.md,
and the package docstring to point at the new location.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
timsaucer added a commit that referenced this pull request on May 3, 2026
* docs: publish SKILL.md on the docs site via myst include
Adds a new `skill` page that embeds the repo-root `SKILL.md` through the
myst `{include}` directive, so the agent-facing guide lives on the
published docs site without duplication. The page is wired into the
User Guide toctree. Implements PR 4a of the plan in #1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: publish llms.txt at docs site root
Adds `docs/source/llms.txt` in llmstxt.org schema: a short description
plus categorized links to the agent skill, user guide pages, DataFrame
API reference, and example queries. `html_extra_path` in `conf.py`
copies it verbatim to the published site root so it resolves at
`https://datafusion.apache.org/python/llms.txt`. Implements PR 4b of
the plan in #1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
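The `html_extra_path` mechanism mentioned above can be sketched as a one-line `conf.py` addition; this is an illustrative fragment (the actual `conf.py` in the repo contains much more), assuming `llms.txt` sits next to `conf.py` in `docs/source/`:

```python
# docs/source/conf.py (excerpt, illustrative)
# Files listed in html_extra_path are copied verbatim into the HTML build
# root, so docs/source/llms.txt is published at <site-root>/llms.txt
# rather than under a page-relative path.
html_extra_path = ["llms.txt"]
```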
* docs: add write-dataframe-code contributor skill
Adds `.ai/skills/write-dataframe-code/SKILL.md`, a contributor-facing
skill for agents working on this repo. It layers on top of the
user-facing repo-root SKILL.md with:
- a TPC-H pattern index mapping idiomatic API usages to the query file
that demonstrates them,
- an ad-hoc plan-comparison workflow for checking DataFrame translations
against a reference SQL query via `optimized_logical_plan()`, and
- the project-specific docstring and aggregate/window documentation
conventions that CLAUDE.md already enforces for contributors.
Implements PR 4c of the plan in #1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: add audit-skill-md skill
Adds `.ai/skills/audit-skill-md/SKILL.md`, a contributor skill that
cross-references the repo-root `SKILL.md` against the current public
Python API (functions module, DataFrame, Expr, SessionContext, and
package-root re-exports). Reports two classes of drift:
- new APIs exposed by the Python surface that are not yet covered in
the user-facing guide, and
- stale mentions in the guide that no longer exist in the public API.
The skill is diff-only — it produces a report the user reviews before
any edit to `SKILL.md`. Complements `check-upstream/`, which audits in
the opposite direction (upstream Rust features not yet exposed).
Implements PR 4d of the plan in #1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: enrich RST pages with demos relocated from TPC-H rewrite
Moves the illustrative patterns that #1504 removed from the TPC-H
examples into the common-operations docs, where they serve as
pattern-focused teaching material without cluttering the TPC-H
translations:
- expressions.rst gains a "Testing membership in a list" section
comparing `|`-compound filters, `in_list`, and `array_position` +
`make_array`, plus a "Conditional expressions" section contrasting
switched and searched `case`.
- udf-and-udfa.rst gains a "When not to use a UDF" subsection
showing the compound-OR predicate that replaces a Python-side UDF
for disjunctive bucket filters (the Q19 case).
- aggregations.rst gains a "Building per-group arrays" subsection
covering `array_agg(filter=..., distinct=True)` with
`array_length`/`array_element` for the single-value-per-group
pattern (the Q21 case).
- Adds `examples/array-operations.py`, a runnable end-to-end
walkthrough of the membership and array_agg patterns.
Implements PR 4e of the plan in #1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: wire new contributor skills and plan-comparison diagnostic into AGENTS.md
- List the three contributor skills (`check-upstream`,
`write-dataframe-code`, `audit-skill-md`) under the Skills section so
agents know what tools they have before starting work.
- Document the plan-comparison diagnostic workflow (comparing
`ctx.sql(...).optimized_logical_plan()` against a DataFrame's
`optimized_logical_plan()` via `LogicalPlan.__eq__`) for translating
SQL queries to DataFrame form. Points at the full write-up in the
`write-dataframe-code` skill rather than duplicating it.
`CLAUDE.md` is a symlink to `AGENTS.md`, so the change lands in both.
Implements PR 4f of the plan in #1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: rename aggregations.rst demo df to orders_df to avoid clobbering state
The "Building per-group arrays" block added in the previous commit
reassigned `df` and `ctx` mid-page, which then broke the
Grouping Sets examples further down that share the Pokemon `df`
binding (`col_type_1` etc. no longer resolved). Rename the demo
DataFrame to `orders_df` and drop the redundant `ctx = SessionContext()`
so the shared state from the top of the page stays intact.
Verified with `sphinx-build -W --keep-going` against the full docs
tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: replace raw SKILL.md include with a human-written AI-assistants page
The previous approach embedded the repo-root `SKILL.md` on the docs
site via a myst `{include}`. That file is written for agents -- dense,
skill-formatted, and not suited to a human browsing the User Guide. It
also relied on a fragile `:start-line:` offset to strip YAML
frontmatter.
Replace it with `docs/source/ai-coding-assistants.md`, a short
human-readable page that mirrors the README section added in #1503:
what the skill is, how to install it via `npx skills` or a manual
pointer, and what kinds of things it covers. `SKILL.md` stays at the
repo root as the single source of truth; agents fetch the raw GitHub
URL directly.
`llms.txt` is updated to point its Agent Guide entry at
`raw.githubusercontent.com/.../SKILL.md` and to include the new
human-readable page as a secondary link. The User Guide toctree now
references `ai-coding-assistants` in place of the removed `skill`
stub.
Verified with `sphinx-build -W --keep-going` against the full docs
tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: drop redundant assistants list in ai-coding-assistants intro
The introduction and the "Installing the skill" section both enumerated
the same set of supported assistants. Drop the intro copy; the list
that matters is next to `npx skills add`, where it answers "what does
this command actually configure?"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: convert ai-coding-assistants page from markdown to rst, shorten title
Every other page in `docs/source/user-guide` and the top-level
`docs/source` is written in reStructuredText; the lone `.md` page was
an inconsistency. Rewrite in rst so the ASF header matches the rest of
the tree, cross-references can use `:py:func:` roles if we ever add
any, and myst is no longer required just to render this one page.
Also shorten the page title from "Using DataFusion with AI Coding
Assistants" to "Using AI Coding Assistants" -- it already sits under
the DataFusion user guide so the product name is redundant.
Verified with `sphinx-build -W --keep-going`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: drop audit-skill-md skill
The skill as written pushed for every public method to be mentioned
in `SKILL.md`, which is the wrong goal. `SKILL.md` is a distilled
agent guide of idiomatic patterns and pitfalls, not an API reference
-- autoapi-generated docs and module docstrings already provide full
per-method coverage. An audit pressing for 100% method coverage would
bloat the skill file into a stale copy of that reference.
The two checks with actual value (stale mentions in `SKILL.md`, and
drift between `functions.__all__` and the categorized function list)
are small enough to be ad-hoc greps at release time and do not
warrant a dedicated skill.
Also remove references to the skill from `AGENTS.md` and the
`write-dataframe-code` skill's "Related" section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: drop write-dataframe-code skill
A separate PR covers the same contributor-facing material (TPC-H
pattern index, plan-comparison workflow, docstring conventions),
so this skill is redundant. Remove the skill directory and the
corresponding references in `AGENTS.md`, including the
plan-comparison section that pointed at it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: show Parquet pushdown plan diff in "When not to use a UDF"
The previous version of the section asserted that a UDF predicate
blocks optimizer rewrites but did not show evidence. Replace the two
`code-block` examples with an executable walkthrough that writes a
small Parquet file, runs the same filter two ways, and prints the
physical plan for each.
The native-expression plan renders with three annotations on the
`DataSourceExec` node that the UDF plan does not have:
- `predicate=brand@1 = A AND qty@2 >= 150` pushed into the scan
- `pruning_predicate=... brand_min@0 <= A AND ... qty_max@4 >= 150`
for row-group pruning via Parquet footer min/max stats
- `required_guarantees=[brand in (A)]` for bloom-filter / dictionary
skipping
The UDF form keeps only `predicate=brand_qty_filter(...)`: the scan
has to materialize every row group and call the Python callback.
The disjunctive-OR rewrite (previously the main example) stays at the
end as the idiomatic alternative for multi-bucket filters.
Verified with `sphinx-build -W --keep-going`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: rework "subsets within a group" aggregation example
Rename the section from "Building per-group arrays" to "Comparing subsets
within a group" so the heading matches the content. Rewrite the intro to
lead with the problem (compare full group vs filtered subset), reframe
the worked example around partially failed orders, and replace the
trailing bullet list with a one-line walkthrough of the result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: clarify "When not to use a UDF" intro
Rewrite the opening of the section to make three things clearer: the
contrast is with native DataFusion expressions (not Python in general),
some predicates genuinely feel easier to write as a Python loop and that
tension is worth acknowledging, and predicate pushdown is a table-provider
mechanism rather than a Parquet-only feature. Parquet stays as the
concrete demo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: move ai-coding-assistants under user-guide/
The page was sitting at the top level of docs/source/ while every other
page in the USER GUIDE toctree lives under docs/source/user-guide/.
Move the file, update the toctree entry, and update the absolute URL
in llms.txt to match the new path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: replace AGENTS.md skill list with discovery instructions
A static skill list in AGENTS.md goes stale as new skills are added
(it already missed the make-pythonic skill that was merged separately).
Replace the enumerated list with a pointer telling agents to list
.ai/skills/ and read each SKILL.md frontmatter, so the catalog never
has to be hand-maintained.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: fix broken llms.txt link and stale otherwise xref
- ai-coding-assistants.rst: use absolute https://datafusion.apache.org/python/llms.txt URL; the relative `llms.txt` resolved to /python/user-guide/llms.txt and 404'd because html_extra_path publishes the file at the site root.
- expressions.rst: drop the broken `:py:meth:~datafusion.expr.Expr.otherwise` xref (otherwise lives on CaseBuilder, not Expr) and spell the recommended replacement as `f.when(f.in_list(...), value).otherwise(default)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: update SKILL.md path after move to skills/datafusion_python/
Upstream #1519 moved the root `SKILL.md` to `skills/datafusion_python/SKILL.md`
so that consumers can install the skill without cloning the whole repo. Update
all repo-internal links and external GitHub URLs in the docs site, README,
AGENTS.md, and the package docstring to point at the new location.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Closes #1518.
Rationale for this change
Users should not have to download the whole repository just to install the skill.
What changes are included in this PR?
Moves the shareable public skills to /skills.
Verify with:
npx skills add https://github.com/rerun-io/datafusion-python/tree/move_skill
You can also confirm that no extra skills are detected by appending --list.
Are there any user-facing changes?
No