Skip to content

[fix](streaming-job) recompute derived fields after replay and ALTER#62936

Merged
liaoxin01 merged 1 commit into
apache:masterfrom
JNSimba:fix/streaming-job-derived-fields-after-replay
May 14, 2026
Merged

[fix](streaming-job) recompute derived fields after replay and ALTER#62936
liaoxin01 merged 1 commit into
apache:masterfrom
JNSimba:fix/streaming-job-derived-fields-after-replay

Conversation

@JNSimba
Copy link
Copy Markdown
Member

@JNSimba JNSimba commented Apr 29, 2026

What problem does this PR solve?

Problem Summary:

StreamingInsertJob initializes two derived fields from jobProperties.max_interval in init():

  • sampleWindowMs = max_interval * 10 * 1000 — used by checkDataQuality() for the load.max_filter_ratio time window
  • jobConfig.timerDefinition.interval = max_interval — used by JobScheduler to compute next trigger time

Neither is persisted in the gson image, and neither is refreshed in two paths:

  1. gson replay (gsonPostProcess): after FE checkpoint restart, sampleWindowMs stays at default 0. The time-window check (now - sampleStartTime) > sampleWindowMs is then always true, so the sample window expires on every commit. The window-accumulation contract used by load.max_filter_ratio degrades to single-batch judgment, and a job recovered from image can be wrongly paused on a small bad batch that should be diluted by the surrounding window.

  2. ALTER PROPERTIES (modifyPropertiesInternal): changing max_interval only updates properties and jobProperties. Neither sampleWindowMs nor timerDefinition.interval is refreshed. The scheduler keeps reading the old interval (the new value never reaches JobExecutionConfiguration.getTriggerDelayTimes), so ALTER max_interval never takes effect — not even after FE restart, since image carries the stale interval too.

Fix

Extract a single recomputeDerivedFields() that re-derives all transient state from jobProperties:

  • sampleWindowMs = maxIntervalSec * 10 * 1000
  • timerDefinition.interval = maxIntervalSec
  • reset sampleStartTime / sampleWindowScannedRows / sampleWindowFilteredRows

Call it at every entry point where jobProperties is rebuilt:

  • init() (job creation)
  • gsonPostProcess() (image replay)
  • modifyPropertiesInternal() (ALTER PROPERTIES)

Resetting the sample counters on ALTER is intentional: changing max_interval redefines the window itself, so accumulated counts from the old window have no meaningful interpretation in the new one.

Release note

Fix streaming insert job sample window and scheduler interval not being restored after FE checkpoint replay or ALTER PROPERTIES.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. (1) Streaming jobs recovered from FE image no longer wrongly pause on the first bad batch under load.max_filter_ratio. (2) ALTER max_interval now takes effect on the scheduler within the next batch tick (~10 minutes); previously it was silently ignored.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 29, 2026

/review

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 29, 2026

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes StreamingInsertJob recovery/ALTER behavior by re-deriving transient scheduling and data-quality window fields from jobProperties.max_interval, ensuring consistent behavior after image replay and after ALTER PROPERTIES.

Changes:

  • Add recomputeDerivedFields() to rebuild sampleWindowMs, refresh timerDefinition.interval, and reset sampling-window counters.
  • Invoke derived-field recomputation from init(), gsonPostProcess(), and modifyPropertiesInternal().
  • Add unit tests covering derived-field recomputation and ALTER refresh behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java Centralizes and re-applies derived-field computation on create/replay/ALTER paths.
fe/fe-core/src/test/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJobCheckDataQualityTest.java Adds tests verifying recomputation updates the sample window and scheduler interval, and resets counters.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 29, 2026

run external

@JNSimba JNSimba force-pushed the fix/streaming-job-derived-fields-after-replay branch from c5c43ce to 0ce6766 Compare May 6, 2026 09:35
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented May 6, 2026

run buildall

2 similar comments
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented May 6, 2026

run buildall

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented May 7, 2026

run buildall

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented May 7, 2026

run cloud_p0

1 similar comment
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented May 7, 2026

run cloud_p0

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented May 11, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed for the full PR diff. I did not find additional distinct issues beyond the existing inline review threads already raised for StreamingInsertJob:

  • modifyPropertiesInternal() / recomputeDerivedFields() updates are not protected by the job write lock while commitOffset() uses the same sampling fields under the write lock.
  • sampleWindowMs and related interval calculations can overflow for very large max_interval values because validation currently has no upper bound.

I did not add duplicate inline comments for those already-known threads.

Critical checkpoint conclusions:

  • Goal: The PR addresses the intended replay/ALTER bug by centralizing recomputation of sampleWindowMs, scheduler interval, and sampling counters from jobProperties.max_interval on create, image replay, and ALTER paths.
  • Scope: The code change is small and focused, with one helper and three call sites.
  • Concurrency: The modified ALTER path touches shared job sampling/timer state; the known existing thread covers the missing job-level write lock. No separate concurrency issue was found.
  • Lifecycle/replay: gsonPostProcess() now rebuilds the non-persisted derived fields after jobProperties is restored; I did not find another replay incompatibility in this patch.
  • Configuration/compatibility: No new config item or wire/storage format is introduced. The behavior change is limited to applying existing max_interval consistently.
  • Parallel paths: Create, ALTER, and gson replay paths are covered by the new helper; task creation reads the refreshed jobProperties as before.
  • Tests: Added unit coverage checks recomputation, ALTER refresh, and counter reset. The two existing review threads identify remaining missing/unsafe cases not covered by these tests.
  • Persistence/transactions/data correctness: No transaction visibility or EditLog serialization format change was introduced; the existing job update log persists the updated job state after ALTER.
  • Observability/performance: No new logging or metrics appear necessary for this small derived-state recomputation; no additional performance issue was found.
  • User focus: No additional user-provided review focus was specified.

Copy link
Copy Markdown
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liaoxin01 liaoxin01 merged commit cf7a588 into apache:master May 14, 2026
30 of 31 checks passed
github-actions Bot pushed a commit that referenced this pull request May 14, 2026
…62936)

### What problem does this PR solve?

Problem Summary:

`StreamingInsertJob` initializes two derived fields from
`jobProperties.max_interval` in `init()`:
- `sampleWindowMs` = `max_interval * 10 * 1000` — used by
`checkDataQuality()` for the `load.max_filter_ratio` time window
- `jobConfig.timerDefinition.interval` = `max_interval` — used by
`JobScheduler` to compute next trigger time

Neither is persisted in the gson image, and neither is refreshed in two
paths:

1. **gson replay (`gsonPostProcess`)**: after FE checkpoint restart,
`sampleWindowMs` stays at default `0`. The time-window check `(now -
sampleStartTime) > sampleWindowMs` is then always true, so the sample
window expires on every commit. The window-accumulation contract used by
`load.max_filter_ratio` degrades to single-batch judgment, and a job
recovered from image can be wrongly paused on a small bad batch that
should be diluted by the surrounding window.

2. **ALTER PROPERTIES (`modifyPropertiesInternal`)**: changing
`max_interval` only updates `properties` and `jobProperties`. Neither
`sampleWindowMs` nor `timerDefinition.interval` is refreshed. The
scheduler keeps reading the old interval (the new value never reaches
`JobExecutionConfiguration.getTriggerDelayTimes`), so ALTER
`max_interval` never takes effect — not even after FE restart, since
image carries the stale `interval` too.

### Fix

Extract a single `recomputeDerivedFields()` that re-derives all
transient state from `jobProperties`:
- `sampleWindowMs = maxIntervalSec * 10 * 1000`
- `timerDefinition.interval = maxIntervalSec`
- reset `sampleStartTime` / `sampleWindowScannedRows` /
`sampleWindowFilteredRows`

Call it at every entry point where `jobProperties` is rebuilt:
- `init()` (job creation)
- `gsonPostProcess()` (image replay)
- `modifyPropertiesInternal()` (ALTER PROPERTIES)

Resetting the sample counters on ALTER is intentional: changing
`max_interval` redefines the window itself, so accumulated counts from
the old window have no meaningful interpretation in the new one.

### Release note

Fix streaming insert job sample window and scheduler interval not being
restored after FE checkpoint replay or ALTER PROPERTIES.
yiguolei pushed a commit that referenced this pull request May 20, 2026
…y and ALTER #62936 (#63261)

Cherry-picked from #62936

Co-authored-by: wudi <wudi@selectdb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants