Skip to content

SIGSEGV in getCdbComponentInfo() when standby coordinator is on dedicated host#1702

Open
jangjang0401 wants to merge 5 commits intoapache:mainfrom
jangjang0401:fix/hotstandby-cdbutil
Open

SIGSEGV in getCdbComponentInfo() when standby coordinator is on dedicated host#1702
jangjang0401 wants to merge 5 commits intoapache:mainfrom
jangjang0401:fix/hotstandby-cdbutil

Conversation

@jangjang0401
Copy link
Copy Markdown

What does this PR do?

When a hot standby coordinator runs on a dedicated host with no primary
segments, getCdbComponentInfo() crashes with SIGSEGV.

hostPrimaryCountHash is built from primary segment hosts only. When
IS_HOT_STANDBY_QD() is true, the loops over segment_db_info and
entry_db_info also process mirror/standby entries. On a dedicated
standby host that owns no primary segments, HASH_FIND returns NULL.
The subsequent Assert(found) or direct dereference of hsEntry causes
a SIGSEGV, crashing every backend that attempts to connect.

The bug is masked when the standby coordinator shares a host with primary
segments, because the hostname already exists in the hash by coincidence.

Fix by treating a missing hash entry as a count of zero.

Type of Change

  • Bug fix (non-breaking change)

Test Plan

  • Reproduced on a cluster where the standby coordinator is on a dedicated
    host (no primary segments). psql to the standby crashed with SIGSEGV
    before this fix and connects successfully after.

Impact

User-facing changes:
Connections to a hot standby coordinator no longer crash when the standby
host runs no primary segments.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @jangjang0401 welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

Copy link
Copy Markdown
Contributor

@my-ship-it my-ship-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor nits. Suggest:

  1. Add the Assert(found || IS_HOT_STANDBY_QD()) to keep the primary-path invariant.
  2. Mirror the explanatory comment on the second call site (or drop it from the first).

@yjhjstz yjhjstz force-pushed the fix/hotstandby-cdbutil branch 2 times, most recently from a650064 to af7447c Compare April 27, 2026 18:38
getCdbComponentInfo() populates hostPrimaryCountHash with primary hosts only.
When IS_HOT_STANDBY_QD() is true, mirror and standby hosts are also looked up
in the hash but return NULL on dedicated standby nodes that host no primary
segments. Replace Assert(found) with a null-safe check to prevent SIGSEGV.
@jangjang0401 jangjang0401 force-pushed the fix/hotstandby-cdbutil branch from af7447c to 446b0ab Compare April 27, 2026 23:51
@jangjang0401
Copy link
Copy Markdown
Author

@my-ship-it

Thanks for the review!

  • Added Assert(found || IS_HOT_STANDBY_QD()) to preserve the primary-path invariant.
  • Mirrored the explanatory comment on the second call site for consistency.

Let me know if anything else should be adjusted.

@jangjang0401
Copy link
Copy Markdown
Author

@yjhjstz @my-ship-it
Regarding the vacuum_gp check failure:

It appears that the vacuum_gp failure is an indirect effect of this patch.
Before the fix, the standby coordinator crashed on every connection, which caused FTS to disable synchronous replication. After the fix, however, the standby remains healthy and synchronous replication stays enabled. As a result, unexpected DETAIL/WARNING messages seem to appear during the fault injection process.

I would appreciate it if you could confirm whether my understanding is correct.
Additionally, in this case, would it be appropriate to add SET synchronous_commit = local before the fault injection block in vacuum_gp.sql, or would you recommend handling this in a different way?

@my-ship-it
Copy link
Copy Markdown
Contributor

@yjhjstz @my-ship-it Regarding the vacuum_gp check failure:

It appears that the vacuum_gp failure is an indirect effect of this patch. Before the fix, the standby coordinator crashed on every connection, which caused FTS to disable synchronous replication. After the fix, however, the standby remains healthy and synchronous replication stays enabled. As a result, unexpected DETAIL/WARNING messages seem to appear during the fault injection process.

I would appreciate it if you could confirm whether my understanding is correct. Additionally, in this case, would it be appropriate to add SET synchronous_commit = local before the fault injection block in vacuum_gp.sql, or would you recommend handling this in a different way?

Right direction, but I'd scope it more tightly:

BEGIN;
SET LOCAL synchronous_commit = local;
SELECT gp_inject_fault('interconnect_stop_recv_chunk', 'interrupt', dbid) ...;
analyze relcache_leak_in_motion;
SELECT gp_inject_fault(... 'reset' ...);
COMMIT;

@jangjang0401
Copy link
Copy Markdown
Author

@yjhjstz @my-ship-it Regarding the vacuum_gp check failure:
It appears that the vacuum_gp failure is an indirect effect of this patch. Before the fix, the standby coordinator crashed on every connection, which caused FTS to disable synchronous replication. After the fix, however, the standby remains healthy and synchronous replication stays enabled. As a result, unexpected DETAIL/WARNING messages seem to appear during the fault injection process.
I would appreciate it if you could confirm whether my understanding is correct. Additionally, in this case, would it be appropriate to add SET synchronous_commit = local before the fault injection block in vacuum_gp.sql, or would you recommend handling this in a different way?

Right direction, but I'd scope it more tightly:

BEGIN;
SET LOCAL synchronous_commit = local;
SELECT gp_inject_fault('interconnect_stop_recv_chunk', 'interrupt', dbid) ...;
analyze relcache_leak_in_motion;
SELECT gp_inject_fault(... 'reset' ...);
COMMIT;

@my-ship-it
Thank you for the suggestion! I've applied SET LOCAL synchronous_commit = local within a BEGIN/COMMIT block as recommended.

I haven't updated vacuum_gp.out yet since I'm not entirely sure what the exact output will look like for the commands inside the aborted transaction state. Would it be okay to update vacuum_gp.out after checking the actual CI output, or could you provide some guidance on the expected output?

@jangjang0401
Copy link
Copy Markdown
Author

@yjhjstz @my-ship-it Hi, could someone please approve the workflows run when you get a chance? Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants