Correctly account for isolated samples in normalisation of `pair_coalescence_*` methods by nspope · Pull Request #3460 · tskit-dev/tskit

nspope · 2026-05-15T02:30:49Z

There's the somewhat thorny issue of what the correct normalisation is when only one of span_normalise or pair_normalise are True. In this case (using the notation of the linked issue), what I've done is normalise by the "average number of pairs" $\int p(x) dx / (b - a)$ for "pair normalisation" (so the units are "sequence length per pair") -- this reduces to ${n \choose 2}$ when there's no isolated samples which agrees with the original implementation. Similarly I normalise by the "average pair span" $\int p(x) dx / {n \choose 2}$ for "span normalisation" (so that the units are "# of sample pairs") -- which simplifies to $b - a$ as per the original implementation where there are no isolated samples.

So the output (for the $i$th node or time window) is one of:

$C_i$ (unnormalised)
$C_i / \int p(x) dx$ (fully normalised)
$C_i \times (b - a) / \int p(x) dx$ (pair normalised)
$C_i \times {n \choose 2} / \int p(x) dx$ (span normalised)

The above assumes that there's no missing trees over the interval--- if there are, replace $b - a$ by "nonmissing span" (the original implementation discounted missing span as well)

codecov · 2026-05-15T02:32:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.68%. Comparing base (1c689c0) to head (73d8cd9).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3460      +/-   ##
==========================================
+ Coverage   91.67%   91.68%   +0.01%     
==========================================
  Files          38       38              
  Lines       32186    32248      +62     
  Branches     5150     5167      +17     
==========================================
+ Hits        29505    29567      +62     
  Misses       2348     2348              
  Partials      333      333

Flag	Coverage Δ
C	`82.27% <92.20%> (+0.04%)`	⬆️
c-python	`77.59% <97.36%> (+0.05%)`	⬆️
python-tests	`96.40% <ø> (ø)`
python-tests-no-jit	`33.20% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Python API	`98.70% <ø> (ø)`
Python C interface	`91.23% <ø> (ø)`
C library	`91.27% <100.00%> (+0.03%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

nspope · 2026-05-15T03:19:21Z

Added an additional test into the C suite for "ancestor-descendant" sample pairs

…samples

petrelharp · 2026-05-19T22:07:58Z

Slight correction to the above: I think that $p(x)$ here is (or should be) "potential pairs", which is "number of non-missing choose two" for a single sample set; in #3459 $p(x)$ is defined to exclude cases where one sample is a descendant of another. I don't think we should exclude those cases (since the interpretation is cleaner, for one thing), and I don't think this code does that.

petrelharp · 2026-05-19T22:11:36Z

        The output array has dimension `(windows, indexes, nodes)` with
        dimensions dropped when the corresponding argument is set to None.


This should be amended to include the time_windows case, right?

good catch, so nodes or time windows

petrelharp · 2026-05-19T22:15:57Z

+        coalesce at position `x` (omitting "isolated" samples). If both
+        `span_normalise` and `pair_normalise` are true, then the output is
+        divided by the integral of `p(x)` over the sequence, will thus sum to
+        one over the "time" dimension (provided all non-isolated samples trace


wait but what if the output is nodes?

I'm being sloppy here--- nodes are counted as the time dimension, and their values also sum to one (as time windowing is just summing node values within a window).

petrelharp · 2026-05-19T22:22:11Z

+        set, then the output is divided by `(integral p(x) dx) / (num_samples *
+        (num_samples - 1) / 2)` (the average non-missing sequence length per


Is this right? since according to this paragraph, this is just span_normalise=True, pair_normalise=True multiplied by (num_samples choose 2)? (also btw this description doesn't work for distinct sample sets)

Note: I'll believe you if you say it is right; I don't remember where we got to on this.

Let's see --- the denominator we're calculating here is "number of extant sample pairs summed across the contig" (that's p(x) at a particular coordinate x) then dividing it by the total number of pairs (however that is calculated, i.e. for distinct sample sets). So the output is just a scalar multiple of the "fully normalised" case.

Not sure what the question is exactly, let's chat about it in person then I can post a followup for clarity?

petrelharp · 2026-05-19T22:26:27Z

Could we record - hopefully in the documentation - concrete interpretations of what's being returned? Let's see, IIRC this is:

span_normalise=False, pair_normalize=False: "find for each pair on how much of the genome do they coalesce in this node/time interval, and add that up across all pairs"
span_normalise=True, pair_normalize=True: "pick a uniform (pair, location) out of all the possible ones; what's the probability that that pair at that position coalesces in this node/time interval"

I'm not sure right now what the interpretations for the others are (there may not be a tidy interpretation for the case of missing data; but there should at least be for the no-missing-data case).

petrelharp · 2026-05-19T23:16:08Z

Gee, this code is very nice, but there's a lot to get my head around.

I haven't yet found whether we are also comparing to the naive method to check this for correctness - is that happening?

nspope · 2026-05-19T23:25:25Z

I haven't yet found whether we are also comparing to the naive method to check this for correctness - is that happening?

Not in the code base, no --- were you thinking some sort of statistical check / benchmark baked into the unit tests? There's worked examples in there, but no "accuracy threshold" (which seems like it'd be a brittle test).

I've tested outside (and could post a MWE to the parent issue, if that'd help, alongside the figure shown there).

petrelharp · 2026-05-19T23:30:58Z

No, I mean a very naive "iterate over all pairs and count" method that we can compare to the efficient version (should be identical): we do have this in the testing code (naive_pair_coalescence_counts); but I haven't figured out if this has been updated to account for missing data?

nspope · 2026-05-19T23:35:26Z

I don't think we should exclude those cases (since the interpretation is cleaner, for one thing), and I don't think this code does that.

The code actually is excluding these cases (that's the "upwards propagation" bit of the update) -- are you saying that we shouldn't? This'll certainly simplify the code but will bias the downstream statistics (rates) if so.

nspope · 2026-05-19T23:37:22Z

No, I mean a very naive "iterate over all pairs and count" method that we can compare to the efficient version (should be identical): we do have this in the testing code (naive_pair_coalescence_counts); but I haven't figured out if this has been updated to account for missing data?

aha --- yes, the naive implementation does the correct thing for missing data already. There is a comparison on simulations the ("messy_ts" test fixture), though it could possibly use extending.

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

nspope mentioned this pull request May 15, 2026

Incorrect normalisation in TreeSequence.pair_coalescence_counts (and TreeSequence.pair_coalescence_rates) with isolated samples #3459

Open

Fix normalisation for pair coalescence stats when there are isolated …

be9f67f

…samples

nspope force-pushed the nsp-paircoal-partial-missing branch from 53a8d74 to be9f67f Compare May 15, 2026 03:51

petrelharp reviewed May 19, 2026

View reviewed changes

Comment thread python/tskit/trees.py Outdated

petrelharp reviewed May 19, 2026

View reviewed changes

Comment thread python/tskit/trees.py Outdated

petrelharp reviewed May 19, 2026

View reviewed changes

nspope and others added 2 commits May 20, 2026 12:20

Update python/tskit/trees.py

ae53f66

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Update python/tskit/trees.py

73d8cd9

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

		The output array has dimension `(windows, indexes, nodes)` with
		dimensions dropped when the corresponding argument is set to None.

		set, then the output is divided by `(integral p(x) dx) / (num_samples *
		(num_samples - 1) / 2)` (the average non-missing sequence length per

Conversation

nspope commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

nspope commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented May 19, 2026

Uh oh!

Uh oh!

petrelharp May 19, 2026

Choose a reason for hiding this comment

Uh oh!

nspope May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

petrelharp May 19, 2026

Choose a reason for hiding this comment

Uh oh!

nspope May 19, 2026

Choose a reason for hiding this comment

Uh oh!

petrelharp May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nspope May 19, 2026

Choose a reason for hiding this comment

Uh oh!

petrelharp commented May 19, 2026

Uh oh!

petrelharp commented May 19, 2026

Uh oh!

nspope commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented May 19, 2026

Uh oh!

nspope commented May 19, 2026

Uh oh!

nspope commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nspope commented May 15, 2026 •

edited

Loading

codecov Bot commented May 15, 2026 •

edited

Loading

nspope commented May 15, 2026 •

edited

Loading

petrelharp May 19, 2026 •

edited

Loading

nspope commented May 19, 2026 •

edited

Loading