Skip to content

Correctly account for isolated samples in normalisation of pair_coalescence_* methods#3460

Open
nspope wants to merge 3 commits into
tskit-dev:mainfrom
nspope:nsp-paircoal-partial-missing
Open

Correctly account for isolated samples in normalisation of pair_coalescence_* methods#3460
nspope wants to merge 3 commits into
tskit-dev:mainfrom
nspope:nsp-paircoal-partial-missing

Conversation

@nspope
Copy link
Copy Markdown
Contributor

@nspope nspope commented May 15, 2026

Fixes #3459

There's the somewhat thorny issue of what the correct normalisation is when only one of span_normalise or pair_normalise are True. In this case (using the notation of the linked issue), what I've done is normalise by the "average number of pairs" $\int p(x) dx / (b - a)$ for "pair normalisation" (so the units are "sequence length per pair") -- this reduces to ${n \choose 2}$ when there's no isolated samples which agrees with the original implementation. Similarly I normalise by the "average pair span" $\int p(x) dx / {n \choose 2}$ for "span normalisation" (so that the units are "# of sample pairs") -- which simplifies to $b - a$ as per the original implementation where there are no isolated samples.

So the output (for the $i$th node or time window) is one of:

  • $C_i$ (unnormalised)
  • $C_i / \int p(x) dx$ (fully normalised)
  • $C_i \times (b - a) / \int p(x) dx$ (pair normalised)
  • $C_i \times {n \choose 2} / \int p(x) dx$ (span normalised)

The above assumes that there's no missing trees over the interval--- if there are, replace $b - a$ by "nonmissing span" (the original implementation discounted missing span as well)

@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.68%. Comparing base (1c689c0) to head (73d8cd9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3460      +/-   ##
==========================================
+ Coverage   91.67%   91.68%   +0.01%     
==========================================
  Files          38       38              
  Lines       32186    32248      +62     
  Branches     5150     5167      +17     
==========================================
+ Hits        29505    29567      +62     
  Misses       2348     2348              
  Partials      333      333              
Flag Coverage Δ
C 82.27% <92.20%> (+0.04%) ⬆️
c-python 77.59% <97.36%> (+0.05%) ⬆️
python-tests 96.40% <ø> (ø)
python-tests-no-jit 33.20% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Python API 98.70% <ø> (ø)
Python C interface 91.23% <ø> (ø)
C library 91.27% <100.00%> (+0.03%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nspope
Copy link
Copy Markdown
Contributor Author

nspope commented May 15, 2026

Added an additional test into the C suite for "ancestor-descendant" sample pairs

@nspope nspope force-pushed the nsp-paircoal-partial-missing branch from 53a8d74 to be9f67f Compare May 15, 2026 03:51
@petrelharp
Copy link
Copy Markdown
Contributor

Slight correction to the above: I think that $p(x)$ here is (or should be) "potential pairs", which is "number of non-missing choose two" for a single sample set; in #3459 $p(x)$ is defined to exclude cases where one sample is a descendant of another. I don't think we should exclude those cases (since the interpretation is cleaner, for one thing), and I don't think this code does that.

Comment thread python/tskit/trees.py Outdated
Comment thread python/tskit/trees.py
Comment on lines 10669 to 10670
The output array has dimension `(windows, indexes, nodes)` with
dimensions dropped when the corresponding argument is set to None.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be amended to include the time_windows case, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, so nodes or time windows

Comment thread python/tskit/trees.py Outdated
Comment thread python/tskit/trees.py
coalesce at position `x` (omitting "isolated" samples). If both
`span_normalise` and `pair_normalise` are true, then the output is
divided by the integral of `p(x)` over the sequence, will thus sum to
one over the "time" dimension (provided all non-isolated samples trace
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait but what if the output is nodes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm being sloppy here--- nodes are counted as the time dimension, and their values also sum to one (as time windowing is just summing node values within a window).

Comment thread python/tskit/trees.py
Comment on lines +10661 to +10662
set, then the output is divided by `(integral p(x) dx) / (num_samples *
(num_samples - 1) / 2)` (the average non-missing sequence length per
Copy link
Copy Markdown
Contributor

@petrelharp petrelharp May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? since according to this paragraph, this is just span_normalise=True, pair_normalise=True multiplied by (num_samples choose 2)? (also btw this description doesn't work for distinct sample sets)

Note: I'll believe you if you say it is right; I don't remember where we got to on this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see --- the denominator we're calculating here is "number of extant sample pairs summed across the contig" (that's p(x) at a particular coordinate x) then dividing it by the total number of pairs (however that is calculated, i.e. for distinct sample sets). So the output is just a scalar multiple of the "fully normalised" case.

Not sure what the question is exactly, let's chat about it in person then I can post a followup for clarity?

@petrelharp
Copy link
Copy Markdown
Contributor

Could we record - hopefully in the documentation - concrete interpretations of what's being returned? Let's see, IIRC this is:

  • span_normalise=False, pair_normalize=False: "find for each pair on how much of the genome do they coalesce in this node/time interval, and add that up across all pairs"
  • span_normalise=True, pair_normalize=True: "pick a uniform (pair, location) out of all the possible ones; what's the probability that that pair at that position coalesces in this node/time interval"

I'm not sure right now what the interpretations for the others are (there may not be a tidy interpretation for the case of missing data; but there should at least be for the no-missing-data case).

@petrelharp
Copy link
Copy Markdown
Contributor

Gee, this code is very nice, but there's a lot to get my head around.

I haven't yet found whether we are also comparing to the naive method to check this for correctness - is that happening?

@nspope
Copy link
Copy Markdown
Contributor Author

nspope commented May 19, 2026

I haven't yet found whether we are also comparing to the naive method to check this for correctness - is that happening?

Not in the code base, no --- were you thinking some sort of statistical check / benchmark baked into the unit tests? There's worked examples in there, but no "accuracy threshold" (which seems like it'd be a brittle test).

I've tested outside (and could post a MWE to the parent issue, if that'd help, alongside the figure shown there).

@petrelharp
Copy link
Copy Markdown
Contributor

No, I mean a very naive "iterate over all pairs and count" method that we can compare to the efficient version (should be identical): we do have this in the testing code (naive_pair_coalescence_counts); but I haven't figured out if this has been updated to account for missing data?

@nspope
Copy link
Copy Markdown
Contributor Author

nspope commented May 19, 2026

I don't think we should exclude those cases (since the interpretation is cleaner, for one thing), and I don't think this code does that.

The code actually is excluding these cases (that's the "upwards propagation" bit of the update) -- are you saying that we shouldn't? This'll certainly simplify the code but will bias the downstream statistics (rates) if so.

@nspope
Copy link
Copy Markdown
Contributor Author

nspope commented May 19, 2026

No, I mean a very naive "iterate over all pairs and count" method that we can compare to the efficient version (should be identical): we do have this in the testing code (naive_pair_coalescence_counts); but I haven't figured out if this has been updated to account for missing data?

aha --- yes, the naive implementation does the correct thing for missing data already. There is a comparison on simulations the ("messy_ts" test fixture), though it could possibly use extending.

nspope and others added 2 commits May 20, 2026 12:20
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect normalisation in TreeSequence.pair_coalescence_counts (and TreeSequence.pair_coalescence_rates) with isolated samples

2 participants