Correctly account for isolated samples in normalisation of pair_coalescence_* methods#3460
Correctly account for isolated samples in normalisation of pair_coalescence_* methods#3460nspope wants to merge 3 commits into
pair_coalescence_* methods#3460Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3460 +/- ##
==========================================
+ Coverage 91.67% 91.68% +0.01%
==========================================
Files 38 38
Lines 32186 32248 +62
Branches 5150 5167 +17
==========================================
+ Hits 29505 29567 +62
Misses 2348 2348
Partials 333 333
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
Added an additional test into the C suite for "ancestor-descendant" sample pairs |
53a8d74 to
be9f67f
Compare
|
Slight correction to the above: I think that |
| The output array has dimension `(windows, indexes, nodes)` with | ||
| dimensions dropped when the corresponding argument is set to None. |
There was a problem hiding this comment.
This should be amended to include the time_windows case, right?
There was a problem hiding this comment.
good catch, so nodes or time windows
| coalesce at position `x` (omitting "isolated" samples). If both | ||
| `span_normalise` and `pair_normalise` are true, then the output is | ||
| divided by the integral of `p(x)` over the sequence, will thus sum to | ||
| one over the "time" dimension (provided all non-isolated samples trace |
There was a problem hiding this comment.
wait but what if the output is nodes?
There was a problem hiding this comment.
I'm being sloppy here--- nodes are counted as the time dimension, and their values also sum to one (as time windowing is just summing node values within a window).
| set, then the output is divided by `(integral p(x) dx) / (num_samples * | ||
| (num_samples - 1) / 2)` (the average non-missing sequence length per |
There was a problem hiding this comment.
Is this right? since according to this paragraph, this is just span_normalise=True, pair_normalise=True multiplied by (num_samples choose 2)? (also btw this description doesn't work for distinct sample sets)
Note: I'll believe you if you say it is right; I don't remember where we got to on this.
There was a problem hiding this comment.
Let's see --- the denominator we're calculating here is "number of extant sample pairs summed across the contig" (that's p(x) at a particular coordinate x) then dividing it by the total number of pairs (however that is calculated, i.e. for distinct sample sets). So the output is just a scalar multiple of the "fully normalised" case.
Not sure what the question is exactly, let's chat about it in person then I can post a followup for clarity?
|
Could we record - hopefully in the documentation - concrete interpretations of what's being returned? Let's see, IIRC this is:
I'm not sure right now what the interpretations for the others are (there may not be a tidy interpretation for the case of missing data; but there should at least be for the no-missing-data case). |
|
Gee, this code is very nice, but there's a lot to get my head around. I haven't yet found whether we are also comparing to the naive method to check this for correctness - is that happening? |
Not in the code base, no --- were you thinking some sort of statistical check / benchmark baked into the unit tests? There's worked examples in there, but no "accuracy threshold" (which seems like it'd be a brittle test). I've tested outside (and could post a MWE to the parent issue, if that'd help, alongside the figure shown there). |
|
No, I mean a very naive "iterate over all pairs and count" method that we can compare to the efficient version (should be identical): we do have this in the testing code ( |
The code actually is excluding these cases (that's the "upwards propagation" bit of the update) -- are you saying that we shouldn't? This'll certainly simplify the code but will bias the downstream statistics (rates) if so. |
aha --- yes, the naive implementation does the correct thing for missing data already. There is a comparison on simulations the ("messy_ts" test fixture), though it could possibly use extending. |
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
Fixes #3459
There's the somewhat thorny issue of what the correct normalisation is when only one of$\int p(x) dx / (b - a)$ for "pair normalisation" (so the units are "sequence length per pair") -- this reduces to ${n \choose 2}$ when there's no isolated samples which agrees with the original implementation. Similarly I normalise by the "average pair span" $\int p(x) dx / {n \choose 2}$ for "span normalisation" (so that the units are "# of sample pairs") -- which simplifies to $b - a$ as per the original implementation where there are no isolated samples.
span_normaliseorpair_normaliseareTrue. In this case (using the notation of the linked issue), what I've done is normalise by the "average number of pairs"So the output (for the $i$th node or time window) is one of:
The above assumes that there's no missing trees over the interval--- if there are, replace$b - a$ by "nonmissing span" (the original implementation discounted missing span as well)