Post-Detection PAS-Summit Merger¶

A strategy-agnostic merge step that runs after every strategy.find_pas(peak) call, collapsing adjacent PAS regions that are too close together or share too shallow a valley to be biologically distinct.

Why this step exists¶

The multi-PAS peak-calling strategies (lambda_gradient, sierra_iterative) produce one or more summit positions per peak, then split the peak's genomic span among them by midpoint cut. This split has two failure modes that no upstream knob catches:

Twin summits. On low-coverage regions, the smoothed gradient can show multiple shallow oscillations within a few base pairs. Each one clears the per-summit prominence threshold individually and gets emitted, producing pairs of PAS regions that differ by 1 bp at the partition boundary. No single read can span across both — they are not biologically distinguishable.
Shoulder peaks. Two summits 20–80 bp apart with only a shallow dip between them are emitted as separate PAS even when the valley sits at the local Poisson background. These are one peak with a noisy double-top.

Neither --merge-len (read-cluster grouping, upstream of any strategy) nor --pas-gap (cross-dataset BED merging, downstream of the full pipeline) addresses within-peak summit spacing. The post-detection merger fills that gap.

Two-tier design¶

The merger lives in ema/countmatrix/peackcalling.py and is wired into all three live execution paths (monolithic, pipeline, tile). Strategy-agnostic by construction — it operates on the [(start, end), …] contract that every strategy already returns.

Tier 1 — Distance gate¶

if next_region.start - prev_region.end < min_pas_spacing:
    merge them

min_pas_spacing defaults to -1 which triggers auto-detection of the median aligned read length from the BAM header (first 1,000 mapped reads). Two summits closer than one aligned read length cannot be resolved by any short-read library — a single read straddles both, so they must be the same PAS by construction.

Typical auto-detected values:

Library	Auto-detected `min_pas_spacing`
10x Chromium v2	~98 bp
10x Chromium v3	~91 bp
Smart-seq2	~150 bp
Long-read (Iso-Seq)	very large — Tier 2 dominates

Set --min-pas-spacing 0 to disable Tier 1 (e.g. when you trust the strategy's own distance handling).

Tier 2 — Lambda-anchored valley¶

If the gap passes Tier 1, the valley between the two regions is compared against a threshold supplied by the strategy itself:

class PeakFinderStrategy:
    def valley_threshold(self, peak, user_min_pas_prominence):
        return user_min_pas_prominence   # base: static fallback

class LambdaGradientStrategy(PeakFinderStrategy):
    def valley_threshold(self, peak, _user):
        return compute_lambda(           # override: dynamic, per-peak
            [h for _, h in peak.peak_list],
            method=self.lambda_method,
            floor_fraction=self.floor_fraction,
        )

Lambda strategies (lambda_poisson, lambda_gradient) override to return their own compute_lambda(heights) — the same background estimate they use for the region significance gate. Fully dynamic; the user's --min-pas-prominence is ignored for these strategies.
Non-lambda strategies (original, sierra_iterative) fall through to the base implementation which returns min_pas_prominence unchanged. The default value is 5.0 coverage units.

The pair is merged when valley_height < threshold. Set --min-pas-prominence -1 to disable Tier 2 regardless of strategy.

Reads in the merge gap are absorbed automatically¶

The pipeline's peak.cb_positions dictionary is populated densely across the entire parent peak — every read end position contributes, not only summit positions. The next pipeline call after merging, strategy.get_cb_dict_for_pas(peak, merged_start, merged_end), routes through reconstruct_cb_dict which slices cb_positions by inclusive range. Reads that fell in the gap between the two pre-merge regions are therefore added to the survivor's barcode counts without any explicit summing logic. The merger only needs to widen the genomic span; read absorption is a free consequence of how reconstruction already works.

Configuration¶

CLI flag	YAML key	Default	What 0/disabled means
`--min-pas-spacing N`	`min_pas_spacing: N`	`-1` (auto = median read length)	`0` disables Tier 1
`--min-pas-prominence X`	`min_pas_prominence: X`	`5.0`	`-1` (or any negative) disables Tier 2

Combinations:

# Defaults: both tiers active (recommended)
ema run --config example.yaml

# Distance only — for strategies that already have good valley logic
ema run --config example.yaml --min-pas-prominence -1

# Prominence only — for libraries where you don't trust the read-length
# auto-detect, but still want lambda-driven valley collapse
ema run --config example.yaml --min-pas-spacing 0

# Baseline (no merge at all) — exact pre-merger behaviour, for reproducibility
ema run --config example.yaml --min-pas-spacing 0 --min-pas-prominence -1

Wiring across execution paths¶

The merger runs in all three live peak-calling code paths:

Monolithic (peackcalling.py, default sequential): 5 find_pas call sites, each followed by a one-line merge_close_or_low_prominence(…).
Pipeline (peak_pipeline.py, --pipeline): merger applied inside the writer subprocess after every find_pas. Parameters passed at spawn time.
Tile (tile_runner.py, --tiles): each tile worker calls the monolithic peak_calling() and inherits the merger automatically. Sentinel resolution happens once at pool startup so tile workers don't re-scan the BAM header.

Equivalence between the monolithic and pipeline paths is verified on the chr22 test BAM with lambda_gradient: both produce bit-identical 252-PAS output.

Effect on the reference v9 dataset¶

On SRR8325947 with lambda_gradient and default merger settings:

Metric	Before merger	After merger	Δ
Total PAS	15,948	15,568	−380
Same-strand pairs < 30 bp	371	0	−371
Filtered cells	752	752	0
Leiden clusters	11	11	0
Genes (classic length)	1,925	1,898	−27
Genes (proportion length)	5,339	5,407	+68
Genes (shannon entropy)	5,339	5,407	+68

Counter-intuitively, the gene count for proportion/shannon length increases — when twin PAS whose narrow split-boundaries straddled a gene boundary are merged into one wider region, the merged region's overlap with a single gene becomes unambiguous and the gene becomes eligible for length-shift analysis.