Differential APA strategies¶

Differential APA testing identifies poly(A) sites that are used at different rates between two or more cell populations. All three strategies operate on a per-cell count matrix of shape (n_cells, n_pas) and a cluster_labels Series mapping cell barcodes to cluster names. Each strategy applies Benjamini-Hochberg FDR correction across tested PAS and returns a DataFrame indexed by pas_id.

Choose a strategy based on the trade-off between speed, statistical rigor, and whether you need to compare two clusters or all clusters at once.

For CLI usage see ../cli/switch-diff.md.

`fisher`¶

Question: For each PAS within a gene, does the within-gene proportion of reads at that PAS differ significantly between cluster 1 and cluster 2?

Algorithm¶

Aggregate reads per PAS across all cells in cluster 1 and cluster 2 separately (agg1, agg2).
Group PAS by gene using pas_gene_map. When pas_gene_map is not supplied, a warning is logged and all PAS are treated as a single global group (legacy behavior, loses the APA-specific framing).
For each gene G with at least 2 PAS, and for each PAS p in G, construct a 2x2 contingency table:

                      cluster 1        cluster 2
reads at p           reads_p_c1       reads_p_c2
reads at other PAS   other_c1         other_c2

where other_cX = total gene reads in cX - reads_p_cX.

Run two-sided Fisher exact test via scipy.stats.fisher_exact.
Compute within-gene proportions prop1 = reads_p_c1 / gene_reads_c1 and prop2 = reads_p_c2 / gene_reads_c2.
Compute delta_proportion = prop1 - prop2 and log2fc = log2((prop2 + eps) / (prop1 + eps)) where eps = 1 / (max(gene_reads_c1, gene_reads_c2) + 1) prevents log(0).
Apply Benjamini-Hochberg FDR to all p-values using scipy.stats.false_discovery_control.

This framing is the canonical APA differential-usage design matching the DEXSeq-for-splicing paradigm: the test asks "for this gene, does this PAS change its within-gene share?" rather than "does this PAS change relative to all other PAS genome-wide?".

Inputs and outputs¶

Inputs:

count_matrix: pd.DataFrame shape (n_cells, n_pas)
cluster_labels: pd.Series indexed by cell barcode
cluster1, cluster2: cluster label strings (both required)
pas_gene_map: dict[str, str] mapping PAS ID string to gene ID (recommended)
min_cells_per_group: minimum cells per cluster (total count, not non-zero)

Output columns (indexed by pas_id):

Column	Type	Description
`pvalue`	float	Two-sided Fisher p-value
`qvalue`	float	BH FDR-corrected q-value
`n_cells`	int	Total cells in both clusters
`n_cells_cluster1`	int	Cells in cluster 1
`n_cells_cluster2`	int	Cells in cluster 2
`n_cells_expr_cluster1`	int	Cells with non-zero count at this PAS in cluster 1
`n_cells_expr_cluster2`	int	Cells with non-zero count at this PAS in cluster 2
`n_reads_pas_cluster1`	int	Total reads at this PAS in cluster 1
`n_reads_pas_cluster2`	int	Total reads at this PAS in cluster 2
`n_reads_gene_cluster1`	int	Total reads at all PAS of this gene in cluster 1
`n_reads_gene_cluster2`	int	Total reads at all PAS of this gene in cluster 2
`odds_ratio`	float	Fisher odds ratio
`delta_proportion`	float	prop1 - prop2 (within-gene proportions)
`log2fc`	float	log2(prop2 / prop1), positive = higher in cluster 2

Results are sorted by qvalue ascending.

Tunable hyperparameters¶

Parameter	Default	CLI flag	YAML key	Description
`min_cells_per_group`	10	`--min-cells-per-group`	`min_cells_per_group`	Minimum cells in each cluster for a PAS to be testable.
`fdr`	0.05	`--fdr`	`fdr`	BH FDR threshold applied by the caller when filtering results.

Interpretation¶

A significant result (qvalue < FDR threshold) with delta_proportion > 0 means PAS p captures a larger fraction of its gene's reads in cluster 1 than in cluster 2. The volcano figure plots log2fc vs -log10(qvalue); significant PAS appear in the upper-left (down in cluster 2) or upper-right (up in cluster 2) quadrants beyond the dashed threshold lines.

Audit a result row: if n_reads_gene_cluster1 = 5 the test was run on very few reads and the odds ratio is unreliable. Use n_reads_pas_cluster1 and n_reads_gene_cluster1 to filter low-coverage results before reporting.

Limitations¶

Pseudo-replication: Fisher operates on aggregated (pseudo-bulk) read counts. Treating all cells in a cluster as a single observation inflates statistical power. For rigorous cell-level testing use nb_pairwise.
Genes with fewer than 2 PAS are silently excluded (no within-gene comparison is possible).
Fisher is anti-conservative for overdispersed count data. A q-value of 0.01 from Fisher should be treated as directional evidence, not as a precise error rate.

`nb_pairwise`¶

Question: For each PAS individually, is the per-cell read count in cluster 1 significantly different from cluster 2 after correcting for library size?

Algorithm¶

Restrict the count matrix to cells in cluster 1 and cluster 2.
For each PAS, filter cells with fewer than min_cells_per_group non-zero counts in either group; skip that PAS.
Estimate per-PAS dispersion using a Negative Binomial MLE (statsmodels.discrete.discrete_model.NegativeBinomial, method="bfgs", maxiter=200). If the solver fails to converge, fall back to a method-of-moments estimate:
```
alpha_hat = (sample_variance - mean) / mean^2
```

Dispersion is clipped to [1e-4, 10.0] for numerical stability.

Fit a NB GLM with fixed dispersion:

count ~ cluster_indicator + offset(log(library_size))

using IRLS (method="irls", maxiter=200). The cluster indicator is 0 for cluster 1 and 1 for cluster 2.

Extract the Wald z-statistic and p-value for the cluster coefficient. log2fc = coefficient / log(2).
Apply BH FDR across all tested PAS.

Per-PAS GLM fits are embarrassingly parallel. Dense numpy slices (cells x 1 per PAS) are pre-extracted once before dispatch so workers never pickle large sparse matrices. Parallelism is capped by ResourceManager with per_worker_mb=300.

Inputs and outputs¶

Inputs:

count_matrix: pd.DataFrame shape (n_cells, n_pas)
cluster_labels: pd.Series
cluster1, cluster2: cluster label strings (both required)
min_cells_per_group: minimum non-zero cells per group per PAS

Output columns (indexed by pas_id):

Column	Type	Description
`pvalue`	float	Wald test p-value
`qvalue`	float	BH FDR-corrected q-value
`log2fc`	float	log2 fold change (cluster2 / cluster1)
`dispersion`	float	Estimated NB dispersion alpha
`n_cells`	int	Total cells in both clusters
`test_stat`	float	Wald z-statistic

Results are sorted by qvalue ascending.

Tunable hyperparameters¶

Parameter	Default	CLI flag	YAML key	Description
`min_cells_per_group`	10	`--min-cells-per-group`	`min_cells_per_group`	Minimum cells with non-zero counts in each cluster per PAS. Raise to reduce noisy low-count tests.
`fdr`	0.05	`--fdr`	`fdr`	BH FDR threshold applied when filtering results.

Interpretation¶

nb_pairwise operates at the cell level and correctly accounts for library size differences between clusters via the offset(log(library_size)) term. A positive log2fc means the PAS is more abundant in cluster 2. The dispersion column shows the per-PAS overdispersion; values near 0.0001 (the clip floor) may indicate model instability and should be treated cautiously.

The fallback_used flag (dropped from the output DataFrame but visible in logs) is set when the MLE solver failed and the method-of-moments path was used. Filter by dispersion < 9 to exclude rows where the dispersion hit the upper clip and the GLM may be unreliable.

Limitations¶

Significantly slower than Fisher (10-100x depending on cell count and parallelism). On 50,000 cells x 20,000 PAS with 8 workers, expect 30-90 minutes.
Does not implement a within-gene normalization. A PAS that simply has more reads because its gene is more expressed will appear significant. Combine with proportion output to distinguish expression changes from APA shifts.
Not available for multi-condition comparisons; use nb_multi when comparing three or more clusters simultaneously.

`nb_multi`¶

Question: Does the read count at each PAS differ significantly across all clusters at once?

Algorithm¶

Use all clusters in cluster_labels (not just a pair). Build a one-hot dummy matrix (K-1 columns, drop-first) plus an intercept column as the full model design matrix.
Per-PAS, require min_cells_per_group cells with non-zero counts total, and signal in at least 2 clusters; otherwise skip.
Estimate dispersion using a full-model NB MLE (same procedure as nb_pairwise, method="bfgs", maxiter=200, MoM fallback).
Fit full model: count ~ C(cluster) + offset(log(library_size)).
Fit null model: count ~ 1 + offset(log(library_size)).
Likelihood ratio test statistic: LRT = 2 * (LL_full - LL_null), which follows chi-squared with K - 1 degrees of freedom.
Apply BH FDR across all tested PAS.

Parallelized over PAS batches identically to nb_pairwise.

Inputs and outputs¶

Inputs:

count_matrix: pd.DataFrame shape (n_cells, n_pas)
cluster_labels: pd.Series — all unique values define the groups
cluster1, cluster2: accepted for interface compatibility but ignored
min_cells_per_group: minimum total non-zero cells across all clusters per PAS

Output columns (indexed by pas_id):

Column	Type	Description
`pvalue`	float	LRT chi-squared p-value
`qvalue`	float	BH FDR-corrected q-value
`test_stat`	float	LRT statistic (2 * delta log-likelihood)
`df`	int	Degrees of freedom (K - 1)
`dispersion`	float	NB dispersion alpha from full model
`n_cells`	int	Total cells across all clusters

Results are sorted by qvalue ascending. There is no log2fc column because the test is omnibus (K > 2 levels).

Tunable hyperparameters¶

Parameter	Default	CLI flag	YAML key	Description
`min_cells_per_group`	10	`--min-cells-per-group`	`min_cells_per_group`	Minimum total non-zero cells for a PAS to be tested. Requires signal in at least 2 clusters.
`fdr`	0.05	`--fdr`	`fdr`	BH FDR threshold applied when filtering results.

Interpretation¶

nb_multi tells you which PAS change across any cluster contrast without specifying which pair. It is the right first pass when you have many clusters and want a ranked list of the most dynamically regulated PAS. Follow up significant results from nb_multi with nb_pairwise on the cluster pairs of interest to get directional log2fc values.

A significant result with df = K - 1 degrees of freedom: the test_stat on a chi-squared distribution with K-1 df gives the p-value. Check that test_stat >> 0 for significant PAS; test_stat near 0 despite low p-value is a numerical artifact.

Limitations¶

No log2fc column. You must re-run nb_pairwise or examine cluster-level coefficients from the full model to identify direction and magnitude.
Requires at least 2 cluster levels; raises ValueError if fewer are found.
Runtime scales with K (number of clusters) and n_PAS. With K=10 clusters and 20,000 PAS, expect 2-5x the runtime of nb_pairwise for a two-cluster test.

Differential APA strategies¶

fisher¶

Algorithm¶

Inputs and outputs¶

Tunable hyperparameters¶

Interpretation¶

Limitations¶

nb_pairwise¶

Algorithm¶

Inputs and outputs¶

Tunable hyperparameters¶

Interpretation¶

Limitations¶

nb_multi¶

Algorithm¶

Inputs and outputs¶

Tunable hyperparameters¶

Interpretation¶

Limitations¶

`fisher`¶

`nb_pairwise`¶

`nb_multi`¶