Differential APA strategies¶
Differential APA testing identifies poly(A) sites that are used at different
rates between two or more cell populations. All three strategies operate on a
per-cell count matrix of shape (n_cells, n_pas) and a cluster_labels Series
mapping cell barcodes to cluster names. Each strategy applies Benjamini-Hochberg
FDR correction across tested PAS and returns a DataFrame indexed by pas_id.
Choose a strategy based on the trade-off between speed, statistical rigor, and whether you need to compare two clusters or all clusters at once.
For CLI usage see ../cli/switch-diff.md.
fisher¶
Question: For each PAS within a gene, does the within-gene proportion of reads at that PAS differ significantly between cluster 1 and cluster 2?
Algorithm¶
- Aggregate reads per PAS across all cells in cluster 1 and cluster 2
separately (
agg1,agg2). - Group PAS by gene using
pas_gene_map. Whenpas_gene_mapis not supplied, a warning is logged and all PAS are treated as a single global group (legacy behavior, loses the APA-specific framing). - For each gene G with at least 2 PAS, and for each PAS p in G, construct a 2x2 contingency table:
where other_cX = total gene reads in cX - reads_p_cX.
- Run two-sided Fisher exact test via
scipy.stats.fisher_exact. - Compute within-gene proportions
prop1 = reads_p_c1 / gene_reads_c1andprop2 = reads_p_c2 / gene_reads_c2. - Compute
delta_proportion = prop1 - prop2andlog2fc = log2((prop2 + eps) / (prop1 + eps))whereeps = 1 / (max(gene_reads_c1, gene_reads_c2) + 1)prevents log(0). - Apply Benjamini-Hochberg FDR to all p-values using
scipy.stats.false_discovery_control.
This framing is the canonical APA differential-usage design matching the DEXSeq-for-splicing paradigm: the test asks "for this gene, does this PAS change its within-gene share?" rather than "does this PAS change relative to all other PAS genome-wide?".
Inputs and outputs¶
Inputs:
count_matrix:pd.DataFrameshape(n_cells, n_pas)cluster_labels:pd.Seriesindexed by cell barcodecluster1,cluster2: cluster label strings (both required)pas_gene_map:dict[str, str]mapping PAS ID string to gene ID (recommended)min_cells_per_group: minimum cells per cluster (total count, not non-zero)
Output columns (indexed by pas_id):
| Column | Type | Description |
|---|---|---|
pvalue |
float | Two-sided Fisher p-value |
qvalue |
float | BH FDR-corrected q-value |
n_cells |
int | Total cells in both clusters |
n_cells_cluster1 |
int | Cells in cluster 1 |
n_cells_cluster2 |
int | Cells in cluster 2 |
n_cells_expr_cluster1 |
int | Cells with non-zero count at this PAS in cluster 1 |
n_cells_expr_cluster2 |
int | Cells with non-zero count at this PAS in cluster 2 |
n_reads_pas_cluster1 |
int | Total reads at this PAS in cluster 1 |
n_reads_pas_cluster2 |
int | Total reads at this PAS in cluster 2 |
n_reads_gene_cluster1 |
int | Total reads at all PAS of this gene in cluster 1 |
n_reads_gene_cluster2 |
int | Total reads at all PAS of this gene in cluster 2 |
odds_ratio |
float | Fisher odds ratio |
delta_proportion |
float | prop1 - prop2 (within-gene proportions) |
log2fc |
float | log2(prop2 / prop1), positive = higher in cluster 2 |
Results are sorted by qvalue ascending.
Tunable hyperparameters¶
| Parameter | Default | CLI flag | YAML key | Description |
|---|---|---|---|---|
min_cells_per_group |
10 | --min-cells-per-group |
min_cells_per_group |
Minimum cells in each cluster for a PAS to be testable. |
fdr |
0.05 | --fdr |
fdr |
BH FDR threshold applied by the caller when filtering results. |
Interpretation¶
A significant result (qvalue < FDR threshold) with delta_proportion > 0
means PAS p captures a larger fraction of its gene's reads in cluster 1 than
in cluster 2. The volcano figure plots log2fc vs -log10(qvalue);
significant PAS appear in the upper-left (down in cluster 2) or upper-right
(up in cluster 2) quadrants beyond the dashed threshold lines.
Audit a result row: if n_reads_gene_cluster1 = 5 the test was run on very
few reads and the odds ratio is unreliable. Use n_reads_pas_cluster1 and
n_reads_gene_cluster1 to filter low-coverage results before reporting.
Limitations¶
- Pseudo-replication: Fisher operates on aggregated (pseudo-bulk) read counts.
Treating all cells in a cluster as a single observation inflates statistical
power. For rigorous cell-level testing use
nb_pairwise. - Genes with fewer than 2 PAS are silently excluded (no within-gene comparison is possible).
- Fisher is anti-conservative for overdispersed count data. A q-value of 0.01 from Fisher should be treated as directional evidence, not as a precise error rate.
nb_pairwise¶
Question: For each PAS individually, is the per-cell read count in cluster 1 significantly different from cluster 2 after correcting for library size?
Algorithm¶
- Restrict the count matrix to cells in cluster 1 and cluster 2.
- For each PAS, filter cells with fewer than
min_cells_per_groupnon-zero counts in either group; skip that PAS. -
Estimate per-PAS dispersion using a Negative Binomial MLE (
statsmodels.discrete.discrete_model.NegativeBinomial, method="bfgs", maxiter=200). If the solver fails to converge, fall back to a method-of-moments estimate:
Dispersion is clipped to [1e-4, 10.0] for numerical stability.
- Fit a NB GLM with fixed dispersion:
count ~ cluster_indicator + offset(log(library_size))
using IRLS (method="irls", maxiter=200). The cluster indicator is 0 for cluster 1 and 1 for cluster 2.
-
Extract the Wald z-statistic and p-value for the cluster coefficient.
log2fc = coefficient / log(2). -
Apply BH FDR across all tested PAS.
Per-PAS GLM fits are embarrassingly parallel. Dense numpy slices (cells x 1
per PAS) are pre-extracted once before dispatch so workers never pickle large
sparse matrices. Parallelism is capped by ResourceManager with
per_worker_mb=300.
Inputs and outputs¶
Inputs:
count_matrix:pd.DataFrameshape(n_cells, n_pas)cluster_labels:pd.Seriescluster1,cluster2: cluster label strings (both required)min_cells_per_group: minimum non-zero cells per group per PAS
Output columns (indexed by pas_id):
| Column | Type | Description |
|---|---|---|
pvalue |
float | Wald test p-value |
qvalue |
float | BH FDR-corrected q-value |
log2fc |
float | log2 fold change (cluster2 / cluster1) |
dispersion |
float | Estimated NB dispersion alpha |
n_cells |
int | Total cells in both clusters |
test_stat |
float | Wald z-statistic |
Results are sorted by qvalue ascending.
Tunable hyperparameters¶
| Parameter | Default | CLI flag | YAML key | Description |
|---|---|---|---|---|
min_cells_per_group |
10 | --min-cells-per-group |
min_cells_per_group |
Minimum cells with non-zero counts in each cluster per PAS. Raise to reduce noisy low-count tests. |
fdr |
0.05 | --fdr |
fdr |
BH FDR threshold applied when filtering results. |
Interpretation¶
nb_pairwise operates at the cell level and correctly accounts for library
size differences between clusters via the offset(log(library_size)) term.
A positive log2fc means the PAS is more abundant in cluster 2. The
dispersion column shows the per-PAS overdispersion; values near 0.0001
(the clip floor) may indicate model instability and should be treated
cautiously.
The fallback_used flag (dropped from the output DataFrame but visible in
logs) is set when the MLE solver failed and the method-of-moments path was
used. Filter by dispersion < 9 to exclude rows where the dispersion hit
the upper clip and the GLM may be unreliable.
Limitations¶
- Significantly slower than Fisher (10-100x depending on cell count and parallelism). On 50,000 cells x 20,000 PAS with 8 workers, expect 30-90 minutes.
- Does not implement a within-gene normalization. A PAS that simply has
more reads because its gene is more expressed will appear significant.
Combine with
proportionoutput to distinguish expression changes from APA shifts. - Not available for multi-condition comparisons; use
nb_multiwhen comparing three or more clusters simultaneously.
nb_multi¶
Question: Does the read count at each PAS differ significantly across all clusters at once?
Algorithm¶
- Use all clusters in
cluster_labels(not just a pair). Build a one-hot dummy matrix (K-1 columns, drop-first) plus an intercept column as the full model design matrix. - Per-PAS, require
min_cells_per_groupcells with non-zero counts total, and signal in at least 2 clusters; otherwise skip. - Estimate dispersion using a full-model NB MLE (same procedure as
nb_pairwise, method="bfgs", maxiter=200, MoM fallback). - Fit full model:
count ~ C(cluster) + offset(log(library_size)). - Fit null model:
count ~ 1 + offset(log(library_size)). - Likelihood ratio test statistic:
LRT = 2 * (LL_full - LL_null), which follows chi-squared withK - 1degrees of freedom. - Apply BH FDR across all tested PAS.
Parallelized over PAS batches identically to nb_pairwise.
Inputs and outputs¶
Inputs:
count_matrix:pd.DataFrameshape(n_cells, n_pas)cluster_labels:pd.Series— all unique values define the groupscluster1,cluster2: accepted for interface compatibility but ignoredmin_cells_per_group: minimum total non-zero cells across all clusters per PAS
Output columns (indexed by pas_id):
| Column | Type | Description |
|---|---|---|
pvalue |
float | LRT chi-squared p-value |
qvalue |
float | BH FDR-corrected q-value |
test_stat |
float | LRT statistic (2 * delta log-likelihood) |
df |
int | Degrees of freedom (K - 1) |
dispersion |
float | NB dispersion alpha from full model |
n_cells |
int | Total cells across all clusters |
Results are sorted by qvalue ascending. There is no log2fc column because
the test is omnibus (K > 2 levels).
Tunable hyperparameters¶
| Parameter | Default | CLI flag | YAML key | Description |
|---|---|---|---|---|
min_cells_per_group |
10 | --min-cells-per-group |
min_cells_per_group |
Minimum total non-zero cells for a PAS to be tested. Requires signal in at least 2 clusters. |
fdr |
0.05 | --fdr |
fdr |
BH FDR threshold applied when filtering results. |
Interpretation¶
nb_multi tells you which PAS change across any cluster contrast without
specifying which pair. It is the right first pass when you have many clusters
and want a ranked list of the most dynamically regulated PAS. Follow up
significant results from nb_multi with nb_pairwise on the cluster pairs
of interest to get directional log2fc values.
A significant result with df = K - 1 degrees of freedom: the test_stat
on a chi-squared distribution with K-1 df gives the p-value. Check that
test_stat >> 0 for significant PAS; test_stat near 0 despite low p-value
is a numerical artifact.
Limitations¶
- No
log2fccolumn. You must re-runnb_pairwiseor examine cluster-level coefficients from the full model to identify direction and magnitude. - Requires at least 2 cluster levels; raises
ValueErrorif fewer are found. - Runtime scales with K (number of clusters) and n_PAS. With K=10 clusters
and 20,000 PAS, expect 2-5x the runtime of
nb_pairwisefor a two-cluster test.