Visualization strategies¶

Every figure in PeakATail is produced by a registered VizStrategy subclass. Strategies are grouped by plot_type (what the figure shows) and tagged by engine (matplotlib, plotly, or scanpy). Multiple engines can be rendered in a single call via render_all.

Engines are selected at runtime with --plot-engine matplotlib, plotly, both, or none. Matplotlib produces static PNG + SVG files. Plotly produces interactive HTML files. Scanpy produces figures via its built-in plotting functions.

Every figure has a companion <stem>.meta.json sidecar file. The sidecar records the strategy name, generation timestamp, PeakATail version, and plot-specific provenance keys. Two manifests, figures_INDEX.md and figures_INDEX.json, are written to the top of each figures/ directory and index all sidecars for human and machine consumption. See the Figure sidecar convention section below.

`umap`¶

Registered names: umap_matplotlib, umap_plotly, umap_scanpy

Data shape: (adata, ds_id) — an AnnData with obsm["X_umap"] and optionally obs["leiden"], plus a dataset identifier string.

What it shows: A 2-D scatter plot of cells in UMAP space. Points are colored by Leiden cluster label when obs["leiden"] is present; otherwise rendered in a single color.

Figure: UMAP example, 1051 cells coloured by leiden cluster. Real output from the full_v8 reference run, with 12 leiden clusters.

When to interpret it: After clustering, as a sanity check that clusters are well-separated and biologically interpretable. Cells of the same cluster should form spatially coherent regions. Scattered single-cell islands in the middle of a large cluster usually indicate that n_neighbors is too low.

`cluster_sizes`¶

Figure: Cluster sizes, a bar plot of cells per leiden cluster. Real output: 12 clusters ranging from 14 cells (cluster 11) to 210 cells (cluster 2).

Registered names: cluster_sizes_matplotlib, cluster_sizes_plotly

Data shape: (adata, ds_id) — AnnData with obs["leiden"] populated.

What it shows: A bar chart with one bar per cluster, annotated with the exact cell count. Bars are sorted by natural cluster label order.

When to interpret it: Immediately after clustering. One dominant cluster containing > 80% of cells combined with many tiny clusters (< 5 cells) is a sign that resolution is too high or that n_neighbors is too large for the dataset size.
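
The size check described above can be approximated outside PeakATail. The following is a minimal sketch, assuming cluster labels are available as a plain list of strings (for example, exported from obs["leiden"]); the helper names `natural_key`, `cluster_sizes`, and `size_warnings` are hypothetical and not part of the PeakATail API.

```python
import re
from collections import Counter


def natural_key(label):
    """Sort key that orders '2' before '10' (natural label order)."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", label)]


def cluster_sizes(labels):
    """Count cells per cluster, ordered by natural cluster label."""
    counts = Counter(labels)
    return {k: counts[k] for k in sorted(counts, key=natural_key)}


def size_warnings(sizes, dominant_frac=0.8, tiny_cells=5):
    """Return (dominant clusters, tiny clusters) using the thresholds from the text."""
    total = sum(sizes.values())
    dominant = [k for k, n in sizes.items() if n / total > dominant_frac]
    tiny = [k for k, n in sizes.items() if n < tiny_cells]
    return dominant, tiny


# Toy labels: one large cluster plus three very small ones.
labels = ["0"] * 90 + ["1"] * 4 + ["2"] * 3 + ["10"] * 3
sizes = cluster_sizes(labels)
dominant, tiny = size_warnings(sizes)
print(sizes, dominant, tiny)
```

If both lists are non-empty, the clustering parameters are worth revisiting before any downstream comparison.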

`peak_qc`¶

Registered names: peak_qc_matplotlib, peak_qc_plotly

Data shape: dict with keys:

| Key | Type | Content |
| --- | --- | --- |
| `peaks_per_chrom` | `dict[str, int]` | Number of peaks per chromosome |
| `per_cell_pas` | `list[int]` | Number of PAS detected per cell |
| `peak_widths` | `list[int]` | Width in bp of each peak |
| `per_cell_reads` | `list[int]` | Total reads per cell |

What it shows: A 2x2 panel: (1) peaks per chromosome bar chart, (2) histogram of PAS detected per cell, (3) histogram of peak widths in bp, (4) histogram of reads per cell.

When to interpret it: Before clustering, to validate peak-calling quality. A bimodal per_cell_pas distribution may indicate a doublet population. Very wide peaks (> 500 bp) may indicate that the peak-calling parameters need tighter window constraints.

Figure: peak_qc 4-panel QC. Real output from the full_v8 run: chromosomal coverage (top-left), PAS-per-cell distribution (top-right), peak-width histogram (bottom-left), reads-per-cell (bottom-right).
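
As an illustration of the input shape, the sketch below builds a toy peak_qc dictionary with the four keys from the table and computes the share of peaks wider than 500 bp. The values are invented for the example, and `wide_peak_fraction` is a hypothetical helper, not a PeakATail function.

```python
def wide_peak_fraction(peak_widths, max_width_bp=500):
    """Fraction of peaks wider than max_width_bp (0.0 for an empty list)."""
    if not peak_widths:
        return 0.0
    return sum(w > max_width_bp for w in peak_widths) / len(peak_widths)


# Toy input with the four documented keys; numbers are illustrative only.
peak_qc_data = {
    "peaks_per_chrom": {"chr1": 3, "chr2": 1},
    "per_cell_pas": [12, 15, 40, 11],
    "peak_widths": [120, 80, 650, 300],
    "per_cell_reads": [900, 1100, 2600, 850],
}

frac = wide_peak_fraction(peak_qc_data["peak_widths"])
print(f"{frac:.0%} of peaks exceed 500 bp")
```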

`volcano`¶

Registered names: volcano_matplotlib, volcano_plotly

Data shape: Either a pd.DataFrame with columns ["log2fc", "qvalue", "pas_id"] (legacy), or a dict with:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `df` | DataFrame | required | Fisher or NB result with `log2fc`, `qvalue` columns |
| `fdr` | float | 0.05 | FDR threshold for the horizontal dashed line |
| `log2fc_thresh` | float | 1.0 | log2fc threshold for vertical dashed lines |
| `n_label` | int | 10 | Top-N points annotated with `gene_id` (if column present) |
| `cluster1`, `cluster2` | str | - | Included in figure title and meta sidecar |
| `strategy` | str | - | Test strategy name, included in title |
| `n_cells_cluster1`, `n_cells_cluster2` | int | - | Cell counts, included in title |

What it shows: A scatter of all tested PAS. X-axis: log2fc (positive = higher in cluster 2). Y-axis: -log10(qvalue). Horizontal dashed line at -log10(FDR). Vertical dashed lines at ±log2fc_thresh. Significant points colored red (up in cluster 2) or blue (down in cluster 2).

When to interpret it: After ema switch diff. The upper-right quadrant contains PAS used more in cluster 2; the upper-left quadrant contains PAS used more in cluster 1. Points above the horizontal line but between the vertical lines are statistically significant with a small effect size; approach them with caution.

Figure: Volcano, cluster 0 vs 4. Real output from ema switch diff on the full_v8 run. Red dots = PAS used more in cluster 4; blue dots = PAS used more in cluster 0. The top-N genes are auto-labeled.
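
The colouring rule can be expressed in a few lines. This is a sketch under stated assumptions: whether PeakATail treats the thresholds as inclusive or strict is not documented here, so the comparison operators below are a guess, and `volcano_point` is a hypothetical helper name.

```python
import math


def volcano_point(log2fc, qvalue, fdr=0.05, log2fc_thresh=1.0):
    """Return (y, category) for one PAS, where y = -log10(qvalue)."""
    # Clamp qvalue so that a reported 0.0 does not break log10.
    y = -math.log10(max(qvalue, 1e-300))
    significant = qvalue < fdr  # assumption: strict inequality
    if significant and log2fc >= log2fc_thresh:
        category = "up_in_cluster2"    # drawn red
    elif significant and log2fc <= -log2fc_thresh:
        category = "down_in_cluster2"  # drawn blue
    else:
        category = "not_highlighted"
    return y, category


# (pas_id, log2fc, qvalue) toy rows
rows = [
    ("pas_1", 2.0, 0.001),
    ("pas_2", -1.5, 0.01),
    ("pas_3", 0.3, 0.001),  # significant, but small effect size
    ("pas_4", 3.0, 0.5),    # large effect, not significant
]
categories = {pas: volcano_point(fc, q)[1] for pas, fc, q in rows}
print(categories)
```

Note that pas_3 clears the FDR line without being highlighted, which is the "significant but small effect" case that deserves caution.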

`pdui_distribution`¶

Registered names: pdui_distribution_matplotlib, pdui_distribution_plotly, pdui_distribution_scanpy

Data shape: (adata, score_key) — AnnData with obs["leiden"] and obs[score_key] populated with per-cell PDUI scores (float in [0, 1]).

What it shows: Per-cluster violin plots of the PDUI score. Each violin shows the distribution of PDUI values across all cells in that cluster. Medians are shown as horizontal lines.

When to interpret it: After running ema switch length with --pdui-method classic. Clusters with PDUI median near 1.0 tend toward longer 3' UTRs. Clusters with median near 0.0 tend toward shorter UTRs. Bimodal violins within a cluster indicate heterogeneous PAS usage and may warrant sub-clustering.

Figure: PDUI distribution per cluster. Real output: per-cluster violins of mean_pdui across the 12 leiden clusters.
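
The per-cluster medians drawn on the violins can be reproduced from two parallel lists. The sketch below assumes the cluster labels and PDUI scores have been exported from obs as plain Python lists; `per_cluster_median` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def per_cluster_median(clusters, scores):
    """Group per-cell scores by cluster label and return the median of each group."""
    grouped = defaultdict(list)
    for cluster, score in zip(clusters, scores):
        grouped[cluster].append(score)
    return {cluster: median(values) for cluster, values in grouped.items()}


# Toy data: cluster "0" leans distal (long UTR), cluster "1" leans proximal.
clusters = ["0", "0", "0", "1", "1", "1"]
scores = [0.9, 0.8, 1.0, 0.1, 0.2, 0.0]
medians = per_cluster_median(clusters, scores)
print(medians)
```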

`proportion_heatmap`¶

Registered names: proportion_heatmap_matplotlib, proportion_heatmap_plotly

Data shape: dict or (pdui_df, adata) tuple:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `pdui_df` | DataFrame | required | Output of `proportion` strategy (long format) |
| `adata` | AnnData | required | AnnData with `obs["leiden"]` |
| `cluster_key` | str | `"leiden"` | Obs column for cluster labels |
| `top_n` | int | 50 | Number of most-variable PAS to display |

What it shows: A PAS x cluster heatmap. Rows are the top-N PAS by cross-cluster variance (most discriminating). Columns are clusters. Cell values are mean within-gene proportion across cells of that cluster. Color scale: viridis (dark = 0, bright = 1).

When to interpret it: After ema switch length --pdui-method proportion. A PAS with a bright cell in one cluster and a dark cell in another is shifting its within-gene proportion between those clusters. Use this as a visual triage before running fisher or nb_pairwise.
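
Row selection by cross-cluster variance can be sketched as follows. The input here is a nested dict of mean proportions per PAS and cluster, which is an assumed intermediate rather than the documented long-format DataFrame, and the use of population variance is also an assumption; `top_variable_pas` is a hypothetical helper.

```python
from statistics import pvariance


def top_variable_pas(mean_props, top_n=50):
    """Rank PAS by variance of their mean proportion across clusters."""
    variance = {
        pas: pvariance(list(by_cluster.values()))
        for pas, by_cluster in mean_props.items()
    }
    return sorted(variance, key=variance.get, reverse=True)[:top_n]


# {pas_id: {cluster: mean within-gene proportion}}; toy values.
mean_props = {
    "pas_a": {"0": 0.1, "1": 0.9},  # strong shift between clusters
    "pas_b": {"0": 0.5, "1": 0.5},  # no shift
    "pas_c": {"0": 0.4, "1": 0.6},  # mild shift
}
selected = top_variable_pas(mean_props, top_n=2)
print(selected)
```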

`entropy_distribution`¶

Registered names: entropy_distribution_matplotlib, entropy_distribution_plotly

Data shape: (adata, score_key) — AnnData with obs["leiden"] and obs[score_key] holding per-cell Shannon entropy values.

What it shows: Per-cluster violin plots of Shannon entropy. Mirrors pdui_distribution but for the entropy metric. Color scale uses viridis rather than tab10.

When to interpret it: After ema switch length --pdui-method shannon. Progenitor or cycling cell populations often show higher entropy (more uniform PAS usage) than terminally differentiated cells. Clusters where the violin is collapsed near 0 bits contain cells with highly focused PAS usage.
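
For reference, Shannon entropy in bits over a cell's PAS read counts is H = -Σ p_i log2(p_i), where p_i is the fraction of reads at PAS i. The sketch below shows the two extremes mentioned above; how PeakATail aggregates entropy across genes is not covered here, and `shannon_entropy_bits` is a hypothetical helper.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (bits) of a vector of non-negative read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts contribute nothing (the limit of p*log2(p) as p -> 0 is 0).
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


uniform = shannon_entropy_bits([10, 10, 10, 10])  # four PAS used equally
focused = shannon_entropy_bits([40, 0, 0, 0])     # a single PAS used
print(uniform, focused)
```

Four equally used PAS give 2 bits, the maximum for four sites; a single dominant PAS gives 0 bits.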

`diff_agreement`¶

Registered names: diff_agreement_matplotlib, diff_agreement_plotly

Data shape: dict[str, set[str]] mapping strategy name to the set of significant PAS IDs declared by that strategy.

What it shows: A symmetric N x N heatmap where each cell [i, j] is the Jaccard similarity between the significant PAS sets of strategy i and strategy j. Diagonal = 1.0 (each strategy agrees with itself). Off-diagonal values show pairwise strategy agreement. Annotated with numeric values.

When to interpret it: When running ema switch diff with multiple --diff-method values simultaneously. High Jaccard (> 0.7) between fisher and nb_pairwise indicates the results are robust. Low Jaccard (< 0.3) indicates the two tests are sensitive to different PAS or that one is anti-conservative (Fisher) relative to the other.

This plot is rendered only when at least two strategies are provided; otherwise the strategy returns an empty list.
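
The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch of the matrix computation follows; treating two empty sets as identical (similarity 1.0) is an assumption made so the diagonal stays at 1.0, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_matrix(sig_sets):
    """Return (strategy names, symmetric N x N similarity matrix)."""
    names = list(sig_sets)
    matrix = [[jaccard(sig_sets[i], sig_sets[j]) for j in names] for i in names]
    return names, matrix


# Toy significant-PAS sets for two test strategies.
sig_sets = {
    "fisher": {"pas_a", "pas_b", "pas_c", "pas_d"},
    "nb_pairwise": {"pas_b", "pas_c", "pas_d", "pas_e"},
}
names, matrix = jaccard_matrix(sig_sets)
print(names, matrix)
```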

`length_shifts`¶

Registered names: length_shifts_matplotlib, length_shifts_plotly

Data shape: pd.DataFrame indexed by gene_id with one column per cluster pair (e.g., "0_vs_1", "0_vs_2"). Cell values are PDUI delta (positive = lengthening in cluster 2, negative = shortening).

What it shows: A gene x cluster-pair heatmap using a red-blue diverging colormap. Red = 3' UTR lengthening in cluster 2. Blue = shortening. Rows are capped at the top 50 genes by maximum absolute delta to keep the figure readable.

When to interpret it: As a summary view across multiple cluster pair comparisons. Genes that shift across many pairs (multiple colored cells in a row) show consistent APA regulation. Genes that shift only in one pair may reflect cluster-specific biology or low-coverage noise.
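
The row cap and the "shifts across many pairs" reading can both be sketched with plain dictionaries. The nested dict stands in for the gene x cluster-pair DataFrame, and the 0.2 cutoff used to count a pair as shifted is purely illustrative; both helpers are hypothetical.

```python
def top_genes_by_shift(deltas, max_rows=50):
    """Rank genes by their maximum absolute PDUI delta over all cluster pairs."""
    ranked = sorted(
        deltas,
        key=lambda gene: max(abs(v) for v in deltas[gene].values()),
        reverse=True,
    )
    return ranked[:max_rows]


def n_shifted_pairs(deltas, min_abs_delta=0.2):
    """Count, per gene, the cluster pairs whose |delta| reaches the cutoff."""
    return {
        gene: sum(abs(v) >= min_abs_delta for v in by_pair.values())
        for gene, by_pair in deltas.items()
    }


# {gene_id: {cluster_pair: PDUI delta}}; toy values.
deltas = {
    "GENE_A": {"0_vs_1": 0.4, "0_vs_2": 0.5},    # shifts in both pairs
    "GENE_B": {"0_vs_1": -0.7, "0_vs_2": 0.0},   # shifts in one pair only
    "GENE_C": {"0_vs_1": 0.05, "0_vs_2": -0.02}, # essentially flat
}
top = top_genes_by_shift(deltas, max_rows=2)
shifted = n_shifted_pairs(deltas)
print(top, shifted)
```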

`pas_overlap`¶

Registered names: pas_overlap_matplotlib, pas_overlap_plotly

Data shape: dict[str, set[str]] mapping dataset ID to the set of PAS identifiers detected in that dataset.

What it shows: For 2 datasets: a 3-bar Venn-equivalent showing counts of PAS unique to dataset A, shared by both, and unique to dataset B. For 3+ datasets: an UpSet plot (via the upsetplot library) showing intersection sizes for all combinations.

When to interpret it: After merging multiple datasets to assess how much of the PAS landscape is shared. A low overlap fraction (< 30% shared) suggests that the datasets sample different parts of the transcriptome or that detection sensitivity differs substantially between them.
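
For the two-dataset case, the three bar heights are simple set operations. In the sketch below the shared fraction is defined as shared / union, which is one reasonable reading of "overlap fraction" but is an assumption; `two_dataset_overlap` is a hypothetical helper.

```python
def two_dataset_overlap(pas_sets):
    """Counts unique to A, shared, and unique to B, plus shared / union."""
    if len(pas_sets) != 2:
        raise ValueError("expected exactly two datasets")
    (name_a, a), (name_b, b) = pas_sets.items()
    union = a | b
    return {
        f"only_{name_a}": len(a - b),
        "shared": len(a & b),
        f"only_{name_b}": len(b - a),
        "shared_fraction": len(a & b) / len(union) if union else 0.0,
    }


# Toy PAS identifier sets for two datasets.
pas_sets = {
    "ds1": {"pas_1", "pas_2", "pas_3", "pas_4"},
    "ds2": {"pas_3", "pas_4", "pas_5"},
}
overlap = two_dataset_overlap(pas_sets)
print(overlap)
```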

`atlas_snap_diag`¶

Registered names: atlas_snap_diag_matplotlib, atlas_snap_diag_plotly

Data shape: dict with keys:

| Key | Type | Description |
| --- | --- | --- |
| `snapped` | int | Number of peaks that matched an atlas entry within the distance threshold |
| `unsnapped` | int | Number of peaks beyond the threshold |
| `snap_distances` | `list[int]` | Distance in bp for each snapped peak |

What it shows: A 2-panel figure. Left: bar chart of snapped vs. unsnapped counts with percentage. Right: histogram of snap distances with the median marked by a red dashed line.

When to interpret it: After atlas snapping in ema run. A high unsnapped fraction (> 30%) may indicate that the snap distance threshold is too tight or that the atlas does not cover the tissue type being analyzed. A median snap distance > 50 bp suggests the atlas PAS are not well-calibrated to this protocol's read 3'-end distribution.
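
The two diagnostics named above, unsnapped fraction and median snap distance, follow directly from the input dictionary. A minimal sketch with invented numbers; `snap_summary` is a hypothetical helper.

```python
from statistics import median


def snap_summary(diag):
    """Unsnapped fraction and median snap distance from an atlas_snap_diag dict."""
    total = diag["snapped"] + diag["unsnapped"]
    distances = diag["snap_distances"]
    return {
        "unsnapped_frac": diag["unsnapped"] / total if total else 0.0,
        "median_distance": median(distances) if distances else None,
    }


# Toy diagnostics: 6 snapped peaks, 4 unsnapped.
diag = {
    "snapped": 6,
    "unsnapped": 4,
    "snap_distances": [2, 5, 8, 10, 60, 70],
}
summary = snap_summary(diag)
print(summary)
```

Here the unsnapped fraction (40%) is above the 30% guideline while the median distance (9 bp) is well under 50 bp, which points at atlas coverage or threshold choice more than at calibration.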

`cluster_match_sankey`¶

Registered names: cluster_match_sankey_matplotlib, cluster_match_sankey_plotly

Data shape: pd.DataFrame from ClusterMatchStrategy.match() with columns dataset_id, original_cluster, canonical_cluster, match_confidence, matched_to.

What it shows: One stacked bar column per dataset. Each bar segment represents one original cluster colored by its canonical cluster ID. Segment opacity encodes match_confidence (opaque = high confidence). A legend maps canonical cluster colors. Segment labels show original → canonical mapping.

When to interpret it: After ema switch match. A canonical cluster that appears in all dataset columns with similar color means all datasets agree that cell population exists. A canonical cluster appearing in only one dataset column is a population unique to that sample.

`match_confidence`¶

Registered names: match_confidence_matplotlib, match_confidence_plotly

Data shape: pd.DataFrame from ClusterMatchStrategy.match() (same as cluster_match_sankey).

What it shows: A heatmap with rows labeled {dataset}:{original_cluster} and columns labeled by canonical cluster ID. Cell value is match_confidence (0 when no match). Color: YlOrRd (light yellow = 0, dark red = 1). Cells with confidence > 0 are annotated with the numeric value.

When to interpret it: As a companion to cluster_match_sankey for auditing specific cluster pairs. High off-diagonal values (a cluster matched to two canonical IDs with similar confidence) indicate ambiguous assignment and may warrant re-running with a different n_top_markers or switching from marker_overlap to mnn.

`tile_timing`¶

Registered names: tile_timing_matplotlib, tile_timing_plotly

Data shape: list[dict] — per-tile timing records with schema:

| Key | Type | Description |
| --- | --- | --- |
| `dataset_id` | str | Dataset identifier |
| `chrom` | str | Chromosome name |
| `tile_idx` | int | Tile index within chromosome |
| `tile_start`, `tile_end` | int | Genomic coordinates |
| `wall_seconds` | float | Elapsed wall time for this tile |

What it shows: One subplot per dataset. Bar chart sorted descending by wall_seconds. Tiles more than 2 standard deviations above the mean are highlighted red (outliers).

When to interpret it: After ema run to diagnose performance. Outlier tiles (red bars) on specific chromosomes often indicate high coverage regions (e.g., mitochondrial chromosome, highly expressed ribosomal genes) that cause clustering of reads and slow the peak-calling step. Use these to tune --threads or tile size parameters.
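
The outlier rule (more than 2 standard deviations above the mean) can be sketched as below. Whether PeakATail uses the population or the sample standard deviation is not stated, so the use of `pstdev` is an assumption, as is computing the cutoff over whatever records are passed in; `outlier_tiles` is a hypothetical helper.

```python
from statistics import mean, pstdev


def outlier_tiles(records, n_sd=2.0):
    """Tiles whose wall time exceeds mean + n_sd * SD, slowest first."""
    times = [r["wall_seconds"] for r in records]
    if len(times) < 2:
        return []
    cutoff = mean(times) + n_sd * pstdev(times)
    ranked = sorted(records, key=lambda r: r["wall_seconds"], reverse=True)
    return [r for r in ranked if r["wall_seconds"] > cutoff]


# Nine fast tiles on chr1 and one slow tile on chrM (toy values).
records = [
    {"dataset_id": "ds1", "chrom": "chr1", "tile_idx": i,
     "tile_start": i * 1000, "tile_end": (i + 1) * 1000, "wall_seconds": 1.0}
    for i in range(9)
]
records.append(
    {"dataset_id": "ds1", "chrom": "chrM", "tile_idx": 0,
     "tile_start": 0, "tile_end": 16000, "wall_seconds": 10.0}
)
slow = outlier_tiles(records)
print([(r["chrom"], r["tile_idx"]) for r in slow])
```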

`resource_timeline`¶

Registered names: resource_timeline_matplotlib, resource_timeline_plotly

Data shape: dict with keys:

| Key | Type | Description |
| --- | --- | --- |
| `samples` | `list[dict]` | Periodic samples: `{"elapsed_s": float, "rss_gb": float, "cpu_pct": float}` |
| `annotations` | `list[dict]` | Stage markers: `{"label": str, "elapsed_s": float}` |

What it shows: A dual-axis line chart. Left y-axis: RSS in GB (blue line with shaded fill). Right y-axis: CPU % (orange dashed line). Vertical grey dotted lines mark pipeline stage transitions with rotated labels.

When to interpret it: After any ema run to understand memory and CPU usage over time. A flat RSS line followed by a sudden spike indicates a step that materializes a large array (e.g., the count matrix densification during clustering). CPU dropping to near 0% between stage transitions indicates I/O-bound steps or the GIL blocking parallelism.
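
To attribute a memory spike to a pipeline stage, the peak sample can be matched against the most recent stage marker. A minimal sketch using the documented input shape; the stage labels and numbers are invented, and `peak_rss_stage` is a hypothetical helper.

```python
def peak_rss_stage(timeline):
    """Return (peak RSS in GB, label of the stage active at the peak)."""
    peak = max(timeline["samples"], key=lambda s: s["rss_gb"])
    stage = None
    # The active stage is the last marker at or before the peak sample.
    for ann in sorted(timeline["annotations"], key=lambda a: a["elapsed_s"]):
        if ann["elapsed_s"] <= peak["elapsed_s"]:
            stage = ann["label"]
    return peak["rss_gb"], stage


timeline = {
    "samples": [
        {"elapsed_s": 0.0, "rss_gb": 1.0, "cpu_pct": 95.0},
        {"elapsed_s": 10.0, "rss_gb": 1.1, "cpu_pct": 90.0},
        {"elapsed_s": 20.0, "rss_gb": 6.5, "cpu_pct": 40.0},
        {"elapsed_s": 30.0, "rss_gb": 2.0, "cpu_pct": 5.0},
    ],
    "annotations": [
        {"label": "load", "elapsed_s": 0.0},
        {"label": "cluster", "elapsed_s": 15.0},
        {"label": "write", "elapsed_s": 28.0},
    ],
}
peak_gb, peak_stage = peak_rss_stage(timeline)
print(peak_gb, peak_stage)
```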

`gene_track`¶

Registered names: gene_track_matplotlib, gene_track_plotly

Data shape: A GenePanel dataclass (from ema.viz._gene_track_helpers) or dict with "panel" key and optional "top_n_clusters" (int, default 12):

GenePanel fields:

| Field | Type | Description |
| --- | --- | --- |
| `gene_id` | str | Gene identifier |
| `chrom`, `start`, `end` | str, int, int | Genomic span |
| `strand` | str | `"+"` or `"-"` |
| `pas_ids` | `list[int]` | PAS identifiers in genomic order |
| `pas_positions` | `list[int]` | Genomic coordinate of each PAS |
| `clusters` | `list[str]` | Cluster labels in display order |
| `n_cells_per_cluster` | `list[int]` | Cell count per cluster |
| `reads` | `ndarray` (n_clusters, n_pas) | Total reads per PAS per cluster |
| `reads_per_cell` | `ndarray` (n_clusters, n_pas) | reads / n_cells_per_cluster |
| `proportions` | `ndarray` (n_clusters, n_pas) | Within-gene proportion (rows sum to ~1) |
| `isoforms` | `list[tuple[str, list[tuple[int, int]]]]` | Optional: (transcript_id, [(exon_start, exon_end), ...]) |

What it shows: A stacked subplot figure for one gene. When isoform data is present, the top row shows exon bars per isoform with intron backbone lines and strand arrows. Below it, one row per cluster shows vertical bars at each PAS position. Bar height = reads/cell (depth-normalized). Bar color = within-gene proportion on the viridis scale (dark = 0%, bright = 100%). Each bar is annotated with the proportion percentage. A shared colorbar legend is added at the right.

Up to top_n_clusters (default 12) clusters are rendered, prioritized by total read count at the gene. The y-axis cap is the 95th percentile of reads/cell across all rendered clusters, preventing one outlier PAS from squashing the other bars.
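
The two derived arrays that drive bar height and bar color can be illustrated with nested lists in place of ndarrays. This sketch assumes every cluster has at least one cell; `normalize_reads` is a hypothetical helper and not the GenePanel constructor.

```python
def normalize_reads(reads, n_cells_per_cluster):
    """Derive reads/cell (bar height) and within-gene proportion (bar color).

    reads is a list of rows, one per cluster, each holding reads per PAS.
    """
    reads_per_cell, proportions = [], []
    for row, n_cells in zip(reads, n_cells_per_cluster):
        total = sum(row)
        reads_per_cell.append([r / n_cells for r in row])
        proportions.append([r / total if total else 0.0 for r in row])
    return reads_per_cell, proportions


# Two clusters x two PAS (proximal, distal); toy counts.
reads = [[30, 10], [5, 15]]
n_cells_per_cluster = [10, 5]
reads_per_cell, proportions = normalize_reads(reads, n_cells_per_cluster)
print(reads_per_cell)
print(proportions)
```

In this toy gene the first cluster puts 75% of its reads on the proximal PAS and the second cluster 75% on the distal PAS, even though the depth-normalized bar heights span the same range.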

When to interpret it: After ema switch geneview for a specific gene. A gene with a dominant distal bar in one cluster and a dominant proximal bar in another cluster is an APA candidate. The proportion annotation makes it easy to see whether a visual shift in bar height is meaningful (e.g., 80% vs 20%) or marginal.

The figure title includes chromosome coordinates and strand. Use it to cross-reference the gene structure with a genome browser.

Figure sidecar convention¶

Every figure written by save_matplotlib or the plotly equivalent is accompanied by a <stem>.meta.json sidecar file at the same path. The sidecar is written by ema.viz._meta.write_figure_meta.

Fixed schema keys (always present, never overwritten by the caller):

| Key | Description |
| --- | --- |
| `figure_name` | Filename stem of the figure |
| `viz_strategy` | Registered strategy name (e.g. `"umap_matplotlib"`) |
| `generated_at` | ISO 8601 UTC timestamp |
| `peakatail_version` | Version string from `importlib.metadata` |

Plot-specific keys are merged in by the strategy's render method and vary by plot type (e.g., n_cells, cluster_key, fdr, gene_id).

Manifest files in each figures/ directory:

figures_INDEX.json — machine-readable list of all figure entries, each containing stem, files (sibling artefacts), and meta (sidecar content).
figures_INDEX.md — human-readable summary grouped by plot type, with a concise tag line per figure. Written by ema.viz._meta.write_figures_index.

The manifest is regenerated after every command that produces figures. Read figures_INDEX.md in any output directory to get an overview of what every figure shows without opening the individual files.
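
To make the "fixed keys are never overwritten" rule concrete, the sketch below writes a sidecar next to a figure path and lets the fixed keys win over anything the caller passes. It is an illustration of the convention only: `write_meta_sketch` is a hypothetical stand-in, not the actual ema.viz._meta.write_figure_meta, and the version string is passed in by hand rather than read from importlib.metadata.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_meta_sketch(figure_path, viz_strategy, version, extra=None):
    """Write <stem>.meta.json beside the figure; fixed keys override extras."""
    figure_path = Path(figure_path)
    meta = dict(extra or {})  # plot-specific keys first
    meta.update({             # fixed schema keys applied last, so they win
        "figure_name": figure_path.stem,
        "viz_strategy": viz_strategy,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "peakatail_version": version,
    })
    sidecar = figure_path.with_name(figure_path.stem + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_meta_sketch(
        Path(tmp) / "umap.png",
        viz_strategy="umap_matplotlib",
        version="0.0.0",  # placeholder version string
        extra={"n_cells": 1051, "viz_strategy": "caller_value"},
    )
    meta = json.loads(sidecar.read_text())

print(sidecar.name, meta["viz_strategy"], meta["n_cells"])
```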
