ema run¶
ema run executes the complete PeakATail pipeline: it reads one or more BAM
files, calls poly(A) sites (PAS) per strand, builds cell-by-PAS count matrices,
annotates PAS with gene identities from a GTF file, filters low-quality
barcodes, and clusters cells using TF-IDF + LSI + Leiden. One clusters.h5ad
file is produced per dataset and placed inside a timestamped output directory.
All switch subcommands (diff, length, match, geneview) consume these
files as their primary input.
When to use it
- You have one or more 10x Chromium scRNA-seq BAM files and want to discover PAS and cluster cells by 3' UTR usage.
- You want a single command that produces all intermediate files (BEDs, count matrices, AnnData) for downstream analysis.
- You are running a new sample or reprocessing with changed peak-calling or clustering parameters.
When NOT to use it
- You already have
clusters.h5adfiles from a previous run and only want to test differential APA. Useema switch diffdirectly. - You want to change only the clustering resolution without re-running peak
calling. Re-running
ema runrepeats the entire pipeline; there is currently no checkpoint resume.
Quick example¶
uv run ema run \
--config example.yaml \
--threads 8 \
--peak-strategy lambda_gradient \
--cluster-method leiden_tfidf \
--resolution 0.8 \
-vv
After this command completes, the following files are on disk (example with
--output emaout and dataset id sample1):
peakatail_runs/emaout_2026-05-11_120000/per_dataset/sample1/clusters.h5ad— AnnData with Leiden cluster labels inobs["leiden"]and gene annotations invar["gene_id"].peakatail_runs/emaout_2026-05-11_120000/per_dataset/sample1/pasbed.bed— Filtered BED6 PAS coordinates consumed byswitch diffandswitch geneview.peakatail_runs/emaout_2026-05-11_120000/per_dataset/sample1/pas_gene.tsv— Two-column TSV mapping everypas_idto agene_id.peakatail_runs/emaout_2026-05-11_120000/run_config.json— Resolved parameters for reproducibility.peakatail_runs/emaout_2026-05-11_120000/resources.jsonl— Per-second RSS + CPU samples (requirespsutil).
Full --help output¶
Usage: ema run [OPTIONS]
Run the full PeakATail pipeline.
Options:
--threads INTEGER Max parallel workers (auto-detected if not
set). Respected by ResourceManager as an
absolute ceiling.
-v, --verbose Increase verbosity. -v = DEBUG for ema.*;
-vv = DEBUG everywhere.
-q, --quiet WARNING and up only. Overrides --verbose.
--log-level TEXT Explicit logger level (DEBUG/INFO/WARNING
/ERROR) or `logger.name=LEVEL` (repeatable:
comma-separated).
--no-log-file Don't write peakatail_<ts>.log next to the
outputs.
--no-progress Suppress Rich progress bars.
-c, --config PATH YAML config; CLI flags override individual
keys.
-o, --output PATH Output directory (timestamp suffix added
automatically).
--plot-engine TEXT Engines: 'matplotlib' (default), 'plotly',
'both', 'none', or comma list.
--plot-format TEXT Restrict output formats. Default 'all' =
png+svg+html as appropriate. Examples: 'svg'
/ 'png,svg' / 'html'.
--no-plots Disable all plotting (alias for --plot-
engine none).
--bam-dir PATH Single-BAM convenience.
--bam-files TEXT Comma-separated multi-BAM list.
--gtf PATH GTF annotation file.
--atlas PATH Reference PAS atlas BED.
--atlas-distance INTEGER Atlas snap distance (bp). [default: 50]
--seq-len INTEGER Sequencing read length.
--cb-len INTEGER Cell-barcode length (bp).
--barcode-tag TEXT BAM tag holding the cell barcode (default
CB).
--bam-threads INTEGER pysam decompression threads per BAM.
[default: 4]
--pipeline Pipeline mode (currently only used by
--tiles).
--batch-size INTEGER Worker batch size for streaming reads.
[default: 10000]
--tiles Enable tile-based peak calling (parallel
pool).
--tile-size INTEGER Tile size in bp (auto if unset).
--tile-overlap INTEGER Tile overlap in bp. [default: 10000]
--peak-strategy TEXT Peak-calling strategy (run --list-strategies
to see). [default: lambda_gradient]
--lambda-window INTEGER Background lambda estimation window (bp).
[default: 5000]
--lambda-method TEXT Lambda estimator (median / mean / ...).
[default: median]
--lambda-fold-change FLOAT Lambda fold-change cutoff for peak
detection. [default: 2.0]
--max-pas INTEGER Maximum PAS sites kept per peak. [default:
5]
--smoothing-window INTEGER Coverage smoothing window (bp). [default:
50]
--min-prominence FLOAT scipy.signal.find_peaks prominence
threshold. [default: 5.0]
--dynamic-threshold Use a per-window dynamic peak threshold.
--floor-threshold INTEGER Minimum peak height (clamps dynamic
threshold). [default: 3]
--pas-gap INTEGER Minimum gap between PAS within a peak (bp).
[default: 100]
--min-pas-spacing INTEGER Tier-1 (distance) of the post-detection PAS
merger. Adjacent PAS within one peak whose
gap is below this value are merged uncon-
ditionally. -1 (default) auto-detects the
median read length per BAM; 0 disables the
distance tier. [default: -1]
--min-pas-prominence FLOAT Tier-2 (valley) static fallback for the
post-detection PAS merger. Lambda strategies
(lambda_poisson, lambda_gradient) ignore
this and use their own compute_lambda(...).
Non-lambda strategies (original,
sierra_iterative) use this value as the
valley-depth threshold. Negative disables
Tier 2. [default: 5.0]
--ip-filter Enable internal-priming filter (currently
no-op).
--genome-fasta PATH Genome FASTA for --ip-filter (currently no-
op).
--annot-filter Annotation filter (currently no-op).
--ip-a-stretch INTEGER A-stretch length for --ip-filter (currently
no-op). [default: 6]
--min-pas-per-cell INTEGER Minimum PAS per cell (also bridges to
filter_config.min_genes). [default: 50]
--min-read INTEGER Minimum reads per cell barcode. [default:
1500]
--min-cells INTEGER Minimum cells expressing a PAS. [default:
3]
--max-gene-distance INTEGER Max distance for gene-end annotation (bp).
[default: 5000]
--utr-multiplier FLOAT 3'UTR length multiplier for extended-3'
annotation. [default: 2.0]
--include-extended Include extended-3' annotations.
--cluster-method TEXT Clustering strategy (leiden_tfidf /
leiden_libsize / external). [default:
leiden_tfidf]
--resolution FLOAT Leiden resolution. [default: 1.0]
--n-pcs INTEGER Number of principal components. [default:
40]
--external-clusters PATH Path to external cluster labels TSV.
--random-seed INTEGER RNG seed for clustering reproducibility.
[default: 42]
--match-method TEXT Cross-dataset cluster match strategy.
[default: marker_overlap]
--n-top-markers INTEGER Number of top marker PAS per cluster.
[default: 50]
--n-neighbors INTEGER Number of nearest neighbours for the kNN
graph used by Leiden (leiden_tfidf default:
30; leiden_libsize default: 10 — set
--n-neighbors explicitly to override).
[default: 30]
--tfidf-scale-factor FLOAT Scale factor for Signac Method 1 TF-IDF
(leiden_tfidf strategy). Default 10000.
[default: 10000.0]
--depth-corr-threshold FLOAT Pearson |r| threshold for removing LSI
components correlated with sequencing depth
(leiden_tfidf, ArchR-style). Default 0.75.
[default: 0.75]
--n-svd-components INTEGER Number of SVD/PCA components computed before
filtering/neighbor graph (leiden_tfidf and
leiden_libsize strategies). Default 50.
[default: 50]
--n-top-hvg INTEGER Number of highly variable genes/PAS selected
before PCA (leiden_libsize strategy only).
Default 2000. [default: 2000]
--benchmark No-op (use scripts/validate_strategies.py).
--validate-db PATH No-op (use scripts/validate_strategies.py).
--log-level TEXT Logger level or `name=LEVEL` (comma-
separated).
--no-log-file Don't write peakatail_<ts>.log next to
outputs.
--no-progress Suppress Rich progress bars.
--list-strategies Print available strategies and exit.
--help Show this message and exit.
Flags¶
Inputs¶
| Flag | Type | Default | Description |
|---|---|---|---|
--config / -c |
PATH | — | YAML config file. CLI flags override individual YAML keys. Requires datasets: block when used as the primary input method. |
--bam-dir |
PATH | — | Convenience flag: treats a single BAM file as a one-dataset run with id="default". Mutually exclusive with --bam-files when no --config is given. |
--bam-files |
TEXT | — | Comma-separated list of BAM paths for a one-dataset run. Same effect as --bam-dir but accepts multiple files. |
--gtf |
PATH | — | Ensembl or GENCODE GTF annotation file. Used for gene-end and UTR annotation of PAS sites. |
--atlas |
PATH | — | Reference PAS atlas BED for snapping detected PAS to known sites. |
--atlas-distance |
INT | 50 | Snap distance in bp: a detected PAS within this distance of an atlas entry is assigned the atlas coordinates. Reduce to 10 bp for high-precision snapping. |
--seq-len |
INT | 150 | Sequencing read length. Warns and defaults to 150 if not set. |
--cb-len |
INT | 16 | Cell-barcode length in bp. Warns and defaults to 16 if not set. |
--barcode-tag |
TEXT | CB | BAM tag carrying the cell barcode. Defaults to CB (Cell Ranger convention). |
Output¶
| Flag | Type | Default | Description |
|---|---|---|---|
--output / -o |
PATH | emaout |
Base name for the output directory. A timestamp is appended: peakatail_runs/<name>_<timestamp>/. When --config provides output_dir, the YAML value wins unless --output is explicitly set. |
Concurrency¶
| Flag | Type | Default | Description |
|---|---|---|---|
--threads |
INT | auto | Absolute ceiling passed to ResourceManager. Wired directly into the singleton before pipeline dispatch; omitting this flag lets ResourceManager detect available cores automatically. |
--bam-threads |
INT | 4 | pysam decompression threads per BAM file. Increase to 8–16 on machines with fast storage to reduce BAM-read I/O time. |
--batch-size |
INT | 10000 | Worker batch size when streaming reads in tile mode. Lower this if workers are hitting memory limits on large chromosomes. |
--tiles |
FLAG | off | Enable tile-based parallel peak calling. Splits chromosomes into overlapping tiles processed by a multiprocessing pool. Recommended for very large BAMs (>5 GB). |
--tile-size |
INT | auto | Tile size in bp. When unset, the pipeline picks a value based on chromosome lengths. |
--tile-overlap |
INT | 10000 | Overlap between adjacent tiles in bp. Overlap ensures PAS near tile boundaries are not missed. |
--pipeline |
FLAG | off | Pipeline mode; currently used only when --tiles is active. |
Peak calling¶
| Flag | Type | Default | Description |
|---|---|---|---|
--peak-strategy |
TEXT | lambda_gradient |
Algorithm used to call PAS peaks. Run ema run --list-strategies to see registered names. lambda_gradient is the recommended production strategy (highest precision in benchmark runs); pass --peak-strategy original for the unfiltered baseline. |
--lambda-window |
INT | 5000 | Window size in bp used to estimate local background signal lambda. Increase for sparse data where the default window may include too few reads. |
--lambda-method |
TEXT | median |
Estimator for lambda within the window. median is robust to outliers; mean may be inflated by nearby peaks. |
--lambda-fold-change |
FLOAT | 2.0 | A region must exceed lambda * fold_change to be called as a peak. Raise to 3.0–4.0 to reduce false positives in noisy data. |
--max-pas |
INT | 5 | Maximum PAS sites retained per peak region. Rarely needs changing; increase for loci with complex alternative polyadenylation. |
--smoothing-window |
INT | 50 | Gaussian smoothing window in bp applied to per-strand coverage before peak finding. Larger values suppress noise at the cost of resolution. |
--min-prominence |
FLOAT | 5.0 | scipy.signal.find_peaks prominence threshold. Lower to 2.0 to rescue low-coverage PAS; raise to 10.0 to keep only prominent peaks. |
--dynamic-threshold |
FLAG | off | Use a per-window dynamic peak height threshold rather than a fixed cutoff. Useful for samples with highly variable library depth across chromosomes. |
--floor-threshold |
INT | 3 | When --dynamic-threshold is on, this is the minimum peak height. Prevents the dynamic threshold from falling so low that noise is called. |
--pas-gap |
INT | 100 | Minimum bp gap between two PAS within the same peak. Increase to merge closely-spaced PAS that likely represent the same site. |
--min-pas-spacing |
INT | -1 |
Tier-1 (distance) of the post-detection PAS merger. Adjacent PAS within one peak whose gap < this value are merged unconditionally. -1 auto-detects the median read length per BAM (e.g. ~98 bp for 10x v2, ~150 bp for v3). 0 disables Tier 1. See Post-Detection PAS Merger. |
--min-pas-prominence |
FLOAT | 5.0 |
Tier-2 (valley depth) of the post-detection PAS merger. Lambda strategies (lambda_poisson, lambda_gradient) ignore this value and use their own compute_lambda(heights) instead — fully dynamic. Non-lambda strategies (original, sierra_iterative) treat this as a static coverage-depth threshold. Negative disables Tier 2. |
Filters¶
| Flag | Type | Default | Description |
|---|---|---|---|
--min-pas-per-cell |
INT | 50 | Minimum number of distinct PAS detected per cell barcode. Cells below this threshold are excluded. Also bridges to filter_config.min_genes in the legacy interface. |
--min-read |
INT | 1500 | Minimum total read count per cell barcode. Cells below this are discarded before count matrix construction. Reduce to 500 for low-depth protocols. |
--min-cells |
INT | 3 | Minimum number of cells expressing a PAS. PAS detected in fewer cells than this are removed from the matrix. |
--ip-filter |
FLAG | off | Currently a no-op. Will enable internal-priming filtering when integrated. Emits a warning if set. |
--genome-fasta |
PATH | — | Currently a no-op. Will supply the genome FASTA for the internal-priming filter. |
--annot-filter |
FLAG | off | Currently a no-op. Will enable annotation-based filtering. Emits a warning if set. |
--ip-a-stretch |
INT | 6 | Currently a no-op. A-stretch length for the internal-priming filter. |
Annotation¶
| Flag | Type | Default | Description |
|---|---|---|---|
--max-gene-distance |
INT | 5000 | Maximum distance in bp from a PAS to a gene 3' end for annotation assignment. PAS farther than this from any gene are left unannotated. |
--utr-multiplier |
FLOAT | 2.0 | The 3' UTR region is extended by utr_multiplier * annotated_UTR_length for the "extended-3'" annotation category. |
--include-extended |
FLAG | off | Include PAS falling in the extended 3' UTR region in the output. By default these are excluded. |
Clustering¶
| Flag | Type | Default | Description |
|---|---|---|---|
--cluster-method |
TEXT | leiden_tfidf |
Clustering algorithm. leiden_tfidf uses Signac Method 1 TF-IDF + ArchR-style LSI + Leiden. leiden_libsize uses library-size normalisation + HVG selection + PCA + Leiden. external reads labels from --external-clusters. |
--resolution |
FLOAT | 1.0 | Leiden resolution parameter. Higher values produce more, smaller clusters. Start at 0.5 for coarse cell types, increase to 1.5–2.0 to split sub-populations. |
--n-pcs |
INT | 40 | Number of PCA components used to build the kNN graph for Leiden. Reduce to 20 for small datasets (<1000 cells). |
--n-neighbors |
INT | 30 | Number of nearest neighbours in the kNN graph. leiden_tfidf defaults to 30; leiden_libsize defaults to 10 internally but uses this value when explicitly set. |
--n-svd-components |
INT | 50 | Number of SVD/TruncatedSVD components computed before depth-correlation filtering and kNN graph construction. |
--tfidf-scale-factor |
FLOAT | 10000 | Scale factor for Signac Method 1 TF-IDF normalisation (leiden_tfidf only). Rarely needs changing unless count distributions are unusually skewed. |
--depth-corr-threshold |
FLOAT | 0.75 | Pearson |r| threshold for removing LSI components correlated with sequencing depth (ArchR-style, leiden_tfidf only). Set to 1.0 to disable depth-correlation filtering entirely. |
--n-top-hvg |
INT | 2000 | Number of highly variable PAS selected before PCA (leiden_libsize only). |
--external-clusters |
PATH | — | Path to a TSV with barcode→cluster label assignments. Required when --cluster-method=external. |
--random-seed |
INT | 42 | RNG seed for Leiden and TruncatedSVD reproducibility. |
Cross-dataset matching¶
| Flag | Type | Default | Description |
|---|---|---|---|
--match-method |
TEXT | marker_overlap |
Strategy for assigning canonical cluster IDs across datasets. See ema switch match. |
--n-top-markers |
INT | 50 | Number of top marker PAS per cluster used by marker_overlap for cross-dataset Jaccard comparison. |
Logging and observability¶
| Flag | Type | Default | Description |
|---|---|---|---|
-v / --verbose |
COUNT | 0 | -v sets DEBUG for the ema.* logger namespace; -vv sets DEBUG everywhere including third-party libraries. |
-q / --quiet |
FLAG | off | Suppress INFO messages; show WARNING and above only. Overrides -v. |
--log-level |
TEXT | INFO | Global level string (DEBUG, INFO, WARNING, ERROR) or per-logger override syntax ema.clustering=DEBUG,ema.annotate=WARNING (comma-separated). |
--no-log-file |
FLAG | off | Do not write peakatail_<ts>.log inside the output directory. Useful for CI pipelines where logs are captured via stdout. |
--no-progress |
FLAG | off | Suppress Rich live-progress bars. Useful when running in non-interactive shells or capturing output. |
--list-strategies |
FLAG | off | Print all registered strategy names across the peak-calling, clustering, and match registries, then exit. |
Plotting¶
| Flag | Type | Default | Description |
|---|---|---|---|
--plot-engine |
TEXT | matplotlib |
Comma-separated list of rendering engines. matplotlib produces PNG/SVG; plotly produces interactive HTML; both produces all three; none disables plotting. |
--plot-format |
TEXT | all |
Restrict output formats: png, svg, html, or comma list. all writes whatever each engine supports by default. |
--no-plots |
FLAG | off | Alias for --plot-engine none. Disables all figure output and skips the viz hooks entirely. |
Validation flags (no-ops)¶
| Flag | Type | Default | Description |
|---|---|---|---|
--benchmark |
FLAG | off | No-op. Use scripts/validate_strategies.py instead. Emits a warning if set. |
--validate-db |
PATH | — | No-op. Use scripts/validate_strategies.py instead. Emits a warning if set. |
Output files¶
All paths below are relative to the run root
peakatail_runs/<name>_<timestamp>/.
per_dataset/<id>/raw/pos.bed and raw/neg.bed
: Raw peak-calling output for the positive and negative strands respectively, concatenated across BAM replicates. Unfiltered.
per_dataset/<id>/raw/pas.bed
: Union of pos.bed and neg.bed. BED6 format: chrom, start, end, pas_id, score, strand.
per_dataset/<id>/pasbed.bed
: Filtered and (optionally atlas-snapped) PAS coordinates. BED6. This is the canonical PAS BED consumed by ema switch diff and ema switch geneview. Source: ema/outputs.py::write_per_dataset_beds.
per_dataset/<id>/filtered_cb.tsv
: Single-column TSV of barcodes passing the --min-read filter. Header: barcode\tmin_read=<n>. Source: ema/outputs.py::write_filtered_cb.
per_dataset/<id>/annotated_matrix.mtx, annotated_pas_ids.tsv, annotated_cells.tsv
: MatrixMarket sparse count matrix (PAS rows, cell columns) after PAS-gene annotation. Row and column indices in the parallel TSV files. Source: ema/outputs.py::write_annotated_matrix.
per_dataset/<id>/preprocessed.h5ad
: Filtered AnnData before clustering. Inspect this to confirm the cell and PAS counts after quality filtering. Source: ema/outputs.py::write_preprocessed_h5ad.
per_dataset/<id>/clusters.h5ad
: Final AnnData with Leiden cluster labels in obs["leiden"] and gene annotations in var["gene_id"]. Primary input for all ema switch subcommands.
per_dataset/<id>/pas_gene.tsv
: Two-column TSV (pas_id, gene_id) mapping every PAS to its annotated gene. Source: ema/outputs.py::write_pas_gene_artifacts.
per_dataset/<id>/annotatedpas.bed
: Extends pasbed.bed with a trailing gene_id column. Useful for IGV inspection.
run_config.json
: Resolved run parameters with a timestamp. Used for reproducibility and by the pipeline viz hooks.
resources.jsonl
: Newline-delimited JSON records with schema {"elapsed_s": float, "rss_gb": float, "cpu_pct": float}. Written by a background _ResourceSampler thread at 5-second intervals. Requires psutil; silently absent if not installed.
tile_timings.json
: Per-tile peak-calling wall times. Only written when --tiles is active.
How it relates to other commands¶
After ema run completes, use the per-dataset clusters.h5ad and pasbed.bed
files as inputs to the switch subcommands:
ema switch diff— differential APA between Leiden cluster pairs.ema switch length— per-cluster PDUI / proportion / entropy quantification.ema switch match— align cluster identities across multiple datasets.ema switch geneview— visualise per-cluster PAS usage for specific genes.
See also¶
- Strategy overview in
../strategies/— detailed algorithm descriptions for peak calling and clustering. - Tutorial in
../tutorials/— step-by-step single-sample and multi-sample walkthroughs.