`ema run`¶

ema run executes the complete PeakATail pipeline: it reads one or more BAM files, calls poly(A) sites (PAS) per strand, builds cell-by-PAS count matrices, annotates PAS with gene identities from a GTF file, filters low-quality barcodes, and clusters cells using TF-IDF + LSI + Leiden. One clusters.h5ad file is produced per dataset and placed inside a timestamped output directory. All switch subcommands (diff, length, match, geneview) consume these files as their primary input.

When to use it

You have one or more 10x Chromium scRNA-seq BAM files and want to discover PAS and cluster cells by 3' UTR usage.
You want a single command that produces all intermediate files (BEDs, count matrices, AnnData) for downstream analysis.
You are running a new sample or reprocessing with changed peak-calling or clustering parameters.

When NOT to use it

You already have clusters.h5ad files from a previous run and only want to test differential APA. Use ema switch diff directly.
You want to change only the clustering resolution without re-running peak calling. Re-running ema run repeats the entire pipeline; there is currently no checkpoint resume.

Quick example¶

uv run ema run \
  --config example.yaml \
  --threads 8 \
  --peak-strategy lambda_gradient \
  --cluster-method leiden_tfidf \
  --resolution 0.8 \
  -vv

After this command completes, the following files are on disk (example with --output emaout and dataset id sample1):

peakatail_runs/emaout_2026-05-11_120000/per_dataset/sample1/clusters.h5ad — AnnData with Leiden cluster labels in obs["leiden"] and gene annotations in var["gene_id"].
peakatail_runs/emaout_2026-05-11_120000/per_dataset/sample1/pasbed.bed — Filtered BED6 PAS coordinates consumed by switch diff and switch geneview.
peakatail_runs/emaout_2026-05-11_120000/per_dataset/sample1/pas_gene.tsv — Two-column TSV mapping every pas_id to a gene_id.
peakatail_runs/emaout_2026-05-11_120000/run_config.json — Resolved parameters for reproducibility.
peakatail_runs/emaout_2026-05-11_120000/resources.jsonl — Per-second RSS + CPU samples (requires psutil).

Full `--help` output¶

Usage: ema run [OPTIONS]

  Run the full PeakATail pipeline.

Options:
  --threads INTEGER               Max parallel workers (auto-detected if not
                                  set). Respected by ResourceManager as an
                                  absolute ceiling.
  -v, --verbose                   Increase verbosity. -v = DEBUG for ema.*;
                                  -vv = DEBUG everywhere.
  -q, --quiet                     WARNING and up only. Overrides --verbose.
  --log-level TEXT                Explicit logger level (DEBUG/INFO/WARNING
                                  /ERROR) or `logger.name=LEVEL` (repeatable:
                                  comma-separated).
  --no-log-file                   Don't write peakatail_<ts>.log next to the
                                  outputs.
  --no-progress                   Suppress Rich progress bars.
  -c, --config PATH               YAML config; CLI flags override individual
                                  keys.
  -o, --output PATH               Output directory (timestamp suffix added
                                  automatically).
  --plot-engine TEXT              Engines: 'matplotlib' (default), 'plotly',
                                  'both', 'none', or comma list.
  --plot-format TEXT              Restrict output formats. Default 'all' =
                                  png+svg+html as appropriate. Examples: 'svg'
                                  / 'png,svg' / 'html'.
  --no-plots                      Disable all plotting (alias for --plot-
                                  engine none).
  --bam-dir PATH                  Single-BAM convenience.
  --bam-files TEXT                Comma-separated multi-BAM list.
  --gtf PATH                      GTF annotation file.
  --atlas PATH                    Reference PAS atlas BED.
  --atlas-distance INTEGER        Atlas snap distance (bp).  [default: 50]
  --seq-len INTEGER               Sequencing read length.
  --cb-len INTEGER                Cell-barcode length (bp).
  --barcode-tag TEXT              BAM tag holding the cell barcode (default
                                  CB).
  --bam-threads INTEGER           pysam decompression threads per BAM.
                                  [default: 4]
  --pipeline                      Pipeline mode (currently only used by
                                  --tiles).
  --batch-size INTEGER            Worker batch size for streaming reads.
                                  [default: 10000]
  --tiles                         Enable tile-based peak calling (parallel
                                  pool).
  --tile-size INTEGER             Tile size in bp (auto if unset).
  --tile-overlap INTEGER          Tile overlap in bp.  [default: 10000]
  --peak-strategy TEXT            Peak-calling strategy (run --list-strategies
                                  to see).  [default: lambda_gradient]
  --lambda-window INTEGER         Background lambda estimation window (bp).
                                  [default: 5000]
  --lambda-method TEXT            Lambda estimator (median / mean / ...).
                                  [default: median]
  --lambda-fold-change FLOAT      Lambda fold-change cutoff for peak
                                  detection.  [default: 2.0]
  --max-pas INTEGER               Maximum PAS sites kept per peak.  [default:
                                  5]
  --smoothing-window INTEGER      Coverage smoothing window (bp).  [default:
                                  50]
  --min-prominence FLOAT          scipy.signal.find_peaks prominence
                                  threshold.  [default: 5.0]
  --dynamic-threshold             Use a per-window dynamic peak threshold.
  --floor-threshold INTEGER       Minimum peak height (clamps dynamic
                                  threshold).  [default: 3]
  --pas-gap INTEGER               Minimum gap between PAS within a peak (bp).
                                  [default: 100]
  --min-pas-spacing INTEGER       Tier-1 (distance) of the post-detection PAS
                                  merger.  Adjacent PAS within one peak whose
                                  gap is below this value are merged uncon-
                                  ditionally.  -1 (default) auto-detects the
                                  median read length per BAM; 0 disables the
                                  distance tier.  [default: -1]
  --min-pas-prominence FLOAT      Tier-2 (valley) static fallback for the
                                  post-detection PAS merger.  Lambda strategies
                                  (lambda_poisson, lambda_gradient) ignore
                                  this and use their own compute_lambda(...).
                                  Non-lambda strategies (original,
                                  sierra_iterative) use this value as the
                                  valley-depth threshold.  Negative disables
                                  Tier 2.  [default: 5.0]
  --ip-filter                     Enable internal-priming filter (currently
                                  no-op).
  --genome-fasta PATH             Genome FASTA for --ip-filter (currently no-
                                  op).
  --annot-filter                  Annotation filter (currently no-op).
  --ip-a-stretch INTEGER          A-stretch length for --ip-filter (currently
                                  no-op).  [default: 6]
  --min-pas-per-cell INTEGER      Minimum PAS per cell (also bridges to
                                  filter_config.min_genes).  [default: 50]
  --min-read INTEGER              Minimum reads per cell barcode.  [default:
                                  1500]
  --min-cells INTEGER             Minimum cells expressing a PAS.  [default:
                                  3]
  --max-gene-distance INTEGER     Max distance for gene-end annotation (bp).
                                  [default: 5000]
  --utr-multiplier FLOAT          3'UTR length multiplier for extended-3'
                                  annotation.  [default: 2.0]
  --include-extended              Include extended-3' annotations.
  --cluster-method TEXT           Clustering strategy (leiden_tfidf /
                                  leiden_libsize / external).  [default:
                                  leiden_tfidf]
  --resolution FLOAT              Leiden resolution.  [default: 1.0]
  --n-pcs INTEGER                 Number of principal components.  [default:
                                  40]
  --external-clusters PATH        Path to external cluster labels TSV.
  --random-seed INTEGER           RNG seed for clustering reproducibility.
                                  [default: 42]
  --match-method TEXT             Cross-dataset cluster match strategy.
                                  [default: marker_overlap]
  --n-top-markers INTEGER         Number of top marker PAS per cluster.
                                  [default: 50]
  --n-neighbors INTEGER           Number of nearest neighbours for the kNN
                                  graph used by Leiden (leiden_tfidf default:
                                  30; leiden_libsize default: 10 — set
                                  --n-neighbors explicitly to override).
                                  [default: 30]
  --tfidf-scale-factor FLOAT      Scale factor for Signac Method 1 TF-IDF
                                  (leiden_tfidf strategy). Default 10000.
                                  [default: 10000.0]
  --depth-corr-threshold FLOAT    Pearson |r| threshold for removing LSI
                                  components correlated with sequencing depth
                                  (leiden_tfidf, ArchR-style). Default 0.75.
                                  [default: 0.75]
  --n-svd-components INTEGER      Number of SVD/PCA components computed before
                                  filtering/neighbor graph (leiden_tfidf and
                                  leiden_libsize strategies). Default 50.
                                  [default: 50]
  --n-top-hvg INTEGER             Number of highly variable genes/PAS selected
                                  before PCA (leiden_libsize strategy only).
                                  Default 2000.  [default: 2000]
  --benchmark                     No-op (use scripts/validate_strategies.py).
  --validate-db PATH              No-op (use scripts/validate_strategies.py).
  --log-level TEXT                Logger level or `name=LEVEL` (comma-
                                  separated).
  --no-log-file                   Don't write peakatail_<ts>.log next to
                                  outputs.
  --no-progress                   Suppress Rich progress bars.
  --list-strategies               Print available strategies and exit.
  --help                          Show this message and exit.

Flags¶

Inputs¶

Flag	Type	Default	Description
`--config` / `-c`	PATH	—	YAML config file. CLI flags override individual YAML keys. Requires `datasets:` block when used as the primary input method.
`--bam-dir`	PATH	—	Convenience flag: treats a single BAM file as a one-dataset run with `id="default"`. Mutually exclusive with `--bam-files` when no `--config` is given.
`--bam-files`	TEXT	—	Comma-separated list of BAM paths for a one-dataset run. Same effect as `--bam-dir` but accepts multiple files.
`--gtf`	PATH	—	Ensembl or GENCODE GTF annotation file. Used for gene-end and UTR annotation of PAS sites.
`--atlas`	PATH	—	Reference PAS atlas BED for snapping detected PAS to known sites.
`--atlas-distance`	INT	50	Snap distance in bp: a detected PAS within this distance of an atlas entry is assigned the atlas coordinates. Reduce to 10 bp for high-precision snapping.
`--seq-len`	INT	150	Sequencing read length. Warns and defaults to 150 if not set.
`--cb-len`	INT	16	Cell-barcode length in bp. Warns and defaults to 16 if not set.
`--barcode-tag`	TEXT	CB	BAM tag carrying the cell barcode. Defaults to `CB` (Cell Ranger convention).

Output¶

Flag	Type	Default	Description
`--output` / `-o`	PATH	`emaout`	Base name for the output directory. A timestamp is appended: `peakatail_runs/<name>_<timestamp>/`. When `--config` provides `output_dir`, the YAML value wins unless `--output` is explicitly set.

Concurrency¶

Flag	Type	Default	Description
`--threads`	INT	auto	Absolute ceiling passed to `ResourceManager`. Wired directly into the singleton before pipeline dispatch; omitting this flag lets `ResourceManager` detect available cores automatically.
`--bam-threads`	INT	4	pysam decompression threads per BAM file. Increase to 8–16 on machines with fast storage to reduce BAM-read I/O time.
`--batch-size`	INT	10000	Worker batch size when streaming reads in tile mode. Lower this if workers are hitting memory limits on large chromosomes.
`--tiles`	FLAG	off	Enable tile-based parallel peak calling. Splits chromosomes into overlapping tiles processed by a multiprocessing pool. Recommended for very large BAMs (>5 GB).
`--tile-size`	INT	auto	Tile size in bp. When unset, the pipeline picks a value based on chromosome lengths.
`--tile-overlap`	INT	10000	Overlap between adjacent tiles in bp. Overlap ensures PAS near tile boundaries are not missed.
`--pipeline`	FLAG	off	Pipeline mode; currently used only when `--tiles` is active.

Peak calling¶

Flag	Type	Default	Description
`--peak-strategy`	TEXT	`lambda_gradient`	Algorithm used to call PAS peaks. Run `ema run --list-strategies` to see registered names. `lambda_gradient` is the recommended production strategy (highest precision in benchmark runs); pass `--peak-strategy original` for the unfiltered baseline.
`--lambda-window`	INT	5000	Window size in bp used to estimate local background signal lambda. Increase for sparse data where the default window may include too few reads.
`--lambda-method`	TEXT	`median`	Estimator for lambda within the window. `median` is robust to outliers; `mean` may be inflated by nearby peaks.
`--lambda-fold-change`	FLOAT	2.0	A region must exceed `lambda * fold_change` to be called as a peak. Raise to 3.0–4.0 to reduce false positives in noisy data.
`--max-pas`	INT	5	Maximum PAS sites retained per peak region. Rarely needs changing; increase for loci with complex alternative polyadenylation.
`--smoothing-window`	INT	50	Gaussian smoothing window in bp applied to per-strand coverage before peak finding. Larger values suppress noise at the cost of resolution.
`--min-prominence`	FLOAT	5.0	`scipy.signal.find_peaks` prominence threshold. Lower to 2.0 to rescue low-coverage PAS; raise to 10.0 to keep only prominent peaks.
`--dynamic-threshold`	FLAG	off	Use a per-window dynamic peak height threshold rather than a fixed cutoff. Useful for samples with highly variable library depth across chromosomes.
`--floor-threshold`	INT	3	When `--dynamic-threshold` is on, this is the minimum peak height. Prevents the dynamic threshold from falling so low that noise is called.
`--pas-gap`	INT	100	Minimum bp gap between two PAS within the same peak. Increase to merge closely-spaced PAS that likely represent the same site.
`--min-pas-spacing`	INT	`-1`	Tier-1 (distance) of the post-detection PAS merger. Adjacent PAS within one peak whose gap < this value are merged unconditionally. `-1` auto-detects the median read length per BAM (e.g. ~98 bp for 10x v2, ~150 bp for v3). `0` disables Tier 1. See Post-Detection PAS Merger.
`--min-pas-prominence`	FLOAT	`5.0`	Tier-2 (valley depth) of the post-detection PAS merger. Lambda strategies (`lambda_poisson`, `lambda_gradient`) ignore this value and use their own `compute_lambda(heights)` instead — fully dynamic. Non-lambda strategies (`original`, `sierra_iterative`) treat this as a static coverage-depth threshold. Negative disables Tier 2.

Filters¶

Flag	Type	Default	Description
`--min-pas-per-cell`	INT	50	Minimum number of distinct PAS detected per cell barcode. Cells below this threshold are excluded. Also bridges to `filter_config.min_genes` in the legacy interface.
`--min-read`	INT	1500	Minimum total read count per cell barcode. Cells below this are discarded before count matrix construction. Reduce to 500 for low-depth protocols.
`--min-cells`	INT	3	Minimum number of cells expressing a PAS. PAS detected in fewer cells than this are removed from the matrix.
`--ip-filter`	FLAG	off	Currently a no-op. Will enable internal-priming filtering when integrated. Emits a warning if set.
`--genome-fasta`	PATH	—	Currently a no-op. Will supply the genome FASTA for the internal-priming filter.
`--annot-filter`	FLAG	off	Currently a no-op. Will enable annotation-based filtering. Emits a warning if set.
`--ip-a-stretch`	INT	6	Currently a no-op. A-stretch length for the internal-priming filter.

Annotation¶

Flag	Type	Default	Description
`--max-gene-distance`	INT	5000	Maximum distance in bp from a PAS to a gene 3' end for annotation assignment. PAS farther than this from any gene are left unannotated.
`--utr-multiplier`	FLOAT	2.0	The 3' UTR region is extended by `utr_multiplier * annotated_UTR_length` for the "extended-3'" annotation category.
`--include-extended`	FLAG	off	Include PAS falling in the extended 3' UTR region in the output. By default these are excluded.

Clustering¶

Flag	Type	Default	Description
`--cluster-method`	TEXT	`leiden_tfidf`	Clustering algorithm. `leiden_tfidf` uses Signac Method 1 TF-IDF + ArchR-style LSI + Leiden. `leiden_libsize` uses library-size normalisation + HVG selection + PCA + Leiden. `external` reads labels from `--external-clusters`.
`--resolution`	FLOAT	1.0	Leiden resolution parameter. Higher values produce more, smaller clusters. Start at 0.5 for coarse cell types, increase to 1.5–2.0 to split sub-populations.
`--n-pcs`	INT	40	Number of PCA components used to build the kNN graph for Leiden. Reduce to 20 for small datasets (<1000 cells).
`--n-neighbors`	INT	30	Number of nearest neighbours in the kNN graph. `leiden_tfidf` defaults to 30; `leiden_libsize` defaults to 10 internally but uses this value when explicitly set.
`--n-svd-components`	INT	50	Number of SVD/TruncatedSVD components computed before depth-correlation filtering and kNN graph construction.
`--tfidf-scale-factor`	FLOAT	10000	Scale factor for Signac Method 1 TF-IDF normalisation (`leiden_tfidf` only). Rarely needs changing unless count distributions are unusually skewed.
`--depth-corr-threshold`	FLOAT	0.75	Pearson \|r\| threshold for removing LSI components correlated with sequencing depth (ArchR-style, `leiden_tfidf` only). Set to 1.0 to disable depth-correlation filtering entirely.
`--n-top-hvg`	INT	2000	Number of highly variable PAS selected before PCA (`leiden_libsize` only).
`--external-clusters`	PATH	—	Path to a TSV with barcode→cluster label assignments. Required when `--cluster-method=external`.
`--random-seed`	INT	42	RNG seed for Leiden and TruncatedSVD reproducibility.

Cross-dataset matching¶

Flag	Type	Default	Description
`--match-method`	TEXT	`marker_overlap`	Strategy for assigning canonical cluster IDs across datasets. See `ema switch match`.
`--n-top-markers`	INT	50	Number of top marker PAS per cluster used by `marker_overlap` for cross-dataset Jaccard comparison.

Logging and observability¶

Flag	Type	Default	Description
`-v` / `--verbose`	COUNT	0	`-v` sets DEBUG for the `ema.*` logger namespace; `-vv` sets DEBUG everywhere including third-party libraries.
`-q` / `--quiet`	FLAG	off	Suppress INFO messages; show WARNING and above only. Overrides `-v`.
`--log-level`	TEXT	INFO	Global level string (`DEBUG`, `INFO`, `WARNING`, `ERROR`) or per-logger override syntax `ema.clustering=DEBUG,ema.annotate=WARNING` (comma-separated).
`--no-log-file`	FLAG	off	Do not write `peakatail_<ts>.log` inside the output directory. Useful for CI pipelines where logs are captured via stdout.
`--no-progress`	FLAG	off	Suppress Rich live-progress bars. Useful when running in non-interactive shells or capturing output.
`--list-strategies`	FLAG	off	Print all registered strategy names across the peak-calling, clustering, and match registries, then exit.

Plotting¶

Flag	Type	Default	Description
`--plot-engine`	TEXT	`matplotlib`	Comma-separated list of rendering engines. `matplotlib` produces PNG/SVG; `plotly` produces interactive HTML; `both` produces all three; `none` disables plotting.
`--plot-format`	TEXT	`all`	Restrict output formats: `png`, `svg`, `html`, or comma list. `all` writes whatever each engine supports by default.
`--no-plots`	FLAG	off	Alias for `--plot-engine none`. Disables all figure output and skips the viz hooks entirely.

Validation flags (no-ops)¶

Flag	Type	Default	Description
`--benchmark`	FLAG	off	No-op. Use `scripts/validate_strategies.py` instead. Emits a warning if set.
`--validate-db`	PATH	—	No-op. Use `scripts/validate_strategies.py` instead. Emits a warning if set.

Output files¶

All paths below are relative to the run root peakatail_runs/<name>_<timestamp>/.

per_dataset/<id>/raw/pos.bed and raw/neg.bed : Raw peak-calling output for the positive and negative strands respectively, concatenated across BAM replicates. Unfiltered.

per_dataset/<id>/raw/pas.bed : Union of pos.bed and neg.bed. BED6 format: chrom, start, end, pas_id, score, strand.

per_dataset/<id>/pasbed.bed : Filtered and (optionally atlas-snapped) PAS coordinates. BED6. This is the canonical PAS BED consumed by ema switch diff and ema switch geneview. Source: ema/outputs.py::write_per_dataset_beds.

per_dataset/<id>/filtered_cb.tsv : Single-column TSV of barcodes passing the --min-read filter. Header: barcode\tmin_read=<n>. Source: ema/outputs.py::write_filtered_cb.

per_dataset/<id>/annotated_matrix.mtx, annotated_pas_ids.tsv, annotated_cells.tsv : MatrixMarket sparse count matrix (PAS rows, cell columns) after PAS-gene annotation. Row and column indices in the parallel TSV files. Source: ema/outputs.py::write_annotated_matrix.

per_dataset/<id>/preprocessed.h5ad : Filtered AnnData before clustering. Inspect this to confirm the cell and PAS counts after quality filtering. Source: ema/outputs.py::write_preprocessed_h5ad.

per_dataset/<id>/clusters.h5ad : Final AnnData with Leiden cluster labels in obs["leiden"] and gene annotations in var["gene_id"]. Primary input for all ema switch subcommands.

per_dataset/<id>/pas_gene.tsv : Two-column TSV (pas_id, gene_id) mapping every PAS to its annotated gene. Source: ema/outputs.py::write_pas_gene_artifacts.

per_dataset/<id>/annotatedpas.bed : Extends pasbed.bed with a trailing gene_id column. Useful for IGV inspection.

run_config.json : Resolved run parameters with a timestamp. Used for reproducibility and by the pipeline viz hooks.

resources.jsonl : Newline-delimited JSON records with schema {"elapsed_s": float, "rss_gb": float, "cpu_pct": float}. Written by a background _ResourceSampler thread at 5-second intervals. Requires psutil; silently absent if not installed.

tile_timings.json : Per-tile peak-calling wall times. Only written when --tiles is active.

How it relates to other commands¶

After ema run completes, use the per-dataset clusters.h5ad and pasbed.bed files as inputs to the switch subcommands:

ema switch diff — differential APA between Leiden cluster pairs.
ema switch length — per-cluster PDUI / proportion / entropy quantification.
ema switch match — align cluster identities across multiple datasets.
ema switch geneview — visualise per-cluster PAS usage for specific genes.

ema run¶