ema parse-gtf
Pre-parse and cache an Ensembl-style GTF so subsequent ema run / ema switch length --isoform-agg per_isoform invocations skip the expensive multi-minute GTF parse and load from disk in milliseconds.
The cache lives at ~/.cache/peakatail/gtf/<gtf-hash>/ by default. Both ema run and ema switch length look it up automatically — parse-gtf is just a way to warm it explicitly (e.g. as part of an environment-setup script, or before a Docker build).
When to use it
- First-time setup for a fresh environment — warm the cache so the first real run doesn't pay the parse cost.
- CI / Docker images that bake the cache into the layer so containers start instantly.
- You changed GTFs (e.g. Ensembl version bump) and want to confirm the new file parses cleanly before the pipeline tries to use it.
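The environment-setup case can be sketched as a small shell function. This is an illustration only: warm_gtf_cache is a hypothetical helper name, the path in the usage comment is a placeholder, and only the --gtf flag from the help output is used.

```shell
# Hypothetical helper for a setup script; not part of the ema CLI.
warm_gtf_cache() {
  gtf="$1"
  if [ ! -f "$gtf" ]; then
    echo "GTF not found: $gtf" >&2
    return 1
  fi
  if ! command -v ema >/dev/null 2>&1; then
    echo "ema is not on PATH; skipping cache warm-up" >&2
    return 2
  fi
  # Parse once now so later ema run / ema switch length calls load from the cache.
  ema parse-gtf --gtf "$gtf"
}

# Usage (placeholder path):
#   warm_gtf_cache /refs/annotation.gtf
```

In a Docker build the same call would sit in a build step after the GTF has been copied in, so that the populated cache directory ends up in the image layer.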
When NOT to use it
- You're running a pipeline only once. ema run populates the cache on its first invocation anyway, so pre-warming has no benefit.
Quick examples
Pre-parse one GTF:
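A minimal sketch, using only the --gtf flag from the help output; annotation.gtf is a placeholder path:

```shell
GTF="annotation.gtf"   # placeholder: path to your Ensembl-style GTF
ema parse-gtf --gtf "$GTF"
```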
List existing cache entries:
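A minimal sketch, using the --show-cache flag from the help output (per the help text, --gtf is not required in this mode):

```shell
ema parse-gtf --show-cache
```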
Force re-parse on cache hit (e.g. after a code change):
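A minimal sketch, combining the --gtf and --force flags from the help output; annotation.gtf is again a placeholder path:

```shell
GTF="annotation.gtf"   # placeholder: path to an already-cached GTF
ema parse-gtf --gtf "$GTF" --force
```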
Full --help output
Usage: ema parse-gtf [OPTIONS]
Pre-warm the GTF cache so subsequent runs hit instantly.
Options:
--threads INT Max parallel workers (parse is multiprocessed).
-v, --verbose Increase verbosity.
-q, --quiet WARNING and up only.
--log-level TEXT Explicit logger level.
--no-progress Suppress Rich progress bars.
-c, --config FILE YAML config.
-g, --gtf PATH GTF file to pre-parse (required unless --show-cache).
--cache-dir PATH Override the global ~/.cache/peakatail/gtf/ location.
--force Re-parse even on cache hit.
--show-cache List cache entries with sizes/age and exit.
-h, --help Show this message and exit.
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --gtf / -g | PATH | (none) | GTF to pre-parse. Required unless --show-cache is used. |
| --cache-dir | PATH (dir) | ~/.cache/peakatail/gtf/ | Override the cache location (e.g. for a CI runner with limited home-dir size). |
| --force | flag | False | Re-parse and overwrite the cache entry even on hit. |
| --show-cache | flag | False | List cached GTFs with size + age, then exit. |
| --threads | INT | auto | Parallel workers for the parse (parse_isoform_utrs uses up to min(--threads, 4) chromosome workers). |
The common logging flags (-v, -q, --log-level, --no-progress, --config) are also accepted. There is no --output flag; the output is the cache directory.
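The cap mentioned for --threads can be made concrete with a little arithmetic. The function below only mirrors the stated min(--threads, 4) rule; it is an illustrative sketch, not PeakATail code, and chromosome_workers is a made-up name.

```shell
# Mirrors the documented upper bound: chromosome workers = min(--threads, 4).
chromosome_workers() {
  requested="$1"
  cap=4
  if [ "$requested" -lt "$cap" ]; then
    echo "$requested"
  else
    echo "$cap"
  fi
}

for t in 1 2 4 8 16; do
  printf '%s\n' "--threads $t -> at most $(chromosome_workers "$t") chromosome workers"
done
```

Under that rule, raising --threads above 4 would not add chromosome workers to this step.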
Output files
The parse writes to <cache-dir>/<gtf-content-hash>/:
- <hash>/transcripts.parquet: per-transcript exon + 3'UTR table
- <hash>/source.txt: original GTF path + hash for provenance
- <hash>/meta.json: parse timestamp, peakatail version, transcript count
Future invocations of ema run / ema switch length compute the same content-hash and skip directly to loading the parquet.
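One way to sanity-check a cache directory is to confirm that each entry holds the three files named above. The sketch below assumes the default cache location; check_cache_entry is a hypothetical helper, not an ema subcommand, and it checks only that the files exist, not their contents.

```shell
# Hypothetical helper: verify that a cache entry contains the three
# files listed above. Returns 1 if any of them is missing.
check_cache_entry() {
  dir="$1"
  rc=0
  for f in transcripts.parquet source.txt meta.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Walk the default cache location; substitute your own path if you use --cache-dir.
for entry in "$HOME"/.cache/peakatail/gtf/*/; do
  [ -d "$entry" ] || continue   # glob matched nothing: no cache entries yet
  check_cache_entry "$entry" && echo "ok: $entry"
done
```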
How it relates to other commands
The cache is consumed transparently by:
- ema run: when annotating peaks against transcripts.
- ema switch length --isoform-agg per_isoform: when mapping PAS to isoform UTRs.
- ema switch geneview --gtf <gtf>: when rendering the isoform structure track.
No manual flag is required to use a cached entry; PeakATail computes the hash and looks it up automatically.
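The general idea behind a content-hash lookup can be illustrated as follows. This is a sketch under stated assumptions: SHA-256 via GNU sha256sum stands in for whatever hash PeakATail actually uses (that is not specified here), content_key is a made-up name, and the one-line GTF is toy data. If the key is derived from file contents alone, a renamed copy maps to the same entry while an edited file maps to a new one.

```shell
# Stand-in content hash: SHA-256 of the file bytes (an assumption, see above).
content_key() {
  sha256sum "$1" | cut -d' ' -f1
}

tmp=$(mktemp -d)
printf 'chr1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n' > "$tmp/a.gtf"
cp "$tmp/a.gtf" "$tmp/renamed.gtf"                  # same bytes, different name
{ cat "$tmp/a.gtf"; printf '# trailing comment\n'; } > "$tmp/edited.gtf"   # different bytes

key_a=$(content_key "$tmp/a.gtf")
key_renamed=$(content_key "$tmp/renamed.gtf")
key_edited=$(content_key "$tmp/edited.gtf")

[ "$key_a" = "$key_renamed" ] && echo "renamed copy: same key, would reuse the cached entry"
[ "$key_a" != "$key_edited" ] && echo "edited file: new key, would trigger a fresh parse"
rm -rf "$tmp"
```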
See also
- ema run
- ema switch length: per-isoform mode
- ema switch geneview: isoform overlay