Cross-dataset matching strategies¶
Cross-dataset matching assigns canonical cluster IDs when the same experiment
is processed in multiple batches, or when you want to align cell populations
across independent samples. Each strategy reads a list of .h5ad files
(one per dataset), each of which must have obs["leiden"] populated by a
prior ema run or clustering step. The output is a DataFrame mapping every
(dataset_id, original_cluster) pair to a canonical_cluster integer.
For CLI usage see ../cli/switch-match.md.
All three strategies share the same output schema:
| Column | Type | Description |
|---|---|---|
| `dataset_id` | str | Human-readable dataset identifier |
| `original_cluster` | str | Leiden cluster label from that dataset |
| `canonical_cluster` | int | Canonical cluster ID shared across datasets |
| `match_confidence` | float | Mean Jaccard similarity for all edges within the component (0.0 for singletons) |
| `matched_to` | str | JSON list of `[[dataset_id, original_cluster], ...]` matched partners |
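Because `matched_to` is stored as a JSON string, it has to be decoded before the partners can be used. The snippet below is a minimal sketch of that step; the row values are made up for illustration, `parse_partners` is a hypothetical helper, and treating an empty list (`"[]"`) as the singleton case is an assumption, not something the schema states.

```python
import json

# Illustrative row only; the values are made up to match the documented schema.
row = {
    "dataset_id": "A",
    "original_cluster": "0",
    "canonical_cluster": 1,
    "match_confidence": 0.8,
    "matched_to": '[["B", "1"]]',
}


def parse_partners(matched_to):
    """Decode the JSON string into a list of (dataset_id, original_cluster) tuples."""
    return [tuple(pair) for pair in json.loads(matched_to)]


partners = parse_partners(row["matched_to"])
```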
marker_overlap¶
Question: Which clusters in dataset A correspond to which clusters in dataset B based on the overlap of their top-ranked differentially expressed PAS?
Algorithm¶
- For each dataset, load the `.h5ad` file and compute per-cluster marker PAS using Wilcoxon rank-sum (`scanpy.tl.rank_genes_groups`, `method="wilcoxon"`). The top-`n_top_markers` PAS by score form the cluster "fingerprint". If the data looks raw (integers, max value > 20), log-normalization is applied first.
- For each pair of datasets (A, B), compute the Jaccard similarity matrix between every cluster in A and every cluster in B: `J(A_i, B_j) = |M(A_i) ∩ M(B_j)| / |M(A_i) ∪ M(B_j)|`, where `M(·)` is the cluster's marker fingerprint.
- Solve the maximum-weight 1-to-1 assignment using the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) on the negated similarity matrix. Only pairs with Jaccard > 0 are accepted.
- Build a graph where nodes are `(dataset_id, original_cluster)` tuples and edges connect matched pairs. Extract connected components using union-find (transitive closure): if A.0 ~ B.2 and B.2 ~ C.5, all three share one canonical ID.
- Assign canonical integer IDs (1, 2, ...) to each component. Unmatched singletons get unique IDs. `match_confidence` is the mean Jaccard of all edges within the component.
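The matching steps above (Jaccard matrix, Hungarian assignment, union-find closure) can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the helper names `jaccard_matrix`, `match_pair`, and `connected_components` are hypothetical, the marker fingerprints are toy sets with made-up PAS names, and the order in which canonical IDs are numbered is arbitrary here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def jaccard_matrix(sets_a, sets_b):
    """Jaccard similarity between every set in sets_a and every set in sets_b."""
    sim = np.zeros((len(sets_a), len(sets_b)))
    for i, a in enumerate(sets_a):
        for j, b in enumerate(sets_b):
            union = len(a | b)
            sim[i, j] = len(a & b) / union if union else 0.0
    return sim


def match_pair(sim):
    """Maximum-weight 1-to-1 assignment; keep only pairs with similarity > 0."""
    rows, cols = linear_sum_assignment(-sim)  # negate: the solver minimizes cost
    return [(int(r), int(c), float(sim[r, c]))
            for r, c in zip(rows, cols) if sim[r, c] > 0]


def connected_components(nodes, edges):
    """Union-find over matched pairs; returns {node: canonical_id}, IDs from 1."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    ids, canonical = {}, {}
    for n in nodes:
        root = find(n)
        if root not in ids:
            ids[root] = len(ids) + 1
        canonical[n] = ids[root]
    return canonical


# Toy fingerprints (made-up PAS names), keyed by original cluster label.
markers_a = {"0": {"p1", "p2", "p3"}, "1": {"p7", "p8"}}
markers_b = {"0": {"p7", "p8", "p9"}, "1": {"p1", "p2", "p3"}, "2": {"p20"}}

labels_a, labels_b = list(markers_a), list(markers_b)
sim = jaccard_matrix([markers_a[k] for k in labels_a],
                     [markers_b[k] for k in labels_b])
pairs = match_pair(sim)

nodes = [("A", k) for k in labels_a] + [("B", k) for k in labels_b]
edges = [(("A", labels_a[r]), ("B", labels_b[c])) for r, c, _ in pairs]
canonical = connected_components(nodes, edges)
```

In this toy input, cluster B.2 shares no markers with any cluster in A, so it ends up as a singleton with its own canonical ID.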
Marker computation is parallelized across datasets via joblib.Parallel
(loky backend, per_worker_mb=200).
Inputs and outputs¶
Inputs:
- `h5ad_paths`: list of paths to `.h5ad` files
- `dataset_ids`: list of human-readable identifiers (same order)
- `n_top_markers`: number of top marker PAS per cluster
Outputs: canonical mapping DataFrame (schema above).
Tunable hyperparameters¶
| Parameter | Default | CLI flag | YAML key | Description |
|---|---|---|---|---|
| `n_top_markers` | 50 | `--n-top-markers` | `n_top_markers` | Top-N marker PAS per cluster used as the cluster fingerprint. |
Interpretation¶
match_confidence near 1.0 indicates near-identical marker sets between
the matched clusters (strong biological correspondence). Values < 0.1 indicate
weak overlap; inspect whether the datasets share the same PAS feature space.
Unmatched clusters (confidence = 0.0) are singletons; they may represent
populations unique to that dataset.
A sanity check: run with n_top_markers=10 and n_top_markers=100; stable
matches (same canonical ID assignments) indicate robust correspondence.
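One way to run this sanity check is to compare how the two runs group clusters. The sketch below assumes each run's mapping has been reduced to a dict of `(dataset_id, original_cluster) -> canonical_cluster`. It compares the groupings rather than the raw integers, so a mere renumbering of canonical IDs between runs does not count as a change, which is a slightly looser criterion than identical ID assignments. `partition` and `stable_fraction` are hypothetical helper names, and the two mappings are made-up examples.

```python
from collections import defaultdict


def partition(mapping):
    """Group nodes by canonical ID and return the groups as a set of frozensets."""
    groups = defaultdict(set)
    for node, canonical_id in mapping.items():
        groups[canonical_id].add(node)
    return {frozenset(g) for g in groups.values()}


def stable_fraction(map_a, map_b):
    """Fraction of groups (over the union of both runs) that appear in both runs."""
    pa, pb = partition(map_a), partition(map_b)
    return len(pa & pb) / len(pa | pb)


# Made-up mappings standing in for runs with n_top_markers=10 and 100.
run_10 = {("A", "0"): 1, ("B", "1"): 1, ("A", "1"): 2, ("B", "0"): 2}
run_100 = {("A", "0"): 2, ("B", "1"): 2, ("A", "1"): 1, ("B", "0"): 3}

score = stable_fraction(run_10, run_100)
```

A score of 1.0 means both runs produced the same groups; lower values point to matches that depend on the choice of `n_top_markers`.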
Limitations¶
- Requires datasets to share a meaningful fraction of PAS features. Datasets processed with different reference BED files or different capture technologies may have few shared markers even for the same cell type; use `mnn` in that case.
- Wilcoxon rank-sum marker computation inside scanpy requires the h5ad to have at least 2 clusters. Single-cluster datasets fall back to returning all features as markers.
- The Hungarian assignment enforces 1-to-1 matching per dataset pair. When one dataset has more clusters than another, the extra clusters become singletons.
mnn¶
Question: Which clusters in dataset A correspond to which clusters in dataset B based on how their cells link in a shared embedding space, even when marker gene overlap is low?
Algorithm¶
- Load all `.h5ad` files and concatenate their cell × feature matrices along the obs axis, intersecting to the common set of PAS features (`var_names`). A `batch` column records the dataset origin. Raises `ValueError` if no common features exist.
- Log-normalize if the data looks raw, then run Truncated SVD (LSI) with `n_components` components on the concatenated matrix. L2-normalize the embedding rows.
- For each pair of datasets (A, B), find Mutual Nearest Neighbors in the shared embedding:
    - For each cell in A, find the `k_neighbors` nearest cells in B (cosine distance).
    - For each cell in B, find the `k_neighbors` nearest cells in A.
    - A pair (i, j), with i in A and j in B, is mutual if j is among i's nearest neighbors in B AND i is among j's nearest neighbors in A.
- For each cluster A_i, count MNN votes to each cluster B_j and compute a similarity score for every (A_i, B_j) pair from those vote counts (see Interpretation below for how the resulting `match_confidence` is defined).
- Hungarian assignment + transitive closure, identical to `marker_overlap`.
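The MNN search and vote counting can be sketched with plain NumPy. This is a minimal illustration, assuming the shared embedding has already been computed; the function names and the tiny 2-D toy embedding are hypothetical, and a brute-force similarity matrix stands in for whatever neighbor search the package uses. For L2-normalized rows, ranking by dot product is equivalent to ranking by cosine distance.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def knn_indices(query, reference, k):
    """For each query row, indices of the k most cosine-similar reference rows."""
    sim = query @ reference.T  # cosine similarity for unit-length rows
    k = min(k, reference.shape[0])
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_nearest_neighbors(emb_a, emb_b, k):
    """Pairs (i, j) where j is in i's neighbors in B and i is in j's neighbors in A."""
    a_to_b = knn_indices(emb_a, emb_b, k)
    b_to_a = knn_indices(emb_b, emb_a, k)
    b_sets = [set(row.tolist()) for row in b_to_a]
    return [(i, int(j)) for i, row in enumerate(a_to_b) for j in row
            if i in b_sets[int(j)]]


def vote_matrix(pairs, clusters_a, clusters_b):
    """Count MNN pairs between every cluster of A and every cluster of B."""
    labels_a, labels_b = sorted(set(clusters_a)), sorted(set(clusters_b))
    votes = np.zeros((len(labels_a), len(labels_b)), dtype=int)
    for i, j in pairs:
        votes[labels_a.index(clusters_a[i]), labels_b.index(clusters_b[j])] += 1
    return labels_a, labels_b, votes


# Toy embedding: two populations pointing along different axes.
emb_a = l2_normalize(np.array([[1, 0.05], [1, 0.1], [1, 0.0],
                               [0.05, 1], [0.1, 1], [0.0, 1]], dtype=float))
emb_b = l2_normalize(np.array([[0.02, 1], [0.08, 1], [0.12, 1],
                               [1, 0.02], [1, 0.08], [1, 0.12]], dtype=float))
clusters_a = ["0", "0", "0", "1", "1", "1"]
clusters_b = ["0", "0", "0", "1", "1", "1"]

pairs = mutual_nearest_neighbors(emb_a, emb_b, k=2)
labels_a, labels_b, votes = vote_matrix(pairs, clusters_a, clusters_b)
```

In the toy data, cluster "0" of A lies in the same direction as cluster "1" of B, so all votes fall on the off-diagonal entries of `votes`. The brute-force similarity matrix also makes the O(n_cells_A * n_cells_B) cost per dataset pair visible.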
Inputs and outputs¶
Inputs:
- `h5ad_paths`, `dataset_ids`
- `n_components`: LSI components for shared embedding
- `k_neighbors`: MNN search neighborhood size
Outputs: canonical mapping DataFrame (schema above).
Tunable hyperparameters¶
| Parameter | Default | CLI flag | YAML key | Description |
|---|---|---|---|---|
| `n_components` | 30 | `--mnn-components` | `mnn_components` | LSI components for the shared embedding. More components capture finer structure at higher cost. |
| `k_neighbors` | 10 | `--mnn-k-neighbors` | `mnn_k_neighbors` | Nearest neighbors for MNN search. Larger values produce denser MNN graphs. |
Interpretation¶
MNN is most useful when datasets come from different protocols (e.g., Smart-seq2
and 10x Chromium) or have very different library sizes that reduce the marker
gene overlap. MNN match_confidence is the fraction of cluster A_i cells
whose dominant MNN partner is in cluster B_j; values above 0.3 indicate
reliable matches.
Check the shared embedding by inspecting whether cells from different datasets intermingle in a UMAP of the concatenated matrix. Strong batch separation in the embedding will cause false negatives (no MNN links across the boundary).
Limitations¶
- Re-embedding ALL datasets together is memory-intensive: memory scales with total cell count × n_features. For > 100,000 cells, consider subsampling.
- No batch correction (no Harmony, no Scanorama). Strong batch effects may dominate the embedding, causing biologically similar clusters to fail to link.
- All datasets must share the same PAS feature space (same unified PAS BED file). Datasets processed with different BED files will have no common features and the strategy raises `ValueError`.
- MNN search cost is O(n_cells_A * n_cells_B) per pair; for large datasets this is slower than `marker_overlap`.
jaccard¶
Question: Do the same physical cells appear in the same cluster across two runs of the same BAM (for regression testing or split-BAM experiments)?
Algorithm¶
- Load each `.h5ad` and build a mapping from cluster label to the set of stripped cell barcodes in that cluster.
- Strip dataset prefixes from barcodes: `sample#ACGT-1` → `ACGT-1`, and `sampleA_ACGTGCATAGCT` → `ACGTGCATAGCT` (only when the suffix looks like a real DNA barcode; `cellA_0` is kept as-is to avoid misidentifying synthetic test names).
- For each pair of datasets (A, B), compute the Jaccard similarity matrix between cluster barcode sets: `J(A_i, B_j) = |bc(A_i) ∩ bc(B_j)| / |bc(A_i) ∪ bc(B_j)|`, where `bc(·)` is the cluster's set of stripped barcodes.
- Hungarian assignment + transitive closure, identical to `marker_overlap`.
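The barcode stripping and set comparison can be sketched as follows. The regular expression is the heuristic quoted under Limitations; everything else is an assumption made for illustration: the helper names `strip_prefix` and `jaccard` are hypothetical, and so is the rule order (split on `#` first, then on the last `_` only when the suffix passes the DNA-barcode check). The barcodes are made up.

```python
import re

# DNA-barcode heuristic as quoted in the Limitations section.
DNA_BARCODE = re.compile(r"[ACGTN]{6,}(-\d+)?")


def strip_prefix(barcode):
    """Drop a dataset prefix; the rule order here is an assumption, not the real code."""
    if "#" in barcode:
        return barcode.rsplit("#", 1)[1]
    if "_" in barcode:
        suffix = barcode.rsplit("_", 1)[1]
        if DNA_BARCODE.fullmatch(suffix):
            return suffix
    return barcode  # e.g. synthetic names such as cellA_0 are kept as-is


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up barcodes for one cluster in each of two runs.
run1_cluster = {strip_prefix(b) for b in ["s1#AAACCCGGGT-1", "s1#TTTGGGCCCA-1"]}
run2_cluster = {strip_prefix(b) for b in
                ["s2#AAACCCGGGT-1", "s2#TTTGGGCCCA-1", "s2#GGGAAATTTC-1"]}

similarity = jaccard(run1_cluster, run2_cluster)
```

Running `strip_prefix` over a handful of real barcodes from each dataset is a quick way to confirm that stripping behaves as expected before relying on the match results.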
Inputs and outputs¶
Inputs:
- `h5ad_paths`, `dataset_ids`
- `n_top_markers`, `n_jobs`: accepted for interface compatibility but unused
Outputs: canonical mapping DataFrame (schema above).
Tunable hyperparameters¶
None. This strategy performs pure set operations on barcodes; there are no algorithm parameters.
Interpretation¶
When the same physical cells are clustered twice (deterministic run with the
same seed), match_confidence should approach 1.0. Values below 0.9 for
identical runs indicate non-determinism in clustering (e.g., different igraph
versions) or barcode format mismatches. Use match_confidence as a
reproducibility metric.
Limitations¶
- Biologically distinct samples will share no barcodes, producing no matches. Do not use `jaccard` for cross-sample comparison; use `marker_overlap` or `mnn` instead.
- The DNA-barcode heuristic (`[ACGTN]{6,}(-\d+)?`) may incorrectly strip prefixes from synthetic test barcodes that happen to contain long DNA-like strings. Verify barcode stripping on your dataset with a small test before relying on it.