Nagarjun's blog: False Positives: How to Avoid Seeing APOBEC Everywhere

Tuesday, June 30, 2026

False Positives: How to Avoid Seeing APOBEC Everywhere

“cannot be attributed to random mutagenesis”

Source: Carmi, Church, and Levanon

The genome is full of repeats, transitions, alignment ambiguity, and local sequence biases. Any large-scale scan will find striking patterns somewhere. A serious APOBEC detector must therefore be built like a paranoid little machine, constantly asking: what else could generate this pattern?

False positives can come from at least seven sources.

First, ordinary mutation. Over millions of years, every repeat copy accumulates substitutions. If two copies are old enough, many G-to-A differences will appear without any burst process. This is why same-subfamily comparisons and cluster thresholds are important. The detector should avoid deeply diverged alignments where the background mutation fog is thick.

Second, CpG deamination. Methylated CpG sites mutate readily, producing C-to-T changes on one strand and G-to-A on the other. If the analysis ignores CpG context, some ordinary methylation-driven transitions could mimic APOBEC. APOBEC motif analysis helps, but a robust model should explicitly account for CpG-associated transitions.
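One way to make the CpG accounting explicit is to label every G-to-A mismatch by its dinucleotide context before any motif analysis. The sketch below is a minimal illustration, not a published method: it assumes two gap-free, equal-length aligned sequences on the analyzed strand, and the function name `classify_g_to_a` is hypothetical. A G preceded by C sits in a CpG, so a G-to-A change there is compatible with deamination of the methylated C on the opposite strand.

```python
def classify_g_to_a(ref: str, alt: str) -> list[tuple[int, str]]:
    """Label each G-to-A mismatch as 'cpg' or 'non_cpg'.

    Assumes ref and alt are gap-free aligned sequences of equal length,
    both given on the analyzed strand (hypothetical helper).
    """
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    ref, alt = ref.upper(), alt.upper()
    calls = []
    for i, (r, a) in enumerate(zip(ref, alt)):
        if r == "G" and a == "A":
            # The G of a CpG is the one immediately preceded by C.
            in_cpg = i > 0 and ref[i - 1] == "C"
            calls.append((i, "cpg" if in_cpg else "non_cpg"))
    return calls
```

Mismatches labelled `cpg` can then be dropped or modelled separately, so that only the `non_cpg` set feeds the cluster and motif tests.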

Third, alignment artefacts. Repeats are hard to align. Indels, low-complexity segments, internal duplications, and tandem repeats can create apparent clusters of mismatches. Filtering should remove low-quality alignment blocks, require sufficient aligned length, and exclude regions dominated by gaps or simple sequences.
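The length and gap filters described above could look roughly like this. It is a sketch under stated assumptions: the thresholds `min_aligned=200` and `max_gap_frac=0.10` are placeholders rather than values taken from any published screen, and the low-complexity check is left out because it would normally be delegated to a dedicated masking tool.

```python
def passes_alignment_filter(ref: str, alt: str,
                            min_aligned: int = 200,
                            max_gap_frac: float = 0.10) -> bool:
    """Return True if a pairwise alignment is long enough and not gap-dominated.

    ref and alt are aligned strings of equal length with '-' for gaps.
    Both thresholds are illustrative placeholders.
    """
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    if not ref:
        return False
    # A column counts as a gap column if either sequence has a gap there.
    gap_cols = sum(1 for r, a in zip(ref, alt) if r == "-" or a == "-")
    aligned = len(ref) - gap_cols
    return aligned >= min_aligned and gap_cols / len(ref) <= max_gap_frac
```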

Fourth, assembly error. Older draft genomes, high-copy regions, and collapsed repeats can create spurious differences or erase real ones. Comparative studies across many non-model species are especially vulnerable. Assembly quality should be included as a covariate, and high-confidence examples should be checked against independent assemblies where available.

Fifth, gene conversion. Homologous repeats can exchange sequence after insertion. This can make copies look younger than they are or create patchy similarity that confuses source-copy inference. Local phylogenetic inconsistency is a warning sign.

Sixth, duplicate descent. If one edited element is duplicated, all descendants inherit the edited sites. Counting each descendant as an independent APOBEC event inflates estimates. This is particularly dangerous in recent expansions, segmental duplications, and lineage-specific bursts.
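A crude correction for duplicate descent is to collapse elements that share nearly the same set of edited sites and count each group once. The following is a minimal sketch, assuming edited sites have already been mapped to shared consensus coordinates; the function name and the Jaccard cutoff of 0.8 are hypothetical choices, and a real analysis would more likely use a phylogeny of the copies.

```python
def count_independent_events(edit_sites: dict[str, set[int]],
                             min_jaccard: float = 0.8) -> int:
    """Greedily group elements with near-identical edited-site sets.

    edit_sites maps an element name to its edited positions in consensus
    coordinates. Each group is counted as one editing event.
    """
    representatives: list[set[int]] = []
    for name in sorted(edit_sites):  # sorted for a deterministic result
        sites = set(edit_sites[name])
        if not sites:
            continue  # elements with no edits contribute no event
        for rep in representatives:
            jaccard = len(sites & rep) / len(sites | rep)
            if jaccard >= min_jaccard:
                break  # treated as a descendant of an already counted event
        else:
            representatives.append(sites)
    return len(representatives)
```

With this kind of collapsing, three copies carrying the same five edits count as one event rather than three.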

Seventh, orientation mistakes. If the repeat orientation is wrong or if the analyzed strand does not match the biologically meaningful sense strand, the expected G-to-A versus C-to-T asymmetry can flip or weaken.
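The flip is easy to demonstrate. In this small sketch (the helper names are illustrative), the same pair of aligned sequences is counted on the given strand and on its reverse complement, and the G-to-A excess turns into a C-to-T excess.

```python
_COMPLEMENT = str.maketrans("ACGT-", "TGCA-")

def revcomp(seq: str) -> str:
    """Reverse-complement an aligned sequence, keeping gap characters."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def strand_counts(ref: str, alt: str, strand: str = "+") -> tuple[int, int]:
    """Return (G-to-A count, C-to-T count) as read on the requested strand."""
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    if strand == "-":
        ref, alt = revcomp(ref), revcomp(alt)
    else:
        ref, alt = ref.upper(), alt.upper()
    g_to_a = sum(1 for r, a in zip(ref, alt) if r == "G" and a == "A")
    c_to_t = sum(1 for r, a in zip(ref, alt) if r == "C" and a == "T")
    return g_to_a, c_to_t
```

Calling `strand_counts("GGGCAT", "AGACAT", "+")` gives two G-to-A changes and no C-to-T changes; with `"-"` the counts swap, which is exactly how a wrong orientation annotation can hide or invert the asymmetry.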

The best published screens use several controls. Knisbacher and Levanon compared G-to-A clusters against C-to-T mirror events, looked at DNA transposons as a non-target class, and used invertebrates as APOBEC-poor controls. These controls operate at different levels: strand specificity, substrate specificity, and organismal biology. Their agreement makes the APOBEC interpretation much stronger.
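The mirror-event control can be turned into a simple number. The sketch below is not the statistic used in the cited screens; it assumes that, absent strand-specific editing, a clustered transition is equally likely to be G-to-A or C-to-T, and it computes a one-sided binomial tail under that assumption.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def strand_asymmetry_pvalue(g_to_a: int, c_to_t: int) -> float:
    """One-sided p-value for an excess of G-to-A over mirror C-to-T events.

    Assumes a 50/50 null between the two directions (an illustrative choice).
    """
    n = g_to_a + c_to_t
    if n == 0:
        return 1.0
    return binomial_tail(g_to_a, n)
```

A small p-value here only addresses strand specificity. The substrate and organismal controls still need their own comparisons.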

A useful modern extension is simulation. For every candidate alignment, simulate mutations under a model preserving alignment length, base composition, CpG density, local divergence, and transition/transversion ratio. Then ask how often a cluster as dense and motif-biased as the observed one appears by chance. This provides a locus-level empirical p-value instead of relying only on a global threshold.
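A stripped-down version of the simulation idea is shown here. It is deliberately simpler than the model described above: it preserves only the alignment length, the positions of G in the reference, and the number of G-to-A mismatches, and it places those mismatches uniformly at random among the G sites. CpG density, local divergence, and the transition/transversion ratio are not modelled. The window size, simulation count, and function names are assumptions for illustration.

```python
import random

def max_window_count(positions, window: int) -> int:
    """Largest number of positions falling within any span of `window` bases."""
    positions = sorted(positions)
    best = left = 0
    for right, pos in enumerate(positions):
        while pos - positions[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best

def empirical_cluster_pvalue(ref: str, observed_positions, window: int = 50,
                             n_sim: int = 1000, seed: int = 0) -> float:
    """Empirical p-value for the densest observed G-to-A cluster.

    Null model (a simplification): the same number of G-to-A changes is
    scattered uniformly over the reference G sites.
    """
    rng = random.Random(seed)
    g_sites = [i for i, base in enumerate(ref.upper()) if base == "G"]
    k = len(observed_positions)
    observed = max_window_count(observed_positions, window)
    hits = 0
    for _ in range(n_sim):
        simulated = rng.sample(g_sites, k)
        if max_window_count(simulated, window) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)
```

The add-one correction means the smallest reportable p-value is 1 / (n_sim + 1), so the number of simulations sets the resolution of the test.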

Another extension is mixture modelling. Rather than classify every mismatch as APOBEC or background, model the alignment as a mixture of a background substitution process plus a burst component. The burst component should enrich for G-to-A, be spatially clustered, and prefer APOBEC motifs. The output becomes a posterior probability per site and per element.
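To make the per-site posterior concrete, here is a toy two-component calculation. Every number in it is a made-up placeholder: in a real model the mixing weight and the per-component probabilities would be estimated from data (for example by expectation-maximisation), and the spatial clustering term is omitted entirely. The two features are treated as independent given the component, which is a simplifying assumption.

```python
def burst_posterior(is_g_to_a: bool, in_motif: bool,
                    prior_burst: float = 0.1,
                    p_g_to_a: tuple[float, float] = (0.95, 0.25),
                    p_motif: tuple[float, float] = (0.8, 0.2)) -> float:
    """Posterior probability that a mismatch came from the burst component.

    p_g_to_a and p_motif hold (burst, background) probabilities.
    All default values are illustrative, not fitted.
    """
    def likelihood(component: int) -> float:
        a = p_g_to_a[component] if is_g_to_a else 1 - p_g_to_a[component]
        m = p_motif[component] if in_motif else 1 - p_motif[component]
        return a * m  # features assumed independent given the component

    burst = prior_burst * likelihood(0)
    background = (1 - prior_burst) * likelihood(1)
    return burst / (burst + background)
```

With these placeholder values, a G-to-A mismatch inside a preferred motif gets a posterior of about 0.63, while a mismatch that is neither gets well under 0.01. An element-level score could then combine the site posteriors.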

Yet another improvement is replication across evidence types. The strongest candidates satisfy multiple independent tests: they are in LTR or retroviral elements, show G-to-A clusters, have APOBEC motif enrichment, pass consensus directionality, lack comparable C-to-T clusters, are young or species-specific, and contain ORF-disrupting edits, such as TGG tryptophan codons converted by G-to-A changes into stop codons (TAG, TGA, or TAA).
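The stop-codon test is the easiest of these to automate. A minimal sketch, assuming the two sequences are gap-free, already in the correct reading frame, and on the sense strand (the function name is hypothetical):

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def trp_to_stop_edits(ref_orf: str, alt_orf: str) -> list[tuple[int, str]]:
    """Find codons where TGG (Trp) in ref_orf is a stop codon in alt_orf.

    Returns (codon index, stop codon) pairs. Both inputs are assumed to be
    in-frame, gap-free, and on the sense strand.
    """
    ref_orf, alt_orf = ref_orf.upper(), alt_orf.upper()
    n = min(len(ref_orf), len(alt_orf))
    hits = []
    for i in range(0, n - 2, 3):
        ref_codon, alt_codon = ref_orf[i:i + 3], alt_orf[i:i + 3]
        # TGG can reach TAG, TGA, or TAA using only G-to-A changes.
        if ref_codon == "TGG" and alt_codon in STOP_CODONS:
            hits.append((i // 3, alt_codon))
    return hits
```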

The goal is not to eliminate all uncertainty. Ancient sequence reconstruction cannot do that. The goal is to prevent a single seductive pattern from doing all the argumentative work. APOBEC inference should be cumulative, like a lock that needs several tumblers to click before the door opens.

Key technical takeaway: A robust APOBEC screen needs negative controls, strand controls, motif controls, alignment-quality filters, and copy-descent correction. Otherwise, repeat-rich genomes will happily manufacture false thunder.
