Saturday, June 27, 2026

The Fossil Genome: Why Repeats Can Preserve Ancient Editing

 “fossil record”

Source: paleovirology literature on endogenous retroviruses

The first step in detecting APOBEC editing in repeat elements is changing how we think about the genome. A repeat element is not only a sequence annotation, a RepeatMasker row, or a nuisance in a mapping pipeline. It can also be a historical object. Endogenous retroviruses, LTR retrotransposons, LINEs, SINEs, and SVA elements preserve molecular events that occurred while mobile DNA was copying itself, invading germline genomes, or being restrained by host defense proteins.

This is why paleovirology papers often describe endogenous retroviruses as a fossil record. A provirus integrated into the germline can be inherited vertically. Over time, it accumulates ordinary substitutions, deletions, recombination events, and disabling mutations. But if the retroviral cDNA was attacked by APOBEC before integration, the integrated copy may also preserve a burst of cytidine deamination, visible later as clusters of G-to-A substitutions on the plus strand.

That immediately raises the central technical question of the whole series: how do we tell a burst from a clock?

Ordinary neutral evolution produces substitutions over time. Some classes of substitution are more common than others, and CpG deamination can create abundant C-to-T changes, which read as G-to-A changes on the opposite strand. APOBEC editing is different in three ways. First, it is clustered: many mutations appear in a short segment of a single element. Second, it is directional: APOBEC activity produces G-to-A changes on the retroelement sense strand because cytosines were deaminated on the complementary minus-strand cDNA during reverse transcription. Third, it is motif-biased: different APOBEC enzymes prefer different local nucleotide contexts. Read on the sense strand, APOBEC3G-like editing is usually described as GG-to-AG and APOBEC3F-like editing as GA-to-AA.
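The three properties can be made concrete with a small sketch. It assumes you already have a repeat copy aligned, column for column, to a putative ancestral sequence (for example a subfamily consensus), both written on the retroelement sense strand. The function names, the 50-column window, and the output layout are illustrative assumptions, not a published method.

```python
from collections import Counter


def next_base(seq, i):
    """Return the next non-gap character 3' of position i, or 'N' at the end."""
    for base in seq[i + 1:]:
        if base != "-":
            return base
    return "N"


def ga_edit_profile(ancestral, copy, window=50):
    """Summarise substitutions between an aligned ancestral sequence and a copy.

    Returns all substitution counts (directionality), the sense-strand
    dinucleotide context of each G>A change (motif bias), and the largest
    number of G>A changes in any window of alignment columns (clustering).
    """
    if len(ancestral) != len(copy):
        raise ValueError("sequences must be aligned to the same length")
    anc, cpy = ancestral.upper(), copy.upper()

    substitutions = Counter()
    ga_contexts = Counter()
    ga_positions = []
    for i, (a, c) in enumerate(zip(anc, cpy)):
        # Skip gaps, ambiguous bases, and unchanged columns.
        if a not in "ACGT" or c not in "ACGT" or a == c:
            continue
        substitutions[f"{a}>{c}"] += 1
        if a == "G" and c == "A":
            ga_positions.append(i)
            # Context is the edited G plus the ancestral base 3' of it:
            # "GG" is the APOBEC3G-like context, "GA" the APOBEC3F-like one.
            ga_contexts["G" + next_base(anc, i)] += 1

    # Densest run of G>A changes within `window` alignment columns.
    max_in_window = 0
    for k, start in enumerate(ga_positions):
        n = sum(1 for p in ga_positions[k:] if p < start + window)
        max_in_window = max(max_in_window, n)

    return {
        "substitutions": substitutions,
        "ga_contexts": ga_contexts,
        "max_ga_in_window": max_in_window,
    }


# Toy example: three G>A changes in GG context and one C>T change.
profile = ga_edit_profile("ATGGCATGGTTCGGAAT", "ATAGTATAGTTCAGAAT")
print(profile["substitutions"], profile["ga_contexts"])
```

Comparing the G>A count with the C>T count is a rough check on directionality, because strand-symmetric processes such as CpG decay should inflate both. A real analysis would also need a statistical test against the background substitution spectrum, which this sketch leaves out.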

The genome therefore gives us a forensic problem. We do not observe the ancient enzyme. We observe extant sequence copies. We then reconstruct a likely ancestral state, usually using a subfamily consensus, a closely related unedited copy, orthologous loci in related species, or a phylogenetic model. If one copy carries many A bases where its putative source and consensus carry G bases, and if those differences are clustered and motif-biased, the case for APOBEC editing becomes strong.
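As a minimal illustration of the simplest of those reconstruction options, the sketch below builds a majority-rule consensus from copies that are already aligned to one another. The function name and the 50 percent threshold are assumptions made for this example. Real subfamily consensus building is more involved, and phylogenetic reconstruction is a separate technique not shown here.

```python
from collections import Counter


def majority_consensus(aligned_copies, min_fraction=0.5):
    """Build a majority-rule consensus from equal-length aligned sequences.

    A column gets a base only if that base exceeds `min_fraction` of the
    non-gap, unambiguous characters; otherwise it is written as 'N'.
    Columns with no informative characters are written as '-'.
    """
    if not aligned_copies:
        raise ValueError("need at least one sequence")
    length = len(aligned_copies[0])
    if any(len(seq) != length for seq in aligned_copies):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_copies):
        counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
        if not counts:
            consensus.append("-")
            continue
        base, n = counts.most_common(1)[0]
        total = sum(counts.values())
        consensus.append(base if n / total > min_fraction else "N")
    return "".join(consensus)


# Toy example: two copies each carry one private G>A difference.
copies = ["ATGGC", "ATGGC", "ATAGC", "ATGAC"]
print(majority_consensus(copies))
```

A majority consensus is only a proxy for the ancestral state. If an edited source element produced many descendants, its edits can dominate a column and be mistaken for the ancestral base.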

The dating problem is more delicate. A G-to-A cluster does not contain a calendar date. Most studies estimate the date of the repeat insertion or repeat expansion, then infer that APOBEC editing occurred before or around integration. For LTR retrotransposons and endogenous retroviruses, editing is usually placed during reverse transcription. For non-LTR retrotransposons, the relevant exposure of single-stranded DNA occurs during target-primed reverse transcription or related replication intermediates, but deamination is not always the dominant restriction mechanism, so an editing signature may be weak or absent even where APOBEC proteins restricted the element.
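One common way to estimate insertion age for an LTR element is from the divergence between its two LTRs, on the assumption that they were identical at integration and have since diverged independently. The sketch below applies a Jukes-Cantor correction and divides by twice the substitution rate. The function name is hypothetical, and the rate used in the example is a placeholder that must be replaced with a value appropriate to the host lineage.

```python
import math


def ltr_insertion_age(ltr5, ltr3, subs_per_site_per_year):
    """Estimate insertion age in years from two aligned LTR sequences.

    Assumes the LTRs were identical at integration, so the corrected
    distance K accumulated along two branches: age = K / (2 * rate).
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTRs must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        # Only count columns where both LTRs have an unambiguous base.
        if a in "ACGT" and b in "ACGT":
            compared += 1
            differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")

    p = differences / compared
    if p >= 0.75:
        raise ValueError("divergence too high for Jukes-Cantor correction")

    # Jukes-Cantor distance corrects for multiple hits at the same site.
    k = -0.75 * math.log(1 - 4 * p / 3)
    return k / (2 * subs_per_site_per_year)


# Toy example: 100 bp LTRs differing at 2 sites.
# The rate 2e-9 is a placeholder assumption, not a recommended value.
ltr_a = "ACGT" * 25
ltr_b = "TCGT" + "ACGT" * 23 + "ACGA"
print(round(ltr_insertion_age(ltr_a, ltr_b, 2e-9)))
```

The result dates the insertion, not the edit. It is an upper bound on when the integrated copy could have been edited before integration only under the model that editing happened during the reverse transcription that produced this copy.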

A useful conceptual model is to split the problem into four layers.

First, there is the biochemical layer: could APOBEC plausibly generate this pattern?

Second, there is the alignment layer: can we infer which base is ancestral and which is derived?

Third, there is the population or phylogenetic layer: when did this repeat copy appear relative to species splits, subfamily expansion, or polymorphism?

Fourth, there is the ecological layer: what repeat families were active, and what APOBEC genes existed, expanded, or diversified in the host lineage at that time?

Most errors arise when these layers are collapsed. A study may robustly detect edited copies but not independently date each editing event. A study may date an ERV invasion but not show that every copy in the family was edited. A study may observe many edited copies but not distinguish independent APOBEC attacks from descendants of one edited source. Good interpretation keeps these quantities separate.
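The last of these distinctions can be probed with a crude heuristic: copies descended from one edited source should share the same G-to-A positions, whereas independent attacks should mostly hit different sites. The sketch below computes the overlap of edited positions between two copies aligned to the same ancestral sequence. The function names are hypothetical, and the overlap score is an illustration, not a substitute for phylogenetic analysis.

```python
def ga_sites(ancestral, copy):
    """Return alignment columns where the ancestral base is G and the copy has A."""
    if len(ancestral) != len(copy):
        raise ValueError("sequences must be aligned to the same length")
    return {
        i
        for i, (a, c) in enumerate(zip(ancestral.upper(), copy.upper()))
        if a == "G" and c == "A"
    }


def shared_edit_fraction(ancestral, copy_a, copy_b):
    """Jaccard overlap of G>A positions in two copies (0.0 if neither is edited)."""
    sites_a = ga_sites(ancestral, copy_a)
    sites_b = ga_sites(ancestral, copy_b)
    union = sites_a | sites_b
    return len(sites_a & sites_b) / len(union) if union else 0.0


# Toy example: the first two copies share edits; the third is edited elsewhere.
ancestral = "GGGGGGGG"
print(shared_edit_fraction(ancestral, "AAGGGGGG", "AAGAGGGG"))
print(shared_edit_fraction(ancestral, "AAGGGGGG", "GGGGGGAA"))
```

High overlap suggests shared inheritance from an edited source, but hotspot motifs can also cause independent events to hit the same sites, so the score needs a motif-aware null expectation before it supports any conclusion.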

This series follows the whole pipeline: signature detection, parent-child inference, consensus filters, species-specific dating, recent expansion bias, APOBEC gene copy number, species trends, functional assays, arms-race interpretation, and broader genomic impact. The genome is a fossil bed, but the fossils are shattered, copied, nested, and sometimes copied again. Reading them requires both statistical caution and a taste for molecular archaeology.

Key technical takeaway: APOBEC repeat editing is usually dated indirectly. The edit is inferred from clustered, directional, motif-biased substitutions; the date is inferred from insertion age, species distribution, repeat-family history, or divergence between the two LTRs of a provirus.
