“source sequences”
Source: Carmi, Church, and Levanon
The hardest part of detecting ancient editing is deciding which sequence state is ancestral. Suppose two repeat copies differ at a position: one has G, and the other has A. Calling that an APOBEC edit assumes the change went from G to A. But sequence alignments alone do not give direction. A-to-G is also a possible transition. Without directionality, an APOBEC detector is just counting differences.
The common solution is to use a parent-child or source-edited model. The idea is that a newly inserted edited element should resemble the element that produced it, except at APOBEC-induced sites. If a genomic copy contains many A bases where a highly similar partner contains G bases, the G-rich partner becomes a candidate source or ancestral proxy. The A-rich copy becomes the candidate edited descendant.
Knisbacher and Levanon formalised this using same-subfamily LTR alignments and a consensus filter. They first identified candidate pairwise alignments with clustered G-to-A differences. Then they asked whether the subfamily consensus supported the G state at those sites. If most candidate editing positions are G in the consensus, and the A-containing element is more diverged from the consensus than the G-containing element, then the G-to-A direction becomes much more plausible.
This design is elegant because it creates a local evolutionary triangle: candidate source copy, candidate edited copy, and subfamily consensus. If all three agree with the edit model, the inference is strong. If the consensus is ambiguous or supports A, the case weakens. If the A-containing copy is not more diverged from consensus, the candidate may be a false directional call.
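The triangle check can be sketched in a few lines of Python. This is a minimal illustration, not the published pipeline: it assumes the three sequences are already aligned to equal length, and the function names, the `min_support` threshold of 0.8, and the call labels are all hypothetical choices made for the example.

```python
def divergence(seq, consensus):
    """Fraction of ungapped aligned columns where seq differs from consensus."""
    pairs = [(s, c) for s, c in zip(seq, consensus) if s != "-" and c != "-"]
    if not pairs:
        return 0.0
    return sum(s != c for s, c in pairs) / len(pairs)


def triangle_call(g_copy, a_copy, consensus, min_support=0.8):
    """Check a candidate G-to-A pair against the subfamily consensus.

    g_copy, a_copy, consensus: aligned strings of equal length.
    min_support: assumed threshold for 'most sites are G in the consensus'.
    """
    if not (len(g_copy) == len(a_copy) == len(consensus)):
        raise ValueError("sequences must be aligned to the same length")

    # Candidate sites: G in the putative source, A in the putative edited copy.
    sites = [i for i, (g, a) in enumerate(zip(g_copy, a_copy))
             if g == "G" and a == "A"]
    if not sites:
        return {"sites": [], "consensus_g_fraction": None,
                "call": "no_candidate_sites"}

    # Does the consensus carry G at the candidate sites?
    g_fraction = sum(consensus[i] == "G" for i in sites) / len(sites)

    # Is the A-containing copy further from the consensus than the G copy?
    a_more_diverged = divergence(a_copy, consensus) > divergence(g_copy, consensus)

    if g_fraction < min_support:
        call = "consensus_does_not_support_G"
    elif not a_more_diverged:
        call = "divergence_check_failed"
    else:
        call = "G_to_A_supported"
    return {"sites": sites, "consensus_g_fraction": g_fraction, "call": call}
```

The three outcomes map onto the three cases described above: all corners agree, the consensus is ambiguous or supports A, or the A-containing copy is not the more diverged one.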
But the model has assumptions. First, it assumes that a close source or source-like element still exists in the assembly. That may fail if the actual source was deleted, rearranged, incompletely assembled, or itself highly mutated. Second, it assumes the subfamily consensus is a reasonable ancestral approximation. That may fail for rapidly expanding, structured, or recombining repeat families. Third, it assumes that high similarity indicates ancestry rather than recent duplication, gene conversion, or assembly collapse.
Recent copy expansion is especially tricky. If a repeat family expands rapidly, many copies will be very similar. A detector may find several plausible G-rich partners for one A-rich edited copy. Conversely, if an edited copy itself later served as a template, descendants may share the same edited sites. A pairwise pipeline could count those descendants as separate edited elements even though the mutational burst occurred once.
A modern solution should move beyond a single best BLAST hit. It should cluster all related copies, build a local sequence graph, and infer shared derived states. Sites shared across many A-rich copies with identical flanking divergence may indicate inheritance from one edited ancestor. Sites unique to one copy are better evidence of independent editing. This distinction matters enormously for estimating how often APOBEC attacked retroelements.
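As a rough sketch of the shared-versus-private distinction, the following Python separates derived A sites carried by several copies from those seen in only one. It assumes an ungapped multiple alignment and an ancestral proxy sequence (for example a subfamily consensus). The names `derived_sites` and `partition_sites` are hypothetical, and the sketch does not model flanking divergence or build a sequence graph, so it is only a first pass.

```python
from collections import defaultdict


def derived_sites(copy, proxy):
    """Columns where the ancestral proxy has G and this copy has A."""
    return frozenset(i for i, (c, p) in enumerate(zip(copy, proxy))
                     if p == "G" and c == "A")


def partition_sites(copies, proxy):
    """Split derived G-to-A sites into shared and private.

    copies: dict mapping copy name -> aligned sequence.
    proxy:  aligned ancestral proxy (an assumption of this sketch).
    Returns (shared, private):
      shared  maps site -> sorted list of copies carrying it (two or more),
      private maps site -> the single copy carrying it.
    """
    carriers = defaultdict(list)
    for name, seq in copies.items():
        if len(seq) != len(proxy):
            raise ValueError(f"{name} is not aligned to the proxy")
        for site in derived_sites(seq, proxy):
            carriers[site].append(name)

    shared = {s: sorted(names) for s, names in carriers.items() if len(names) > 1}
    private = {s: names[0] for s, names in carriers.items() if len(names) == 1}
    return shared, private


if __name__ == "__main__":
    proxy = "GGCGTGAG"
    copies = {
        "c1": "AGCATGAG",  # derived A at columns 0 and 3
        "c2": "AGCATGAA",  # same two sites plus a private one at column 7
        "c3": "GGCGTAAG",  # one private site at column 5
    }
    shared, private = partition_sites(copies, proxy)
    print("shared:", shared)
    print("private:", private)
```

In this toy input, `c1` and `c2` share two derived sites, which is the pattern expected if one edited ancestor was later copied. Counting them as two independent editing events would overstate the number of episodes.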
The consensus sequence also deserves careful handling. Repeat consensus sequences are often constructed from extant copies and can be biased toward abundant young subfamilies. If an edited sublineage is overrepresented, the consensus can absorb edited bases and reduce sensitivity. Subfamily-specific consensus construction helps, but only if subfamilies are finely resolved. For complex families, phylogeny-aware ancestral reconstruction may outperform simple consensus comparisons.
Another useful control is reciprocal direction testing. Instead of only asking whether the A-containing copy is edited relative to G, ask whether an A-to-G model explains the data equally well. If G-to-A has strong motif enrichment and A-to-G does not, the APOBEC model gains support. If both directions look similar, the case should be downgraded.
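One way to operationalise this control is to score the same pair twice, swapping which copy plays the source. The sketch below is an illustration under stated assumptions: the sequences are aligned without gaps, the motif is "G followed by a chosen downstream base" on the reported strand, and the default downstream base `G` (a GG-to-AG pattern) is an assumption of the example rather than something specified in the text. The function names are hypothetical.

```python
def motif_stats(source, edited, downstream="G"):
    """Count G(source)->A(edited) mismatches and how many sit in the motif.

    The motif is a G in the putative source followed by `downstream`.
    The last column is skipped because it has no downstream base.
    """
    if len(source) != len(edited):
        raise ValueError("sequences must be aligned to the same length")

    sites = sites_in_motif = source_g = source_g_in_motif = 0
    for i in range(len(source) - 1):
        if source[i] != "G":
            continue
        in_motif = source[i + 1] == downstream
        source_g += 1                 # background: every G in the source
        source_g_in_motif += in_motif
        if edited[i] == "A":          # candidate edited site
            sites += 1
            sites_in_motif += in_motif
    return {"sites": sites, "sites_in_motif": sites_in_motif,
            "source_g": source_g, "source_g_in_motif": source_g_in_motif}


def reciprocal_test(x, y, downstream="G"):
    """Score both orientations: x as source of y, and y as source of x."""
    return {"x_to_y": motif_stats(x, y, downstream),
            "y_to_x": motif_stats(y, x, downstream)}


if __name__ == "__main__":
    x = "CGGTCGGTCGATCGT"
    y = "CAGTCAGTCGGTCGT"
    for direction, stats in reciprocal_test(x, y).items():
        print(direction, stats)
```

Comparing `sites_in_motif / sites` against the background `source_g_in_motif / source_g` in each orientation gives a crude enrichment signal. A real analysis would add a statistical test and handle gaps and strand, which this sketch omits.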
Finally, a detector should report the object it has inferred. Did it infer edited sites, edited copies, source-copy relationships, or independent ancestral editing episodes? These are different biological quantities. Pairwise source-copy methods are excellent for detecting candidate edited copies. They are less reliable for counting the number of original APOBEC-exposed molecules unless duplicate collapse and phylogenetic reconstruction are added.
Key technical takeaway: APOBEC detection depends on reconstructing mutation direction. Consensus and source-copy filters are powerful, but recent expansion and shared ancestry can blur the difference between many edited copies and many independent editing events.