Thursday, December 18, 2025

Post 3: Tools of the Trade — How Scientists Build Genomic Trees

 

Behind every elegant phylogeny lies a maze of data wrangling, alignment, modeling, and computation.


1. From Raw Genomes to Trees: The Journey of a Phylogenomic Analysis

Building a genome-wide phylogeny is no longer a weekend project. It’s a multi-stage process that transforms raw genomic sequences into a hypothesis of evolutionary relationships. Each stage has its own theory, best practices, and pitfalls.

Broadly, a phylogenomic pipeline includes:

  1. Data collection and ortholog identification

  2. Multiple sequence alignment (MSA)

  3. Alignment curation and quality control

  4. Model selection and phylogenetic inference

  5. Tree evaluation, visualization, and interpretation

What follows is a guided tour through each stage — not as a rigid protocol, but as an evolving toolkit that researchers can tailor to their question, taxon, and data type.


2. Step 1 — Data Collection: What Genomes and Which Genes?

Before running any software, the first question is: What data will actually be used?

a. Reference vs. Non-reference Genomes

Most phylogenomic studies begin with publicly available assemblies (e.g., NCBI, Ensembl, UCSC).
However, for underrepresented clades, researchers may need to generate or assemble new genomes, or supplement with transcriptomes.

Reference genomes offer completeness and quality control but may represent only a small subset of biodiversity. The choice depends on trade-offs between taxonomic coverage and annotation reliability.

b. Choosing Orthologous Genes

Genome-wide does not mean every gene. The key is to find orthologs — genes that diverged by speciation rather than duplication. Including paralogs (duplicated genes) can mislead analyses.

Common strategies include:

  • Reciprocal BLAST or best reciprocal hit (BRH) to detect ortholog pairs.

  • Clustering tools such as OrthoFinder, OrthoMCL, OMA, or SonicParanoid, which group genes into orthogroups using similarity networks.

  • Phylogenetic orthology inference, where gene trees themselves are used to separate orthologs and paralogs (as in PhylomeDB).
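
To make the first strategy concrete, here is a minimal sketch of reciprocal best hits in Python. It assumes the similarity scores (for example BLAST bit scores) have already been parsed into dictionaries; the function names and the toy gene IDs are made up for illustration and do not come from any particular tool.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score,
    where a higher score means a better match.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: A2's best hit is B1, but B1 prefers A1,
# so A2 is not reported as an ortholog of B1.
a_vs_b = {
    ("A1", "B1"): 250.0, ("A1", "B2"): 90.0,
    ("A2", "B1"): 120.0, ("A2", "B2"): 80.0,
    ("A3", "B3"): 300.0,
}
b_vs_a = {
    ("B1", "A1"): 248.0, ("B1", "A2"): 118.0,
    ("B2", "A1"): 88.0,
    ("B3", "A3"): 299.0,
}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy case A2 and B2 are left unpaired, which is exactly the kind of gene a clustering or tree-based method would place more carefully.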

c. Filtering for Single-Copy Orthologs

To avoid paralogy problems, many workflows focus on single-copy orthologs — genes present exactly once per genome. Tools like BUSCO (Benchmarking Universal Single-Copy Orthologs) provide lineage-specific gene sets (e.g., mammalia_odb10).

Still, strict filtering can remove genes informative for deep divergences. Many modern studies therefore use orthogroups rather than strictly single-copy loci, modeling duplication events explicitly in species tree inference.
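
The single-copy filter itself is simple enough to sketch. The snippet below assumes orthogroups are already available as lists of (species, gene) pairs; the `min_occupancy` parameter is an illustrative addition that lets a group through even if a few species are missing, and is not a setting taken from BUSCO or OrthoFinder.

```python
from collections import Counter


def single_copy_orthogroups(orthogroups, species, min_occupancy=1.0):
    """Keep orthogroups in which no species has more than one gene.

    orthogroups: dict mapping a group name to a list of (species, gene_id).
    species: list of all species in the study.
    min_occupancy: minimum fraction of species that must be present.
    """
    kept = {}
    for name, members in orthogroups.items():
        counts = Counter(sp for sp, _ in members)
        if any(n > 1 for n in counts.values()):
            continue  # at least one species carries a duplicate
        if len(counts) / len(species) >= min_occupancy:
            kept[name] = members
    return kept


# Hypothetical orthogroups for four species.
species = ["human", "mouse", "dog", "cow"]
orthogroups = {
    "OG1": [("human", "h1"), ("mouse", "m1"), ("dog", "d1"), ("cow", "c1")],
    "OG2": [("human", "h2"), ("human", "h3"), ("mouse", "m2"), ("dog", "d2"), ("cow", "c2")],
    "OG3": [("human", "h4"), ("mouse", "m4"), ("dog", "d4")],
}
strict = single_copy_orthogroups(orthogroups, species)
relaxed = single_copy_orthogroups(orthogroups, species, min_occupancy=0.75)
```

The strict setting keeps only OG1, while the relaxed one also keeps OG3, which illustrates the trade-off between clean data and the number of loci retained.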


3. Step 2 — Multiple Sequence Alignment (MSA): Lining Up Evolution

Once orthologs are identified, sequences must be aligned to identify homologous positions — the raw material for phylogenetic signal.

a. Protein vs. Nucleotide Alignments

When genomes are well-annotated, researchers often align amino acid sequences first (more conserved, less saturated) and then back-translate to nucleotides for codon-aware analyses.
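
Back-translation is mostly bookkeeping: every aligned residue gets its original codon, and every gap becomes three gap characters. A minimal sketch, assuming the coding sequence has had its stop codon removed and translates exactly to the ungapped protein (the function name is hypothetical):

```python
def thread_codons(aligned_protein, cds):
    """Project a gapped protein sequence back onto its coding DNA."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be 3x the number of residues")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")  # a protein gap spans one whole codon
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Protein "MKV" aligned as "M-KV"; its CDS is ATG AAA GTT.
codon_alignment = thread_codons("M-KV", "ATGAAAGTT")
```

Because gaps are inserted in multiples of three, the reading frame is preserved, which is what codon-aware downstream analyses rely on.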

b. Alignment Algorithms

The field is dominated by a few widely used programs:

Tool | Key Features | When to Use
MAFFT | Fast, accurate, handles large datasets | General-purpose alignment
MUSCLE | Fast for small datasets | Simple protein or nucleotide alignments
PRANK | Accounts for insertions/deletions evolutionarily | When indel accuracy matters
MACSE | Codon-aware, detects frameshifts and stop codons | Coding DNA alignments
Clustal Omega | Stable, easy to automate | Small to medium-scale alignments

MAFFT remains the workhorse for large datasets due to its scalability and accuracy. However, codon-aware tools like MACSE are essential for preserving reading frame information in coding sequences.

c. Alignment Trimming

Not all aligned sites are trustworthy. Poorly aligned or ambiguously homologous regions can introduce noise. Tools such as trimAl, BMGE, and Gblocks help remove unreliable columns while retaining informative positions.

In some cases (especially deep divergences), trimming can make or break the accuracy of a tree.
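
To show what column trimming means in practice, here is a deliberately simple gap-threshold filter. It is not the algorithm implemented by trimAl, BMGE, or Gblocks, which use more refined criteria; the function name and the 50% threshold are assumptions chosen for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop columns whose fraction of gaps exceeds max_gap_fraction.

    alignment: dict mapping taxon name to an aligned sequence.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Column 4 (1-based) is a gap in three of four taxa and is removed.
toy_alignment = {
    "sp1": "ATG-CA",
    "sp2": "AT--CA",
    "sp3": "ATGTCA",
    "sp4": "A---CA",
}
trimmed = trim_gappy_columns(toy_alignment)
```

Raising or lowering the threshold shifts the balance between noise removed and signal lost, which is why trimming settings deserve as much thought as the choice of aligner.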


4. Step 3 — Quality Control: Garbage In, Garbage Out

Genome-wide phylogenies are only as good as their data.
Quality control involves filtering alignments for coverage, divergence, and compositional bias.

a. Checking Alignment Quality

Metrics such as average pairwise identity, alignment length, and percentage of gaps can flag problematic genes. Many researchers discard genes with excessive missing data or extremely short alignments (<100 codons).
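
These metrics are easy to compute directly. The sketch below assumes an alignment stored as a dictionary of equal-length strings; the 100-codon minimum comes from the text above, while the 50% gap ceiling and all function names are illustrative assumptions.

```python
from itertools import combinations


def gap_fraction(alignment):
    """Fraction of all alignment cells that are gaps."""
    total = sum(len(seq) for seq in alignment.values())
    gaps = sum(seq.count("-") for seq in alignment.values())
    return gaps / total


def pairwise_identity(seq_a, seq_b):
    """Identity over columns where neither sequence has a gap."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def mean_pairwise_identity(alignment):
    """Average identity over all pairs of sequences."""
    pairs = list(combinations(alignment.values(), 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def passes_qc(alignment, min_codons=100, max_gap_fraction=0.5):
    """Flag alignments that are long enough and not dominated by gaps."""
    length = len(next(iter(alignment.values())))
    return length >= 3 * min_codons and gap_fraction(alignment) <= max_gap_fraction


qc_toy = {"a": "ATGC", "b": "ATGA", "c": "AT-C"}
toy_gaps = gap_fraction(qc_toy)
toy_identity = mean_pairwise_identity(qc_toy)
toy_ok = passes_qc(qc_toy)  # far too short to pass the length filter
```

Running such checks over every locus before tree inference gives a quick table of candidates for removal or manual inspection.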

b. Detecting Saturation

Highly saturated sequences, where multiple substitutions at the same site overwrite earlier ones, lose phylogenetic information. Programs like DAMBE, which implements an index of substitution saturation, can quantify this; plotting uncorrected pairwise distances against model-corrected distances is a common visual check. Filtering out saturated loci helps avoid misleading signal in deep-time analyses.
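
One way to see saturation is to compare the observed proportion of differing sites (the p-distance) with a distance corrected for multiple hits. The sketch below uses the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), as the simplest possible stand-in; it is not the test implemented in DAMBE, and the function names are illustrative.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with gaps."""
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        return math.inf  # fully saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


observed = p_distance("ACGTACGT", "ACGTACGA")
corrected = jc69_distance(observed)

# The gap between observed and corrected distance widens as p grows.
gaps = [jc69_distance(p) - p for p in (0.1, 0.3, 0.5, 0.7)]
```

When the observed distances of a locus plateau while the corrected distances keep growing, that locus is carrying little usable signal for the deepest splits.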

c. Outlier Detection

Some loci behave badly: they might support topologies inconsistent with the rest of the genome. Methods like TreeShrink, PhyloMCOA, or gene concordance factor (gCF) analyses can identify and exclude such outliers.


5. Step 4 — Model Selection: Evolution Isn’t One-Size-Fits-All

Before inferring a tree, one must decide how evolution is modeled.

a. Substitution Models

Molecular evolution is governed by substitution processes — how one nucleotide or amino acid changes into another. Models range from simple (Jukes-Cantor) to complex (GTR+Γ+I, LG4X, mixture models).

  • ModelFinder, built into IQ-TREE, automates model testing using information criteria (AIC, AICc, BIC).

  • For codon data, codon substitution models (e.g., MG94, GY94) capture synonymous and nonsynonymous variation.

  • For amino acids, empirical models (WAG, LG, JTT) or profile mixture models (C10 to C60, often run through the faster PMSF approximation) are popular.
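
The criteria used for model testing all trade goodness of fit against the number of free parameters: AIC = 2k - 2 lnL, AICc adds a small-sample correction, and BIC = k ln(n) - 2 lnL. A minimal sketch follows; the log-likelihoods and parameter counts are invented for illustration and are not output from any real analysis.

```python
import math


def information_criteria(log_likelihood, n_params, n_sites):
    """Return AIC, AICc, and BIC for one fitted model."""
    aic = 2 * n_params - 2 * log_likelihood
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    bic = n_params * math.log(n_sites) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Hypothetical fits on a 1000-site alignment: (lnL, free parameters).
candidates = {
    "JC": (-5230.0, 0),
    "HKY+G": (-5105.0, 5),
    "GTR+G+I": (-5101.0, 10),
}
n_sites = 1000
table = {
    name: information_criteria(lnl, k, n_sites)
    for name, (lnl, k) in candidates.items()
}
best_by_bic = min(table, key=lambda name: table[name]["BIC"])
```

In this made-up example the richest model gains only four log-likelihood units for five extra parameters, so both AIC and BIC prefer the intermediate model.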

b. Partitioning Strategies

Different genes or codon positions often evolve at different rates. Partitioning allows each subset to have its own model parameters. IQ-TREE and PartitionFinder can optimize partitioning schemes based on model fit.

c. Compositional Heterogeneity

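When base or amino acid composition varies across sites or lineages, simple models can be misled by convergent biases. Site-heterogeneous models (e.g., CAT in PhyloBayes) and their posterior mean site frequency (PMSF) approximation address variation across sites; variation across lineages is usually handled by recoding the data, using non-stationary models, or excluding the most biased taxa and loci.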


6. Step 5 — Tree Inference: Building the Phylogeny

This is where biological insight meets computational brute force.

a. Maximum Likelihood (ML) Methods

ML methods estimate the tree topology and branch lengths that maximize the likelihood of observing the data under a given model.

  • RAxML-NG: successor to RAxML, optimized for large datasets and parallel computation.

  • IQ-TREE 2: fast, feature-rich, includes ultrafast bootstrapping, concordance analysis, and model selection.

  • FastTree: ultrafast approximation for preliminary analyses.

ML analysis of a concatenated alignment is fast and powerful but assumes that all genes share a single underlying tree, a limitation when gene histories differ.
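
The quantity that ML programs maximize can be computed for a fixed tree with Felsenstein's pruning algorithm. The sketch below does this under the Jukes-Cantor model for gap-free nucleotide data. The tree encoding (a leaf is a name, an internal node is a tuple of (child, branch length) pairs) is an assumption made for this example, and real tools add tree search, richer models, and numerical safeguards on top.

```python
import math

BASES = "ACGT"


def jc69_transition(t):
    """Return (P(same base), P(specific different base)) after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


def partial_likelihoods(node, column):
    """Likelihood of the data below `node` for each possible base at `node`."""
    if isinstance(node, str):  # leaf: only the observed base is possible
        return [1.0 if base == column[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, length in node:
        child_partials = partial_likelihoods(child, column)
        p_same, p_diff = jc69_transition(length)
        for i in range(4):
            result[i] *= sum(
                (p_same if i == j else p_diff) * child_partials[j]
                for j in range(4)
            )
    return result


def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, with equal base frequencies at the root."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    total = 0.0
    for site in range(n_sites):
        column = {name: alignment[name][site] for name in names}
        partials = partial_likelihoods(tree, column)
        total += math.log(sum(0.25 * p for p in partials))
    return total


# Hypothetical three-taxon tree and alignment.
tree = (((("human", 0.1), ("chimp", 0.1)), 0.05), ("mouse", 0.3))
alignment = {"human": "ACGT", "chimp": "ACGT", "mouse": "ACGA"}
lnl = log_likelihood(tree, alignment)
```

An ML search repeats this calculation for many candidate topologies and branch lengths and keeps the combination with the highest value.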

b. Bayesian Inference

Bayesian approaches estimate a posterior distribution of trees, integrating over uncertainty in model parameters rather than fixing them at point estimates.

  • MrBayes: widely used, suitable for small to moderate datasets.

  • PhyloBayes: implements complex site-heterogeneous models (e.g., CAT-GTR).

  • BEAST: specializes in time-calibrated and relaxed molecular clock analyses.

Bayesian inference is computationally intensive but provides rich statistical outputs, such as posterior probabilities and divergence time estimates.

c. Coalescent-Based Species Tree Methods

When gene trees disagree, coalescent methods estimate the species tree that best explains the observed gene tree discordance.

  • ASTRAL: most widely used summary method, scales well to thousands of loci.

  • MP-EST, STAR, and SVDquartets: alternative approaches under the multispecies coalescent framework.

  • BPP: Bayesian species tree estimation with explicit population parameters.

Coalescent methods embrace, rather than ignore, gene tree heterogeneity — making them indispensable for recent radiations or rapid divergences.
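
A classic result shows why short internal branches produce so much discordance. For three species, the multispecies coalescent predicts that a gene tree matches the species tree with probability 1 - (2/3)e^(-T), where T is the internal branch length in coalescent units, and that each of the two discordant topologies occurs with probability (1/3)e^(-T). A minimal sketch, with a hypothetical function name:

```python
import math


def three_taxon_gene_tree_probabilities(t):
    """Gene tree topology probabilities for an internal branch of length t.

    t is measured in coalescent units, so larger values mean a longer
    branch or a smaller ancestral population.
    """
    discordant_each = math.exp(-t) / 3.0
    return {
        "concordant": 1.0 - 2.0 * discordant_each,
        "discordant_1": discordant_each,
        "discordant_2": discordant_each,
    }


rapid = three_taxon_gene_tree_probabilities(0.0)
moderate = three_taxon_gene_tree_probabilities(1.0)
slow = three_taxon_gene_tree_probabilities(5.0)
```

At T = 0 all three topologies are equally likely, and even at T = 1 roughly a quarter of genes are expected to disagree with the species tree, which is why summary methods treat discordance as data rather than error.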


7. Step 6 — Support and Concordance Measures

a. Bootstrap and Ultrafast Bootstrap

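Traditional nonparametric bootstrap resamples alignment columns with replacement and re-infers the tree from each replicate to assess branch stability.
IQ-TREE’s ultrafast bootstrap (UFBoot) offers a rapid alternative, producing support values (0–100%) for each node. The two scales are read differently: a UFBoot value of 95% or higher is usually treated as strong support, compared with roughly 70% for the standard bootstrap.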
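
The resampling step of the standard bootstrap fits in a few lines. This sketch only builds the pseudo-replicate alignments; in a real analysis each replicate would be passed to a tree inference program, and UFBoot uses a different, faster approximation. The function name, toy data, and seed are illustrative.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows intact."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


toy = {"sp1": "ACGTTA", "sp2": "ACGTCA", "sp3": "ATGTCA"}
rng = random.Random(42)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
```

The support for a branch is then the percentage of replicate trees in which that branch appears.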

b. Gene and Site Concordance Factors (gCF, sCF)

Developed to quantify agreement among loci and sites, these metrics indicate what proportion of genes (or sites) support a given branch.

  • High bootstrap + low gCF → strong overall signal but high conflict among genes.

  • Low bootstrap + high gCF → genes agree, but total signal is weak.

Such measures help interpret confidence in branches beyond raw support numbers.
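
At its core, a gene concordance factor is a count. The sketch below assumes every gene tree contains all taxa and has already been reduced to a set of splits, each written as the group of taxa on the side away from a fixed outgroup. Implementations such as the one in IQ-TREE also handle gene trees with missing taxa, which this example ignores; the names here are hypothetical.

```python
def split(*taxa):
    """Represent one branch by the taxa on the side away from the outgroup."""
    return frozenset(taxa)


def gene_concordance_factor(branch, gene_trees):
    """Percentage of gene trees that contain `branch`."""
    supporting = sum(1 for splits in gene_trees if branch in splits)
    return 100.0 * supporting / len(gene_trees)


# Five hypothetical gene trees for taxa A to E, with E as the outgroup.
gene_trees = [
    {split("A", "B"), split("A", "B", "C")},
    {split("A", "B"), split("A", "B", "C")},
    {split("B", "C"), split("A", "B", "C")},
    {split("A", "B"), split("C", "D")},
    {split("A", "B"), split("A", "B", "D")},
]
gcf_ab = gene_concordance_factor(split("A", "B"), gene_trees)
gcf_abc = gene_concordance_factor(split("A", "B", "C"), gene_trees)
```

Here the A+B branch is found in four of five gene trees and the A+B+C branch in three, so the second branch is the one where conflict among loci deserves a closer look.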

c. Bayesian Posterior Probabilities

In Bayesian trees, posterior probabilities give the probability of each clade given the data, model, and priors.
Values close to 1.0 indicate strong support, but caution is needed: posteriors are not directly comparable to bootstraps and tend to be higher for the same branch.


8. Step 7 — Tree Visualization and Interpretation

A tree is only as useful as it is interpretable. Visualization tools help summarize and communicate complex results.

Tool | Features | Notes
FigTree | Simple, fast, publication-quality trees | Great for rooted trees
iTOL (Interactive Tree of Life) | Web-based, supports annotations and metadata | Ideal for large trees
ETE Toolkit / DendroPy | Python libraries for programmatic tree manipulation | Integrate with pipelines
Archaeopteryx, SplitsTree | Visualize networks, gene tree conflicts | For non-tree-like relationships

Overlaying metadata (e.g., species traits, genome size, ecological niche) can transform static trees into multidimensional evolutionary narratives.


9. Integrative Workflows and Automation

Modern analyses involve hundreds of loci and complex parameter sets. Manual workflows are prone to error and irreproducibility.
Enter workflow managers:

  • Snakemake and Nextflow: automate pipelines, manage dependencies, ensure reproducibility.

  • PhyloSuite, PhyloPyPruner, and SUPERSMART: integrate multiple phylogenetic tools under unified interfaces.

  • Docker / Singularity containers: encapsulate software environments for cross-platform reproducibility.

Version control (e.g., Git, with repositories hosted on GitHub) and metadata documentation are now as important as alignment accuracy. A growing number of journals require phylogenetic reproducibility statements.


10. Evaluating and Comparing Trees

Even after inference, the real work often begins: testing whether your tree makes biological sense.

a. Topological Comparisons

Programs like TreeCmp compute metrics such as the Robinson-Foulds (RF) distance or quartet distance to compare alternative trees quantitatively.
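
The Robinson-Foulds distance is simply the number of splits found in one tree but not the other. A minimal sketch, assuming both trees share the same leaf set and are written as nested tuples of leaf names (an encoding chosen for this example rather than a standard format):

```python
def leaf_set(tree):
    """All leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of a tree, each oriented away from a reference leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf so each split has one representation
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = clade if anchor not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(tree_1, tree_2)
```

The two toy trees agree on the D+E grouping but disagree on whether A pairs with B or with C, so each contributes one unique split and the distance is 2.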

b. Hypothesis Testing

Statistical tests like the Approximately Unbiased (AU) test, Shimodaira–Hasegawa (SH) test, or Bayes Factors evaluate whether one topology fits the data significantly better than another.

c. Cross-Validation with Other Data

Always compare genomic trees with morphological, ecological, or fossil evidence. The most robust phylogenies integrate multiple lines of data rather than relying on one data type alone.


11. Putting It Together: Example Workflows

Example 1 — Concatenated Supermatrix Approach

  1. Identify orthologs (OrthoFinder)

  2. Align (MAFFT, codon-aware with MACSE)

  3. Trim (trimAl)

  4. Concatenate alignments (AMAS)

  5. Model selection and inference (IQ-TREE with partition models)

  6. Assess support (UFBoot + gCF/sCF)

  7. Visualize (iTOL)
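
Step 4 is the easiest part of this workflow to picture in code. The sketch below joins per-gene alignments into a supermatrix, pads missing taxa with gaps, and records where each gene starts and ends so a partitioned model can be applied later. It is a toy stand-in for what a tool like AMAS does, with hypothetical names and data.

```python
def concatenate(gene_alignments, taxa):
    """Build a supermatrix and 1-based partition boundaries.

    gene_alignments: dict mapping gene name to {taxon: aligned sequence}.
    taxa: list of all taxa; a taxon missing from a gene is padded with gaps.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


genes = {
    "g1": {"sp1": "ATG", "sp2": "ATA"},
    "g2": {"sp1": "CCGG"},  # sp2 has no sequence for this gene
}
supermatrix, partitions = concatenate(genes, ["sp1", "sp2"])
```

The partition list is what lets the inference step assign each gene its own model parameters instead of treating the supermatrix as one homogeneous block.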

Example 2 — Coalescent Species Tree Approach

  1. Identify orthologs

  2. Align and trim each gene

  3. Infer individual gene trees (IQ-TREE or RAxML)

  4. Summarize with ASTRAL

  5. Evaluate concordance and visualize conflict (PhyParts, BUCKy)

Each approach has trade-offs. Concatenation captures more total signal but risks overconfidence under model violation; coalescent methods handle conflict better but depend on accurate gene trees.


12. Emerging Trends in Phylogenomic Tool Development

The field is rapidly evolving, with exciting directions:

a. Scalable Phylogenetics

Tools like ExaML, RAxML-NG, and ParGenes leverage high-performance computing, using multithreading and MPI parallelism across cluster nodes, to handle alignments with thousands of taxa and loci.

b. Machine Learning and Differentiable Phylogenetics

AI-driven methods are emerging to learn substitution models directly from data or approximate tree likelihoods using neural networks (e.g., DeepTree, Phyloformer). While experimental, these approaches hint at a new era of model-free phylogenetic inference.

c. Integrated Gene–Species Tree Reconciliation

Next-generation methods (e.g., GeneRax, SpeciesRax) jointly infer gene trees and species trees, accounting for duplication, loss, and horizontal transfer — a true synthesis of genome evolution.

d. Real-Time and Streaming Phylogenetics

For outbreak tracking and environmental metagenomics, tools like Nextstrain and UShER allow near-real-time tree updates as new sequences arrive.


13. Lessons from the Trenches

After two decades of phylogenomics, several practical lessons have emerged:

  • Don’t chase the biggest dataset; chase the cleanest.
    Hundreds of well-behaved genes beat thousands of messy ones.

  • Always inspect alignments manually — at least for a subset.
    Automated tools can’t catch every misalignment.

  • Model complexity is good, but parsimony still matters.
    Over-parameterized models can overfit small datasets.

  • Conflict is inevitable — understand it, don’t hide it.
    Gene tree discordance carries valuable evolutionary information.

  • Reproducibility is the new rigor.
    Share code, document steps, and make your workflow transparent.


14. The Takeaway

Genome-wide phylogenies are now central to evolutionary biology, ecology, and genomics. But they are not built by pressing a “build tree” button. Each tree represents thousands of decisions — biological, computational, and conceptual.

Understanding the tools behind the trade transforms phylogenetics from a black box into a creative, critical, and evidence-driven science. It’s this combination of biology, statistics, and computation that makes phylogenomics both challenging and profoundly rewarding.


Next Up:

In Post 4 — “Databases of Life: Where Genome-Wide Phylogenies Live”, we’ll explore how these trees are stored, shared, and reused — from TreeBASE and PhylomeDB to the Open Tree of Life — and discuss the promise and pitfalls of turning vast evolutionary data into living, connected databases.
