Thursday, February 26, 2015

Dog genome annotation quality has improved during the past 30 Ensembl releases

Quality of genome annotation changes with each new Ensembl release. How much does the quality actually change? 

We can quantify the amount of change in annotation quality by looking at specific parameters like number of genes, transcripts, exons etc. Here, the annotation of the dog genome from Ensembl release 48 to 78 is compared to understand the effect of change in genome assembly as well as role of annotation curation. 

First, we see the number of annotated genes(y-axis) in the Dog genome vs the Ensembl release(x-axis). It can be seen that the number of genes steadily  keeps increasing with each release. However, the sharp drop in number of genes in release 68 is surprising. The drop in number of genes from release 67 to 68 is less surprising as that release was accompanied by a change in genome assembly from BROADD2 to CanFam3.1
Various other features of the genome annotation such as Number of transcripts per Gene, Median coding length, Average Exons per gene and 3' UTR count change in a very different way (see below figure)

While such large changes in the genome annotation between releases might be scary, the engine of science chugs on. In the broader scheme of things, the fact that the Dog has 23,062 genes or 24,660 genes should hopefully not matter for the Dog breeders who seem to be driven by the discoveries made possible by the Dog genome and its annotation.

Monday, February 23, 2015

Effect of genotype qaulity on the site frequency spectrum

The below shell script takes vcf file as an argument in the for loop and calculates the allele frequencies using VCFTools. These allele frequencies are used to calculate a folded SFS by taking only the major alleles in the calculation.
 for VCF in vcf_file_name.vcf  
 mkdir "$VCF".dir  
 cd "$VCF".dir  
 cp ../data/"$VCF" .  
 for GQ in 20 40 60 80  
 #calculate allele frequencies with VCFtools  
 vcftools --vcf "$VCF" --freq --out --minGQ "$GQ"  
 #Take vcf file and extract columns with allele frequency data, in this case the freq output we want columns five and six  
 cut -f5 > test_col1  
 cut -f6 > test_col2  
 cat test_col1 test_col2 > test_append  
 #removes blank line  
 sed -e '/^ *$/d' test_append > test_append2  
 #get rid of first line and first two characters (a/t/g/c:)  
 sed 1d test_append2 > test_append3  
 cat test_append3 | sed 's/^..//' > FILENAME_AFS  
 #To keep only values below 0.50  
 awk '$NF <0.50' FILENAME_AFS > FILENAME_AFS_50."$GQ"  
 echo "0" >> FILENAME_AFS_50."$GQ"  
 echo "0.1" >> FILENAME_AFS_50."$GQ"  
 echo "0.2" >> FILENAME_AFS_50."$GQ"  
 echo "0.3" >> FILENAME_AFS_50."$GQ"  
 echo "0.4" >> FILENAME_AFS_50."$GQ"  
 echo "0.5" >> FILENAME_AFS_50."$GQ"  
 rm FILENAME_AFS test_append test_append2 test_append3 test_col1 test_col2  
 cp ../makebars.r .  
 Rscript makebars.r  
 cd ..  

The rscript makebars is used to read in each of the site frequency spectrum's at the different genotype quality thresholds and generate a barplot of the SFS.
 data.m <- melt(data, id.vars='Names')  
 cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")  
 ggplot(data.m, aes(Names, value,fill = variable)) + geom_bar(stat="identity",position="dodge")+ xlab("Allele frequency bins") + ylab("Number of sites")  
When the VCF file from a published study is run through the script will generate a graph like below:

Friday, February 20, 2015

NOD Mice Substrain Divergence - Runs of Homozygosity

Divergence or rather Differentiation (to use the right terminology), between natural ecotypes or subspecies has been studied in multiple pairs. The broad-scale patterns and underlying processes for such genome wide changes has been studied in detail

Do similar processes govern the "divergence" of  artificially inbred strains? Is Drift too strong for background selection to be relevant in such inbred lines? Data [Genetic Analysis of Substrain Divergence in NOD Mice] from a recent study of NOD mice sub-strains might provide some answers. While the study was restricted to sites fixed to homozygosity, it does provide some idea of the major regions that differ between the mouse strains. More detailed analysis of the mouse strains including multiple individuals might be necessary to understand if the diversity landscape of the genome has played any role in shaping the accumulation of genetic differences. 

The Supplementary table 4 provides a list of all the SNP's genotyped using the Mouse Diversity Array. We try to identify runs of homozygosity that share the same genealogy and note if these regions correspond to the deletions identified and validated in the study.

ChrPos (mm10)Idd.regionSnpIdTypeBomTacMrkTacShiJclShiLtDvsShiLt
2131300866Idd13, Idd13-not B2mJAX00503778A/GAAAGA

In the above table the runs of fixed differences and what they are is listed below:

  1. First run (marked in blue) of fixed differences corresponds to the Idd26 locus.
  2. Next is the Idd5 locus. 
  3. The deletion on chromosome 3 spanning 24,272,052 and 24,383,100 has been validated by Sanger sequencing in this study.
  4. The second deletion on chromosome 3 spanning as many as 6 SNP's is located rather close to the first deletion.
The remaining clusters of 2 or 3 sites that share the same genealogy are probably too short to correspond to single events themselves. Nonetheless, it would be interesting to see if these fixed differences are localized to lower nucleotide diversity regions in the initial strains.

Tuesday, February 17, 2015

Bird RNAseq - A thing to watch out for while using the Zebra Finch genome as a reference

Long after our simulation study on RNA-seq, itseems that it is finally time to see some real world use for it. While getting cited many times might suggest some utility for it, being able to actually help real world applications like curing cancer or malaria seems a better reward. 

Downloading a dataset of reads(from a certain bird species) mapped onto the Zebra Finch genome from SRA and calculating the mean mapping quality per gene shows that genes with an Ensemble id greater than ~14,000 have a lower mean mapping quality compared to the rest of the genome. 

This result could potentially be due to the increased mis-mapping seen in our simulation study (see Figure-4). Would discarding multi-mapped reads (zero mapping quality) get rid of this bias? Should this be a issue of major concern? Hopefully, followup studies will appreciate the importance of such bias and try to avoid it or correct for it. 

Monday, February 9, 2015

Ten percent of positively selected variants are within 7 Megabase of the human genome

Ten percent of SNPs that have been found to be positively selected as per the dbPSHP are within a 7 Megabase region on chr 2. This one region is very different from all other regions that have atleast 100 positively selected SNPs in that it has more than 1500 such variants.

1    154500000    167400000    121
10    100100000    113400000    108
12    52200000    62400000    125
14    59800000    70900000    117
15    62900000    79100000    121
17    50350000    64950000    125
18    63400000    67900000    135
2    62400000    75750000    131
2    95500000    101450000    111
2    131100000    137750000    1581
2    175250000    181150000    150
2    194600000    201750000    150
22    28200000    43700000    106
5    9800000    15600000    111
5    106700000    113850000    108
6    25050000    40450000    144
9    97850000    101000000    106
9    122600000    135200000    168

If so many positively selected variants are within such a short distance of each other, can they be considered independent events? Should methods that estimate selection coefficient's consider the cumulative value or just per variant estimates?

Sunday, February 8, 2015

Yet another RAD-seq processing tool when genome is available

This is just a simple set of commands to create a RAD-seq processing tool when the genome is available. Since, it accepts reads mapped to genome in a bam file as input, it is capable of handling RAD-seq, exome sequencing as well as low coverage re-sequencing data. Paired-end as well as single end reads are accepted.

 module load samtools  
 module load BEDTools  

 #calculate per site coverage for each individual bam file  
 samtools depth *.sorted.bam > all.coverage  
 #sum the persite coverage across all individuals  
 awk '{for(i=3;i<=NF;i++) t+=$i; print t; t=0}'|sed 's/ /\t/g' all.coverage > persite.coverage  
 #make a histogram of persite coverage across all individuals  
 cut -f 3 persite.coverage |awk '{counts[$1] = counts[$1] + 1; } END { for (val in counts) print val, counts[val]; }'|sort -k1n,1 > persite.coverage.hist  
 #remove sites with more than 200 reads(as inferred from histogram from previous step) per site  
 awk '{for(i=3;i<=NF;i++) t+=$i; if(t<200)print $0; t=0}'|sed 's/ /\t/g' all.coverage > non.repeat.sites  
 #find sites that have atleast 5 reads in 10 individuals  
 awk '{for(i=3;i<=NF;i++) if($i > 5)t++; if(t>10)print $1,$2,$2,t; t=0}' non.repeat.sites|sed 's/ /\t/g' > wellcovered.sites  
 #collapse sites into loci  
 bedtools merge -i wellcovered.sites > wellcovered.sites.bed  
 #call SNP's for selected loci  
 samtools mpileup -l wellcovered.sites.bed -uf ref.fa *.sorted.bam | bcftools view -bvcg - > var.raw.bcf   
 bcftools view var.raw.bcf | varFilter -D100 > var.flt.vcf   

Monday, February 2, 2015

Runs of short exons in Titin gene are conserved across all Mammals

Titin gene more commonly known as Connectin is an elastic muscle protein. The complete gene sequence for this gene has been known since 2001. 

Here we compare the exon structure of Titin transcripts across multiple species with special focus on the cluster of short exons(< 100 bp) in the middle of the gene. Exon ranks of these exons is from ~100 to 180.  These exons correspond to the PEVK segments that are thought to provide the protein "spring" like properties. Comparison of the Titin gene across multiple species has previously been performed across large evolutionary distances. 

All four rodent species [Kangaroo rat (Dipodomys ordii), Mouse (Mus musculus), Pika (Ochotona princeps) and Rabbit(Oryctolagus cuniculus)] from Ensembl show the PEVK exons in phase 1 followed by a pair of 3 exons as seen the human titin gene. 

Kangroo Rat

Similarly, all 12 species from Laurasiatheria[Alpaca (Vicugna pacos), Cat (Felis catus),Cow (Bos taurus), Dog (Canis lupus familiaris), Dolphin (Tursiops truncatus), Ferret (Mustela putorius furo), Hedgehog (Erinaceus europaeus), Horse (Equus caballus), Megabat (Pteropus vampyrus), Microbat (Myotis lucifugus), Panda (Ailuropoda melanoleuca), Sheep (Ovis aries)] also have the PEVK cluster.







The European Hedgehog has only 30 exons annotated in Ensembl release 78. This might change in later releases. So Functional characterization or some other form of validation is required before considering the possibility of the loss of large number of exons.

All 8 primate genomes [Chimpanzee(Pan troglodytes), Gibbon (Nomascus leucogenys), Gorilla (Gorilla gorilla gorilla), Marmoset (Callithrix jacchus), Olive baboon (Papio anubis), Orangutan (Pongo abelii), Tarsier (Tarsius syrichta) and  Vervet-AGM (Chlorocebus sabaeus)] are very similar to the Human titin. The Gibbon TTN gene is not showing the characteristic run of short exons, but this could be due to the Gibbon genome being relatively new.




Olive babbon


While we have looked at birds before, it has not been a complete coverage. Moreover the ortholog for the TTN gene has been named the SPEG complex locus. So here we use the SPEG complex locus transcripts instead of the transcript from the gene annotated as Titin. We can see that the Chinese Turtle (Pelodiscus sinensis) shows the characteristic pattern of the PEVK exons. However, none of the transcripts annotated in birds [falbicollis_ENSFALG00000003453,ggallus_ENSGALG00000028386,acarolinensis_ENSACAG00000013938,mgallopavo_ENSMGAG00000011306 tguttata_ENSTGUG00000006089]
Chinese Turtle
show the pattern.

Another post dealing with Fish, insects and other species might be able to find other interesting changes in the Titin gene during the course of evolution.