Wednesday, October 1, 2014

Drosophila transcriptome reveals interesting patterns in splicing diversity

The Drosophila transcriptome has been analysed under different conditions to understand its diversity. Here, we re-analyse the supplementary materials (supplementary table 3 only) and find "interesting" patterns in splicing diversity.

 # read supplementary table 3 (saved locally as "supp3")  
 S <- read.table(file="supp3", header=TRUE)  
 # calculate percentage of genes with more than 1 isoform  
 sum(S$transcripts > 1) * 100 / nrow(S)  
 # calculate percentage of genes with more than 1 protein  
 sum(S$proteins > 1) * 100 / nrow(S)  
 # per-column summary (min, quartiles, mean, max)  
 summary(S)  

Over half of the 17,564 annotated genes (10,136; 57.7%) have more than one transcript isoform. However, only 37.5% (6,584) of the genes have more than one protein, suggesting that most of the transcript diversity lies outside the protein-coding regions. While genes have 17.39 transcripts on average, they have only 2.59 proteins on average.

This contrast between the number of transcripts and proteins is most pronounced in genes with very large numbers of transcripts. In fact, the gene "gish", which has 18,972 transcripts, has only 142 proteins. The plot below shows the number of records with the words "exon" and "CDS" from the "gish" gene in the new annotation.


The above figure shows that most "exon" records occur at the beginning and end of the gene, while the "CDS" records are restricted to the core of the gene. The same pattern is better captured by the authors of the study using the new metric, the 'per cent spliced in' (Ψ) index.
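
A tally like the one behind such a figure can be sketched roughly as follows. This is only a minimal illustration, not the exact procedure used for the plot: the records are made-up toy values (not the real "gish" annotation), the column names `feature`, `start` and `end` are assumed to mimic GFF/GTF fields, and splitting the gene into three equal-width bins is an arbitrary choice.

```r
# Toy annotation records for a single gene (hypothetical values, NOT real "gish" data).
# Column names are assumptions that mimic the feature/start/end fields of a GFF/GTF file.
gff <- data.frame(
  feature = c("exon", "exon", "exon", "CDS", "CDS", "exon", "exon", "CDS"),
  start   = c(100, 150, 220, 450, 520, 810, 900, 560),
  end     = c(140, 200, 260, 500, 550, 850, 950, 600),
  stringsAsFactors = FALSE
)

# Keep only the two record types of interest.
gff <- gff[gff$feature %in% c("exon", "CDS"), ]

# Split the gene span into equal-width bins (the number of bins is arbitrary).
n_bins <- 3
breaks <- seq(min(gff$start), max(gff$end), length.out = n_bins + 1)

# Assign each record to a bin by its start coordinate.
gff$bin <- cut(gff$start, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Cross-tabulate record type (rows) against position bin (columns).
counts <- table(factor(gff$feature, levels = c("exon", "CDS")),
                factor(gff$bin, levels = seq_len(n_bins)))
print(counts)
```

With real data, the toy data frame would be replaced by the "exon" and "CDS" rows of the gene read from the annotation file, and something like `barplot(counts, beside=TRUE)` could then draw the two record types side by side along the gene.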




