Tuesday, September 25, 2012

The story of missing genes

The recent explosion in the number of genomes being sequenced has led to interesting experiments that use genome assemblies to test hypotheses about gene loss and gain, pathway enrichment, gene repertoire evolution and so on. However, many reviews that have analysed NGS assemblies have shown that genome assembly quality and annotation methodology determine the apparent gene content.

Human genome assemblies generated with NGS data are completely missing more than 83 genes that are available in the Sanger genome assemblies. While no single functional category of genes is missing from the assemblies, genes located in regions that are difficult to assemble are often lost. MHC genes are notorious for being difficult to assemble. Analysis of the MHC region using draft assemblies can lead to underestimating gene numbers or wrongly predicting the loss of genes in a lineage because of poor assembly quality.

Various programs have been developed to find these "missing genes" that are either lost in gaps in the assembly, too fragmented to be readily recognizable, or even misassembled. For example, the IMAGE genome assembler tries to find the parts that are lost in gaps. SOAP GapCloser also closes gaps in the assembly. GapFiller, from the makers of SSPACE, claims to be superior to both IMAGE and SOAP GapCloser in its ability to predict gap sizes and to use multiple libraries without a corresponding increase in memory usage. Being able to replace N's in genome assemblies with the "correct bases" is very useful for recovering features of the genome that were essentially lost in gaps.
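Before running any of these tools it helps to know how much of an assembly is actually sitting in gaps. The following is a minimal Python sketch, assuming a plain FASTA assembly in which gaps are runs of N. The function names `read_fasta` and `gap_summary` are my own and are not part of IMAGE, GapCloser or GapFiller.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    # The last record has no following header, so emit it here.
    if header is not None:
        yield header, "".join(chunks)


def gap_summary(sequence):
    """Return (number of gaps, total gap bases); a gap is a run of N or n."""
    runs = [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]
    return len(runs), sum(runs)


def summarize(lines):
    """Return one (name, length, gap count, gap bases) tuple per scaffold."""
    return [(name, len(seq)) + gap_summary(seq) for name, seq in read_fasta(lines)]
```

Calling `summarize(open("assembly.fasta"))` (the file name is only a placeholder) before and after gap closing gives a quick per-scaffold view of how many N's were replaced.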

Assembly based on longer PacBio reads, as was done for the budgerigar genome, is probably a way out of these poor assemblies. Genes like FOXP2 were found to be fragmented in assemblies that did not use the PBcR (PacBio) reads. These fragmented genes could be recovered in the assemblies that used PBcR reads. However, a few regions needed a combination of different technologies to be recovered. It's not surprising that NGS assemblies have so many missing bits when even the "completed" human genome is still being updated 13 years after being published. Advances not only in sequencing technology but also in our understanding of the diversity in the genome and its representation will probably change the way we think of a genome assembly in the future.

Monday, September 24, 2012

Christian analogies in biology

It's amusing at times to find Christian analogies in biology (CAB) being misinterpreted and misused, to the amusement of some and the frustration of others. I hope many more such analogies are used in the future, leading to much more bemusement. The interrelated nature of religions means that these words are claimed by multiple religious groups.

The purpose of this post is just to list these analogies to serve as some sort of a mnemonic. Think of them as Bible analogies in biology (BAB) or Christian analogies in biology (CAB) or whatever else you fancy.
  1. Mitochondrial Eve and Y-chromosomal Adam: Used to denote the most recent common ancestors (MRCA) of the mitochondrial and Y-chromosomal lineages, respectively. Based on Adam and Eve, the first humans in the Bible.
  2. Lazarus taxon: Used to denote species that are missing from the fossil record for some time and then reappear at a later stage. Based on Lazarus, who is restored to life by Jesus after being dead in the Gospel of John.
Any more?

RNA-seq challenges and strategies

RNA-seq is being used by a diverse group of users to investigate very diverse problems. De novo assembly has been seen as an easy-to-use and effective way to utilize the power of RNA sequencing in non-model organisms. While genome assembly programs have been investigated by various reviews and by efforts such as Assemblathon and GAGE, de novo transcriptome assemblers have not been investigated in great detail. We looked at de novo transcriptome assemblies generated by assemblers for Illumina reads from many perspectives. Apart from the obvious characterization of structural errors, we considered other factors such as sequencing error, polymorphism (pi) and paralogs, to name a few.

Using real data would be ideal for testing programs. In this case, however, it becomes difficult to distinguish artifacts from novel biology, so simulation is a good way to validate the assemblies and to quantify errors of various types.
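The appeal of simulation is that the truth is known: every read comes from a known position and every error is one we put there. Here is a toy Python sketch of that idea, assuming uniform sampling along a single transcript and substitution errors only. It is not the simulator used in the paper, and `simulate_reads` and its parameters are hypothetical names chosen for illustration.

```python
import random


def simulate_reads(transcript, read_length, n_reads, error_rate, seed=0):
    """Sample reads from a known transcript and add substitution errors.

    Returns a list of (start, read, n_errors) tuples, so that anything an
    assembler does with the reads can be checked against the known truth.
    """
    if read_length > len(transcript):
        raise ValueError("read_length is longer than the transcript")
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(transcript) - read_length + 1)
        bases = list(transcript[start:start + read_length])
        n_errors = 0
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the other three bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
                n_errors += 1
        reads.append((start, "".join(bases), n_errors))
    return reads
```

Because the start position and error count are stored with each read, a mismatch in an assembled contig can be traced back to either a simulated sequencing error or an assembler artifact, which is exactly what is hard to do with real data.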

The simulation paper on RNA-seq for the special issue on next-gen sequencing is finally available online: Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments.


Tuesday, September 11, 2012

Why is god so fond of beetles?

Beetles form the insect order Coleoptera, which is very rich in diversity. JBS Haldane's quote that god has "an inordinate fondness for beetles" is used to portray this rather astonishing fact. Various studies have tried to identify the reason for the great diversity found in beetles.

Speciation driven by host specificity, in particular herbivory coupled with angiosperm evolution, is one suspected reason. However, this study, based on large-scale phylogenies, has contradicted this line of reasoning.

The higher survival capabilities of the beetle lineages have been attributed to their pre-Cretaceous origins. Was this diversity a result of an adaptive radiation that has been sustained by further diversification into a variety of niches? Does the beetle have a better ability to adapt? What makes it more adaptable?

Do the reasons for the diversification come from predators and the defense mechanisms that beetles had to develop to escape them? Beetles do in fact use mimicry, chemical defenses, strong mandibles, horns or spines to deter predators.

Sex chromosomes of beetles have interesting biology and could play a role in generating diversity. Frequent changes in sex determination mechanisms, as have been seen in many insect groups, could be important.

Could the genome shed some light on the reasons? The genome of the red flour beetle (Tribolium castaneum) has been sequenced. The presence of genes involved in detoxification, development, cell-cell communication and other aspects of interacting with diverse environments might explain the extraordinary ability of beetles to adapt.

Haldane, on the other hand, was a self-professed atheist who wrote: "My practice as a scientist is atheistic. That is to say, when I set up an experiment I assume no god, angel or devil is going to interfere with its course... I should therefore be intellectually dishonest if I were not also atheistic in the affairs of the world." So maybe he was making a more complicated point which we mere mortals have failed to comprehend?

Tuesday, September 4, 2012

Cold dog Stockholm Sweden

The lion statue in Stockholm, although not famous, is known to people who visit the palace. The cold dog with rags draped around it, squatting in a corner, goes almost unnoticed.

Cold dog Stockholm, Sweden
Even though it's not as majestic as the lion or one of the kings who ruled Sweden, it's amazing in its own way. The rags are made of metal, but at first glance they look very realistic. It is at home in the cold winters, but even in the summer it is not out of place.

Get duplicate fasta sequences

ALLPATHS-LG had the DEDUP option turned off by default prior to release 42726. So this bit of code identifies exact duplicates on either the same or the opposite strand. Many of my runs had 30 to 60 Kb of duplicated sequence, with almost equal amounts on both strands.

 #!/usr/bin/perl
 use strict;
 use warnings;

 # Usage: perl <this script> assembly.fasta
 # Prints tab-separated pairs of headers whose sequences are exact duplicates,
 # followed by the sequence length; "reverse" marks reverse-complement matches.
 my $file = $ARGV[0] or die "Usage: $0 <fasta>\n";
 open(my $fasta, '<', $file) or die "Cannot open $file: $!";

 my %seqs;             # sequence => {GENENAME => header, LEN => length}
 my $header = "";      # header of the record being read
 my $seqst_temp = "";  # sequence of the record being read

 while (my $line = <$fasta>) {
     if ($line =~ /^>/) {
         if ($header) {
             # The previous record is complete: compare it with the sequences seen so far
             if (exists $seqs{$seqst_temp}) { print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n"; }
             my $rseqst_temp = revcomp($seqst_temp);
             if (exists $seqs{$rseqst_temp}) { print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n"; }
             $seqs{$seqst_temp} = { GENENAME => $header, LEN => length $seqst_temp };
         }
         chomp $line;
         $header = $line;
         $seqst_temp = "";
     }
     else {
         # Strip whitespace, digits and underscores from sequence lines
         $line =~ s/[\s_0-9]//g;
         $seqst_temp .= $line;
     }
 } # end of while loop

 # The last record has no following header, so it is checked here
 if ($header) {
     if (exists $seqs{$seqst_temp}) { print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n"; }
     my $rseqst_temp = revcomp($seqst_temp);
     if (exists $seqs{$rseqst_temp}) { print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n"; }
     $seqs{$seqst_temp} = { GENENAME => $header, LEN => length $seqst_temp };
 }
 close $fasta;

 # Return the reverse complement of a DNA sequence
 sub revcomp {
     my $input = shift;
     my $revcomp = reverse($input);
     $revcomp =~ tr/ACGTacgt/TGCAtgca/;
     return $revcomp;
 }

Although this does the job, it's ugly in having the same bit of code used twice, once within the while loop and once after it. Maybe the end of file can be handled more elegantly.
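One possible way around the repetition is to separate reading records from checking them: a reader that yields complete records deals with the end of file in one place, and the duplicate check is then written only once. Below is a rough sketch of that idea, written in Python rather than Perl. The names `read_fasta` and `find_duplicates` are made up for this example, and the "same"/"reverse" labels are my own addition to the output.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs; the final record is emitted at end of input."""
    header, chunks = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Same cleaning as the Perl script: drop whitespace, digits, underscores.
            chunks.append(re.sub(r"[\s_0-9]", "", line))
    if header is not None:
        yield header, "".join(chunks)


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_duplicates(records):
    """Return (earlier header, later header, length, strand) for exact duplicates."""
    seen = {}  # sequence -> header of the most recent record with that sequence
    hits = []
    for header, seq in records:
        if seq in seen:
            hits.append((seen[seq], header, len(seq), "same"))
        rc = revcomp(seq)
        if rc in seen:
            hits.append((seen[rc], header, len(seq), "reverse"))
        seen[seq] = header
    return hits
```

With this split, `find_duplicates(read_fasta(open("contigs.fasta")))` (the file name is a placeholder) does the whole job, and the comparison logic no longer needs to know where the file ends.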