Nagarjun's blog: 2012

Tuesday, October 16, 2012

Plant MicroRNA validation

The identification and validation of plant microRNAs have been summarized by Meyers et al. in the article Criteria for Annotation of Plant MicroRNAs. They build upon the previous article by Ambros et al., which set out expression and biogenesis criteria.

The expression criteria were:

Hybridization of a "specific" RNA probe to a size selected RNA sample, using a method like Northern blotting.
Identification of the "specific" RNA sequence in a size-selected cDNA library. The sequence is expected to match the genomic sequence of the organism from which it was cloned. Various sequencing technologies have been used to generate size-selected cDNA libraries.

The biogenesis criteria were:

Structure prediction should support a fold-back precursor that contains the "specific" miRNA sequence within one arm of the hairpin. Moreover, this hairpin should be the lowest free energy fold predicted by an RNA-folding program such as mfold "and must include at least 16 bp involving the first 22 nt of the miRNA and the other arm of the hairpin". Apart from meeting these criteria, the hairpin should also be free of any internal loops or large asymmetric bulges.
The "specific" sequence and its predicted precursor fold-back secondary structure should be conserved.
Increased accumulation of precursor in organisms with reduced Dicer function.
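The "at least 16 bp involving the first 22 nt" rule can be checked mechanically once a folding program has produced a structure. The following is only a minimal sketch, assuming the structure is available as a dot-bracket string (the notation used by many folding tools) and that the miRNA start is known as a 0-based offset. The sub name paired_in_mirna and the toy hairpin are made up for illustration and are not taken from either article.

```perl
use strict;
use warnings;

# Count how many of the first $window nt of a miRNA (0-based $start) are
# paired with a base outside that window in a dot-bracket structure.
sub paired_in_mirna {
    my ($structure, $start, $window) = @_;
    $window = 22 unless defined $window;
    my @chars = split //, $structure;
    my (@stack, %partner);
    for my $i (0 .. $#chars) {
        if ($chars[$i] eq '(') {
            push @stack, $i;
        }
        elsif ($chars[$i] eq ')') {
            die "unbalanced structure\n" unless @stack;
            my $j = pop @stack;
            $partner{$i} = $j;
            $partner{$j} = $i;
        }
    }
    die "unbalanced structure\n" if @stack;
    my $end = $start + $window - 1;
    $end = $#chars if $end > $#chars;
    my $count = 0;
    for my $i ($start .. $end) {
        next unless exists $partner{$i};
        my $p = $partner{$i};
        $count++ if $p < $start || $p > $end;   # partner lies outside the miRNA window
    }
    return $count;
}

# Toy hairpin (hypothetical): 22 paired bases, a 4 nt loop, 22 paired bases
my $hairpin = ('(' x 22) . '....' . (')' x 22);
my $pairs = paired_in_mirna($hairpin, 0);
print "paired bases in miRNA window: $pairs\n";
print $pairs >= 16 ? "meets the 16 bp threshold\n" : "below the 16 bp threshold\n";
```

This only covers the base-pair count. The checks on internal loops, asymmetric bulges and free energy would need the folding program's own output.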

Meeting any of these criteria individually is not sufficient for a sequence to be considered a miRNA, as even siRNAs meet the expression criteria, and the biogenesis criteria are not specific to miRNAs. Hence, both expression and biogenesis criteria are required for proper validation of a miRNA as per Ambros et al. The more recent (2008) set of guidelines by Meyers et al. tries to utilize the knowledge gained by studies in the 5 years since the initial criteria by Ambros et al. in 2003.

The criteria set up by Meyers et al. are grouped into primary and ancillary criteria, along with various precautions to be taken. The article also has information about assigning miRNAs to families.

Primary criteria include the presence of the miRNA sequence in a cDNA library, its validation by hybridization experiments, and adherence to the expected stem-loop structure. Datasets with low coverage are to be treated with caution, as there is a chance of mistaking an siRNA for a miRNA.

While ancillary criteria can support a miRNA candidate that meets the primary criteria, they are considered neither necessary nor sufficient for a prediction. However, "clear" conservation is generally sufficient to annotate a miRNA, as long as it has satisfied the primary criteria in the organism in which it has a homolog. Other criteria such as miRNA target prediction, DCL1 dependence, and independence from RDR, PolIV and PolV are useful for obtaining information on biological function but are not enough to validate a sequence as a miRNA.

Recently, many new automated pipelines have been published for miRNA identification, validation, annotation and target prediction.

miRTour has an easy-to-use web interface for uploading ESTs/contigs, but it has a 50Mb limit on the size of the dataset that can be uploaded.
PIPmiR (Pipeline for the Identification of Plant miRNAs) provides an executable that can be downloaded and used. It has been used to predict both known and novel miRNAs in Arabidopsis.
shortran: a pipeline for small RNA-seq data analysis is also available for download.
miRDeep-P is a version of miRDeep modified for use with plant transcriptomes.
sRNA toolkit is a more general solution that identifies not only miRNAs but also other small RNAs.

The availability of numerous automated pipelines for plant miRNAs can be useful, but at the same time it introduces its own set of artifactual problems.

Monday, October 15, 2012

Impossible genomes?

A few genomes are difficult to assemble (with the current state of technology) due to their bizarre characteristics. The high cost of Sanger sequencing, construction of fosmid or BAC libraries, flow sorting of chromosomes and other wonderful methods puts them out of reach for many projects, and assembling genomes with NGS methods alone is difficult. However, even Sanger-based methods struggle with some genomes, which are almost impossible to assemble to "completion". Genomes can be difficult for reasons such as:

Large size of genome: A genome that is very large (has many, many bases) is difficult to sequence, mainly due to the higher costs involved in generating sufficient coverage. The largest known vertebrate genome is that of the lungfish with a size of 133 Gb, and the canopy plant has the largest known plant genome with a size of 150 Gb. An amoeboid, Polychaos dubium, might have the largest genome of all with a size of 670 Gb. Larger amounts of data are difficult to handle bioinformatically. In fact, most assemblers would be unable to handle the large amounts of data associated with these genomes. Moreover, these genomes are thought to be filled with repeats and genome duplication events, making their assembly even more complicated.
Repeat content of genome: Certain genomes are known to have very high transposon activity making them rich with repetitive content. These genomes need not be large, but can still be difficult to assemble due to the almost identical copies of DNA prevalent in the genome.
Extremes in GC content: Certain genomes are known to have very high or very low GC content. This makes them difficult to sequence due to the bias involved in NGS methods. Although GC content extremes are constrained by the requirements imposed by the genetic code, Streptomyces coelicolor manages to have a GC content of 72% while Plasmodium falciparum has just 20%. Apart from extremes in genome-wide GC content, parts of a genome can have extremes in GC content, making them difficult to sequence.
Rarity of sample: Some organisms are so rare that it's almost as if they were extinct. Finding such species and obtaining enough DNA from them can be almost impossible. The situation is made more complicated by various legal, ethical and technical issues. Rarity of sample could also be a result of the amazingly tiny amounts of DNA available in certain species. Cultivation of many microbial species in the lab is not yet possible, and obtaining enough DNA from such species has driven research in the field of metagenomics and, more recently, single-cell sequencing. DNA from extinct species is of lower quality and filled with many artifacts, making correct assembly of genomes a daunting task. However, many of these problems have been overcome by novel methods, and extinct species such as the mammoth, Neanderthals and Denisovans have been sequenced and assembled to a quality comparable to that of other NGS genome assemblies.
Genome definition inconsistencies: To be able to assemble the genome of a certain species, it should be possible to define the species and what constitutes its genome. Symbiotic organisms can be difficult to delineate into distinct species due to their high degree of inter-dependence. The definition of species is a controversial subject, and different interpretations of these definitions make it controversial to claim sequencing of a particular "species".
Dynamic nature of genomes: Genomes are generally thought of as stable inherited genetic material which remains exactly the same over short periods of time. However, genomes have many dynamic features. Telomeres change in length with age in almost all species. Similarly, small viral genomes with very high mutation rates can change drastically within the span of a few hours, making them completely immune to a drug or conferring a new phenotype. Such changes require re-sequencing of the genome to identify what has changed. Species with different numbers of chromosomes along a cline are another type of dynamism.
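For the GC content point in the list above, both the genome-wide value and local extremes are easy to inspect. This is a minimal sketch, assuming the sequence is already in memory as a plain string. The sub names gc_fraction and gc_windows, the window size and the toy sequence are made up for illustration.

```perl
use strict;
use warnings;

# Fraction of G/C among unambiguous bases (A, C, G, T); N and other symbols are ignored
sub gc_fraction {
    my ($seq) = @_;
    my $gc = ($seq =~ tr/GCgc//);   # tr with an empty replacement only counts
    my $at = ($seq =~ tr/ATat//);
    my $total = $gc + $at;
    return $total ? $gc / $total : 0;
}

# GC fraction in consecutive non-overlapping windows of $size bases
sub gc_windows {
    my ($seq, $size) = @_;
    my @fractions;
    for (my $pos = 0; $pos < length($seq); $pos += $size) {
        push @fractions, gc_fraction(substr($seq, $pos, $size));
    }
    return @fractions;
}

# Toy sequence (hypothetical): an AT-only half followed by a GC-only half
my $toy = "ATATATATGCGCGCGC";
printf "overall GC: %.2f\n", gc_fraction($toy);
printf "window GC: %s\n", join(", ", map { sprintf "%.2f", $_ } gc_windows($toy, 8));
```

The toy sequence has an unremarkable overall GC content of 0.50 even though its two halves sit at 0.00 and 1.00, which is the kind of local extreme that a genome-wide figure hides.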

Can other genomes be difficult to assemble for other reasons?

Wednesday, October 10, 2012

5 rules of pollex about genomics

Ewan Birney wrote a post on 10 rules of thumb in genomics almost a year ago. They are very true and will probably remain so for "some" more time. Here are a few more rules of pollex, in no particular order:

Trust the box, but know that the box is not perfect. Many steps in genomics require us to trust a black box, be it genome assemblers (which have way too much code to inspect) or simple read trimming programs. Moreover, manual verification of all data is impossible due to the sheer size of the data.
Default values are not always the best values. Almost all bioinformatics tools have half a dozen parameters with some default value set. However, these default values might not be suitable for your data or analysis. Not being aware of the various options can have unpredictable results.
New programs keep coming up every day. Unless the robustness of these programs has been tested by independent users, using them can be tricky.
Changes in technology are very fast. Methods keep getting better all the time and can add more to an analysis. A bit contradictory to the previous point.
Validate your results by various other methods. Artifacts are not unheard of in NGS data, and require validation from other methods.

Tuesday, September 25, 2012

The story of missing genes

The recent explosion in the number of genomes being sequenced has led to interesting experiments that use genome assemblies to test hypotheses about gene loss and gain, pathway enrichment, gene repertoire evolution and so on. However, many reviews that have analysed NGS assemblies have shown that genome assembly quality and annotation methodology determine gene content.

Human genome assemblies generated with NGS data are completely missing more than 83 genes available in the Sanger genome assemblies. While no single functional category of genes is missing from the assemblies, genes located in regions that are difficult to assemble are often lost. MHC genes are notorious for being difficult to assemble. Analysis of the MHC region using draft assemblies can lead to underestimating gene content or predicting the loss of genes in a lineage due to poor assembly quality.

Various programs have been developed to find these "missing genes" that are either lost in gaps in the assembly, too fragmented to be readily recognizable, or even misassembled. For example, the IMAGE genome assembler tries to find the parts that are lost in gaps. SOAP GapCloser also closes gaps in the assembly. GapFiller, from the makers of SSPACE, claims to be superior to both IMAGE and SOAP GapCloser in its ability to predict gap sizes and to use multiple libraries without a corresponding increase in memory usage. Being able to replace N's in genome assemblies with the "correct bases" is very useful for recovering various features of the genome which were essentially lost in gaps.
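Before and after running any gap closing tool, it helps to know where the N's are. The following is a minimal sketch that lists every run of N in a sequence string. The sub name n_runs and the toy scaffold are made up for illustration, and none of this reflects how IMAGE, GapCloser or GapFiller work internally.

```perl
use strict;
use warnings;

# Return [start, length] (0-based start) for every run of N/n in a sequence
sub n_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /(N+)/gi) {
        push @runs, [ $-[1], length($1) ];   # $-[1] is the start offset of the match
    }
    return @runs;
}

# Toy scaffold (hypothetical) with two gaps
my $scaffold = "ACGTNNNNNACGTNNACGT";
for my $run (n_runs($scaffold)) {
    print "gap at $run->[0], length $run->[1]\n";
}
```

Comparing the output for the same scaffold before and after gap closing gives a quick view of which gaps were filled and which only changed in size.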

Assembly of the budgerigar genome based on longer PacBio reads is probably a way out of these poor assemblies. Genes like FOXP2 were found to be fragmented in assemblies that did not use the PBcR (PacBio) reads. These fragmented genes could be recovered in the assemblies that used PBcR reads. However, a few regions needed a combination of different technologies to be recovered. It's not surprising that NGS assemblies have so many missing bits when even the "completed" human genome is still being updated 13 years after being published. Advances not only in sequencing technology but also in our understanding of the diversity in the genome and its representation will probably change the way we think of a genome assembly in the future.

Monday, September 24, 2012

Christian analogies in biology

It's amusing at times to find Christian analogies in biology (CAB) being misinterpreted and misused, to the amusement of some and the frustration of others. I hope many more such analogies are used in the future, leading to much more bemusement. The inter-related nature of religions means that these words are claimed by multiple religious groups.

The purpose of this post is just to list these analogies to serve as some sort of a mnemonic. Think of them as bible analogies in biology (BAB) or Christian analogies in biology (CAB) or whatever else you fancy.

Mitochondrial Eve and Y-chromosomal Adam: Used to denote the MRCA based on the mitochondria and Y-chromosome. Based on Adam and Eve, the first humans in the Bible.
Lazarus taxon: Used to denote species that are missing from fossil records for some time and then re-appear at a later stage. Based on Lazarus, who is restored to life by Jesus after being dead in the Gospel of John.

Any more?

RNA-seq challenges and strategies

RNA-seq is being used by a diverse group of users for investigating very diverse problems. De novo assembly has been seen as an easy-to-use and effective way to utilize the power of RNA sequencing in non-model organisms. While genome assembly programs have been investigated by various reviews and by efforts such as Assemblathon and GAGE, de novo transcriptome assemblers have not been investigated in great detail. We looked at de novo transcriptome assemblies generated by Illumina assemblers from many perspectives. Apart from the obvious structural error characterization, we considered other factors such as sequencing error, polymorphism (pi) and paralogs, to name a few.

Using real data would be ideal for testing programs. However, with real data it becomes difficult to distinguish artifacts from novel biology, so simulation is a good way to perform some sort of validation and to quantify errors of various types.

The simulation paper on RNA-seq for the special issue on Next-gen sequencing is finally available online: Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments.

Tuesday, September 11, 2012

Why god is so fond of beetles?

Beetles form the insect order Coleoptera, which is very rich in diversity. JBS Haldane's quote "that god has an inordinate fondness for beetles" is used to portray this rather astonishing fact. Various studies have been done to identify the reason for the greater diversity found in beetles.

Speciation driven by host specificity, in particular herbivory and the evolution of angiosperms, is one suspected reason. However, this study has contradicted this line of reasoning based on large-scale phylogenies.

The higher survival capability of the beetle lineages has been attributed to their pre-Cretaceous origins. Was this diversity a result of an adaptive radiation that has been sustained by further diversification into a variety of niches? Does the beetle have a better ability to adapt? What makes it more adaptable?

Do the reasons for the diversification come from predators and the defense mechanisms that beetles had to develop to escape them? In fact, beetles use mimicry, chemical defenses, strong mandibles, horns or spines to deter predators.

Sex chromosomes of beetles have interesting biology and could play a role in generating diversity. Frequent changes in sex determination mechanisms, as seen in many insect groups, could be important.

Could the genome shed some light on the reasons? The genome of the red flour beetle (Tribolium castaneum) has been sequenced. Presence of genes involved in detoxification, development, cell-cell communication and various other genes involved in ability to interact with diverse environments might explain the extraordinary ability of the beetles to adapt.

On the other hand, Haldane was a self-professed atheist who said: "My practice as a scientist is atheistic. That is to say, when I set up an experiment I assume no god, angel or devil is going to interfere with its course... I should therefore be intellectually dishonest if I were not also atheistic in the affairs of the world." So maybe he was making a more complicated point which we mere mortals have failed to comprehend?

Tuesday, September 4, 2012

Cold dog Stockholm Sweden

The lion statue in Stockholm, although not famous, is known to people who visit the palace. The cold dog with rags draped around it, squatting in a corner, goes almost unseen.

Cold dog Stockholm, Sweden

Even though it's not as majestic as the lion or one of the kings who ruled Sweden, it's amazing in its own way. The rags are made of metal, but at first look they seem very realistic. It is at home in the cold winters, but even in the summer it is not out of place.

Get duplicate fasta sequences

ALLPATHS-LG had the DEDUP option turned off by default prior to release 42726. So this bit of code identifies exact duplicates on either the same or the opposite strand. Many of my runs had 30 to 60 Kb of duplicated sequence, with almost equal amounts on both strands.

 #!/usr/bin/perl
 use strict;
 use warnings;
 # Input parameters: FASTA file as the first argument
 open(my $fasta, '<', $ARGV[0]) or die $!;
 my %seqs;             # sequence => { GENENAME => header, LEN => length }
 my $header = "";
 my $seqst_temp = "";
 while (my $line = <$fasta>) {
     if ($line =~ /^>/) {
         # A new header means the previous record is complete, so check it
         if ($header) {
             if (exists $seqs{$seqst_temp}) { print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n"; }
             my $rseqst_temp = revcomp($seqst_temp);
             if (exists $seqs{$rseqst_temp}) { print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n"; }
             $seqs{$seqst_temp}{GENENAME} = $header;
             $seqs{$seqst_temp}{LEN} = length $seqst_temp;
         }
         chomp $line;
         $header = $line;
         $seqst_temp = "";
     }
     else {
         # Strip whitespace, underscores and digits from sequence lines
         $line =~ s/[\s_0-9]//g;
         $seqst_temp .= $line;
     }
 } # end of while loop
 # The last record is not followed by a header, so check it here
 if ($header) {
     if (exists $seqs{$seqst_temp}) { print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n"; }
     my $rseqst_temp = revcomp($seqst_temp);
     if (exists $seqs{$rseqst_temp}) { print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n"; }
     $seqs{$seqst_temp}{GENENAME} = $header;
     $seqs{$seqst_temp}{LEN} = length $seqst_temp;
 }
 close $fasta;
 # Reverse complement of a DNA sequence
 sub revcomp {
     my $input = shift;
     my $revcomp = reverse($input);
     $revcomp =~ tr/ACGTacgt/TGCAtgca/;
     return $revcomp;
 }

Although this does the job, it's ugly in having the same bit of code used twice, once within the while loop and once after it. Maybe the end of file can be handled more elegantly.
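One possible way to avoid the repeated block is to read a whole FASTA record at a time by changing Perl's input record separator, so the last record is handled inside the loop like any other. This is only a sketch under that assumption, not a tested replacement for the script above. The sub name find_duplicates is made up, and it returns the report lines instead of printing them directly.

```perl
use strict;
use warnings;

# Reverse complement of a DNA sequence
sub revcomp {
    my $rc = scalar reverse $_[0];
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Read one FASTA record per iteration by using "\n>" as the record separator,
# so no special handling is needed after the loop for the last record.
sub find_duplicates {
    my ($fh) = @_;
    local $/ = "\n>";
    my (%seen, @report);
    while (my $record = <$fh>) {
        chomp $record;                 # removes the trailing "\n>" if present
        $record =~ s/^>//;             # only the first record still starts with ">"
        next unless $record =~ /\S/;
        my ($header, @lines) = split /\n/, $record;
        $header =~ s/\s+$//;
        $header = ">" . $header;
        my $seq = join "", @lines;
        $seq =~ s/[\s_0-9]//g;         # same clean-up as the original script
        my $len = length $seq;
        push @report, "$seen{$seq}\t$header\t$len" if exists $seen{$seq};
        my $rc = revcomp($seq);
        push @report, "$seen{$rc}\t$header\t$len\treverse" if exists $seen{$rc};
        $seen{$seq} = $header;
    }
    return @report;
}

# Run on a file only when one is given on the command line
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die $!;
    print "$_\n" for find_duplicates($fh);
    close $fh;
}
```

The trade-off is that each record is held in memory as one string before being split, which should be fine for contigs but is worth keeping in mind for very long sequences.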

Wednesday, March 28, 2012

MD5 file list checker

This program reads a file in which each line holds an MD5 value and a filename (with its correct path), separated by a tab character, and checks whether the MD5 of each file is correct.

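#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Each input line: expected MD5 (hex), then the file path
open(my $list, '<', $ARGV[0]) or die $!;
while (my $line = <$list>) {
    chomp $line;
    next unless $line =~ /\S/;                  # skip blank lines
    my @files = split(/[ \t]+/, $line, 2);      # limit of 2 keeps spaces in the path
    open(my $fh, '<', $files[1]) or die "Can't find file $files[1]\n";
    binmode($fh);                               # digest the raw bytes
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    # Compare as strings, ignoring case, rather than as a regex
    if (lc($digest) ne lc($files[0])) { print "Md5 does not match for file:$files[1]\n"; }
    else {
        print "Md5 match for file:$files[1]\n";
        print $digest . "\n" . $files[0] . "\n";
    }
}
close($list);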

It would make sense to have MD5 checks integrated into the OS and to have an MD5 list in each folder. Maybe it will get added to the code of various programs like ftp, sftp or even ordinary copy and mv.

Sort fasta file

Many programs like GATK require the fasta files to be sorted before use. Here is a rather simple script for the job:

 #!/usr/bin/perl
 use strict;
 use warnings;
 open(my $fasta, '<', $ARGV[0]) or die $!;
 my %seqs;          # header => sequence
 my $header = "";
 my $temp = "";
 while (my $line = <$fasta>) {
     if ($line =~ /^>/) {
         # Store the previous record before starting a new one
         if ($header) { $seqs{$header} = $temp; }
         chomp $line;
         $line =~ s/\s/_/g;    # replace whitespace in the header with underscores
         $header = $line;
         $temp = "";
     }
     else { $line =~ s/[\s_0-9]//g; $temp .= $line; }
 } # end of while loop
 if ($header) { $seqs{$header} = $temp; }
 close $fasta;
 # Print the records sorted alphabetically by header
 foreach my $sortemp (sort keys %seqs) {
     print "$sortemp\n";
     print "$seqs{$sortemp}\n";
 }

However, you can find more elegant solutions that use BioPerl at the Wolf/Takebayashi lab.