Nagarjun's blog: October 2012

Tuesday, October 16, 2012

Plant MicroRNA validation

The identification and validation of plant microRNAs has been summarized by Meyers et al. in the article Criteria for Annotation of Plant MicroRNAs. They build upon the previous article by Ambros et al., which laid out expression and biogenesis criteria.

The expression criteria were:

Hybridization of a "specific" RNA probe to a size selected RNA sample, using a method like Northern blotting.
Identification of the "specific" RNA sequence in a size selected cDNA library. The sequence is expected to match the genomic sequence of the organism from which it was cloned. Various sequencing technologies have been used to generate size selected cDNA libraries.

The biogenesis criteria were:

Structure prediction should support a fold-back precursor that contains the "specific" miRNA sequence within one arm of the hairpin. Moreover, this hairpin should be the folding alternative with the lowest free energy as per an RNA-folding program like mfold "and must include at least 16 bp involving the first 22 nt of the miRNA and the other arm of the hairpin". Apart from meeting these criteria, the hairpin should also be free of large internal loops and large asymmetric bulges.
The "specific" sequence and its predicted precursor fold-back secondary structure should be conserved.
Increased accumulation of precursor in organisms with reduced Dicer function.
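
The pairing requirement in the first biogenesis criterion can be checked mechanically once a folding program has produced a structure. Below is a minimal sketch in Python, assuming the predicted precursor structure is available as a dot-bracket string (the notation used by many RNA-folding tools) and that the 0-based start of the miRNA within the precursor is known. The function names and the choice to count only pairs whose partner lies outside the miRNA window are my own assumptions for illustration, not part of the Ambros et al. criteria, and the check says nothing about free energy or bulges.

```python
def pair_table(structure):
    """Map every paired position to its partner for a dot-bracket string."""
    partners = {}
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            j = stack.pop()
            partners[i] = j
            partners[j] = i
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return partners


def paired_bases_in_mirna(structure, mirna_start, window=22):
    """Count positions in the first `window` nt of the miRNA that pair
    with a base outside the miRNA itself (taken here as the other arm)."""
    mirna_end = mirna_start + window
    if mirna_start < 0 or mirna_end > len(structure):
        raise ValueError("miRNA window falls outside the precursor")
    partners = pair_table(structure)
    count = 0
    for i in range(mirna_start, mirna_end):
        j = partners.get(i)
        # Ignore unpaired bases and pairs formed within the miRNA window.
        if j is not None and not (mirna_start <= j < mirna_end):
            count += 1
    return count


def meets_pairing_criterion(structure, mirna_start, min_pairs=16, window=22):
    """True if at least `min_pairs` of the first `window` nt are paired."""
    return paired_bases_in_mirna(structure, mirna_start, window) >= min_pairs


if __name__ == "__main__":
    # Hypothetical structures, made up purely for illustration.
    perfect = "(" * 25 + "." * 6 + ")" * 25
    bulged = ("(((((.....(((((.....(((((" + "......"
              + "))))).....))))).....)))))")
    print(paired_bases_in_mirna(perfect, 2), meets_pairing_criterion(perfect, 2))
    print(paired_bases_in_mirna(bulged, 0), meets_pairing_criterion(bulged, 0))
```

A candidate that passes this check still has to satisfy the rest of the criteria; the sketch only automates the "at least 16 bp" count.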

Meeting any one of these criteria on its own is not sufficient for a sequence to be considered a miRNA: even siRNAs meet the expression criteria, and the biogenesis criteria are not specific to miRNAs. Hence, both expression and biogenesis criteria are required for proper validation of a miRNA as per Ambros et al. The more recent (2008) set of guidelines by Meyers et al. tries to utilize the knowledge gained from studies in the 5 years since the initial criteria were published by Ambros et al. in 2003.

The criteria set up by Meyers et al. are grouped under primary and ancillary criteria, with various precautions to be taken. The article also has information about assigning miRNAs to families.

Primary criteria include the presence of the miRNA sequence in a cDNA library, its validation by hybridization experiments, and adherence to the expected stem-loop structure. Datasets with low coverage are to be treated with caution, as they carry a risk of mistaking a siRNA for a miRNA.

While ancillary criteria can support a miRNA candidate that meets the primary criteria, they are considered neither necessary nor sufficient for a prediction. However, "clear" conservation is generally sufficient to annotate a miRNA, as long as it has satisfied the primary criteria in the organism in which it has a homolog. Other criteria such as miRNA target prediction, DCL1 dependence, and independence from RDR, PolIV and PolV are useful for obtaining biological function information but are not enough to validate a sequence as a miRNA.

Recently, many new pipelines have been published for automating the process of miRNA identification, validation, annotation and target prediction.

miRTour has an easy to use web interface for uploading ESTs/contigs, but it has a 50 Mb limit on the size of the dataset that can be uploaded.
PIPmiR (Pipeline for the Identification of Plant miRNAs) provides an executable that can be downloaded and used. It has been used to predict both known and novel miRNAs in Arabidopsis.
shortran: a pipeline for small RNA-seq data analysis is also available for download.
miRDeep-P is a version of miRDeep modified for use with plant transcriptomes.
sRNA toolkit is a more general solution that identifies not only miRNAs but also other small RNAs.

The availability of numerous automated pipelines for plant miRNAs can be useful, but at the same time these pipelines can introduce their own set of artifacts.

Monday, October 15, 2012

Impossible genomes?

A few genomes are difficult (with the current state of technology) to assemble due to their bizarre characteristics. The high cost of Sanger sequencing, construction of fosmid or BAC libraries, flow sorting of chromosomes and other wonderful methods means that many projects have to rely on NGS methods alone, which makes assembly difficult. However, even Sanger based methods struggle with some genomes that are almost impossible to assemble to "completion". Genomes can be difficult due to reasons such as:

Large size of genome: A genome that is very large (has many, many bases) is difficult to sequence, mainly due to the higher costs involved in generating sufficient coverage. The largest known vertebrate genome is that of the lungfish, with a size of 133 Gb, and the canopy plant has the largest known plant genome, with a size of 150 Gb. An amoeboid, Polychaos dubium, might have the largest genome of all, with a size of 670 Gb. Larger amounts of data are difficult to handle bioinformatically. In fact, most assemblers would be unable to handle the large amounts of data associated with these genomes. Moreover, these genomes are thought to be filled with repeats and genome duplication events, making their assembly even more complicated.
Repeat content of genome: Certain genomes are known to have very high transposon activity, making them rich in repetitive content. These genomes need not be large, but can still be difficult to assemble due to the almost identical copies of DNA prevalent in the genome.
Extremes in GC content: Certain genomes are known to have very high or very low GC content. This makes them difficult to sequence due to the GC bias involved in NGS methods. Although GC content extremes are constrained by the requirements imposed by the genetic code, Streptomyces coelicolor manages to have a GC content of 72% while Plasmodium falciparum has just 20%. Apart from extremes in genome-wide GC content, parts of a genome can have extremes in GC content, making those regions difficult to sequence.
Rarity of sample: Some organisms are so rare that it's almost as if they were extinct. Finding such species and obtaining enough DNA from them can be almost impossible. The situation is made more complicated by various legal, ethical and technical issues. Rarity of sample could also be a result of the amazingly tiny amounts of DNA available in certain species. Cultivation of many microbial species in the lab is not yet possible, and obtaining enough DNA from such species has driven research in the field of metagenomics and, more recently, single cell sequencing. DNA from extinct species is of lower quality and filled with many artifacts, making correct assembly of genomes a daunting task. However, many of these problems have been overcome by novel methods, and extinct species such as the mammoth, Neanderthals, Denisovans and others have been sequenced and assembled to a quality comparable to that of other NGS genome assemblies.
Genome definition inconsistencies: To be able to assemble the genome of a certain species, it should be possible to define the species and what constitutes its genome. Symbiotic organisms can be difficult to delineate into distinct species due to their high degree of inter-dependence. The definition of species is a controversial subject, and different interpretations of these definitions make it controversial to claim sequencing of a particular "species".
Dynamic nature of genomes: Genomes are generally thought of as stable inherited genetic material which remains exactly the same over short periods of time. However, genomes have many dynamic features. Telomeres change in length with age in almost all species. Similarly, small viral genomes with very high mutation rates can change drastically within the span of a few hours, making the virus completely resistant to a drug or conferring a new phenotype. Such changes require re-sequencing of the genome to identify them. Species with different numbers of chromosomes along a cline are another type of dynamism.
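
Two of the points above lend themselves to quick back-of-the-envelope checks: how many bases must be sequenced to cover a very large genome, and where the GC extremes sit within a sequence. The sketch below is only illustrative. The function names, the 30x coverage figure and the window sizes are assumptions of mine; a real project would pick coverage according to the sequencing technology and assembler.

```python
def bases_required(genome_size_bp, coverage):
    """Total bases to sequence for a given average coverage."""
    return genome_size_bp * coverage


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    # Return 0.0 for sequences without any unambiguous base (e.g. all N).
    return gc / acgt if acgt else 0.0


def windowed_gc(seq, window, step):
    """GC fraction in sliding windows, as (start, fraction) tuples."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # 150 Gb plant genome from the list above, at an assumed 30x coverage.
    print(bases_required(150_000_000_000, 30))
    # Toy sequence with a GC-rich half and an AT-rich half.
    print(windowed_gc("GGGGCCCCAAAATTTT", window=8, step=8))
```

At the assumed 30x, the 150 Gb genome alone needs 4.5 Tb of raw sequence, which gives a feel for why data handling becomes a problem. The windowed GC values can be used to flag regions where coverage is likely to drop because of GC bias.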

Are there other reasons that can make a genome difficult to assemble?

Wednesday, October 10, 2012

5 rules of pollex about genomics

Ewan Birney wrote a post on 10 rules of thumb in genomics almost a year ago. The rules are very true and will probably remain so for "some" more time. Here are a few more rules of pollex, in no particular order:

Trust the box, but know that the box is not perfect. Many steps in genomics require us to trust a black box, be it genome assemblers (which have way too much code to inspect) or simple read trimming programs. Moreover, manual verification of all data is impossible due to the sheer size of the data.
Default values are not always the best values. Almost all bioinformatics tools have half a dozen parameters with some default value set. However, these default values might not be suitable for your data or analysis. Not being aware of the various options can lead to unpredictable results.
New programs keep coming up every day. Unless the robustness of these programs has been tested by independent users, using them can be tricky.
Changes in technology are very fast. Methods keep getting better all the time and can add more to an analysis. A bit contradictory to the previous point.
Validate your results by various other methods. Artifacts are not unheard of in NGS data, so results need confirmation from independent methods.