Sunday, May 26, 2013

Trees of whole genomes, but locally

With increasing use of Next-generation sequencing methods, whole genomes of multiple species as well as closely related strains or populations are being generated. Sophisticated models to interpret this data that use our current knowledge of population dynamics and molecular processes are being used extensively.

The general idea is to be able to distinguish the population histories of various parts of the genome. The phylogenies of different parts of the genome might differ based on various functional constraints. Hence, building a single consensus tree for the entire genome will average over these local effects. To be able to overcome this limitation, many methods are being proposed to identify local phylogenies across the genome.

The program Saguaro, from Science for Life Laboratory and Broad Institute, is capable of using SNP data in VCF format or whole genome assembly alignments in MAF format to partition the genome based on local phylogenetic information. The method and its use on few example datasets is described in Unsupervised genome-wide recognition of local relationship patterns

Get gene start and end from GTF file

In a GTF file that does not have CDS records annotated, its tricky to get the start and end coordinates of a gene. This script does just that and prints a nice BED format file. It also report the genes that occur on multiple scaffolds or chromosomes as errors. These can be greped out using grep -v "Error".
 #!/usr/bin/perl  
 my $gtf1 = $ARGV[0];  
 my %scaf=();  
 my $start=();  
 my %end=();  
 my %multi=();  
 open(GTF1, $gtf1);  
 while($z=<GTF1>){   
 chomp $z;  
 my @parts=split(/\t/,$z);  
 my $start=$parts[3];  
 my $end=$parts[4];  
 my @parts2=split(/\s/,$parts[8]);  
 my $gene=$parts2[1];  
 $gene=~s/[\;\"]//g;  
 ##print "$gene\t$start\t$end\n";  
 if(!(exists $scaf{$gene})){$scaf{$gene}=$parts[0];}  
 elsif(!($scaf{$gene}=~m/$parts[0]/)){$multi{$gene}=1;}  
 if(!(exists $start{$gene})){$start{$gene}=$start;$end{$gene}=$end;}  
 else{$end{$gene}=$end;}  
 }  
 close(GTF1);  
 foreach my $keygene(sort keys %scaf){  
 if($start{$keygene}>$end{$keygene}){$temp=$start{$keygene};$start{$keygene}=$end{$keygene};$end{$keygene}=$temp;}  
 if(!(exists $multi{$keygene})){print "$scaf{$keygene}\t$start{$keygene}\t$end{$keygene}\t$keygene\n";}  
 else{print "Error:Gene $keygene on mutiple scaffolds\n";}  
 }  

Saturday, May 25, 2013

Perl script to filter sites with more than 50 percent sites missing by population

In certain cases, its required to have sites from a VCF file filtered by fraction of genotypes missing by population. While entire sites can be filtered using programs like VCFtools, its not possible to filter them by population. Here its assumed that all individuals from $ARGV[1] to $ARGV[2] are in one population and those from $ARGV[3] to $ARGV[4] are in another.

 #!/usr/bin/perl  
 use List::Util qw[min max];  
 # Input parameters  
 my $vcf_file = $ARGV[0];  
 my $firstmissing=0;  
 my $firstall=$ARGV[2]-$ARGV[1];  
 my $secondmissing=0;  
 my $secondall=$ARGV[4]-$ARGV[3];  
 #Go through fasta file, extract sequences  
 open(IN, $vcf_file);  
 while($z=<IN>) {  
 $firstmissing=0;  
 $secondmissing=0;  
      if($z=~m/^#/){print $z;}  
      else{  
           chomp $z;  
           @values=split(/\t/,$z);  
      $totind=length @values;  
      for($i=$ARGV[1];$i<=$ARGV[2];$i++){if($values[$i]=~m/\.\/\./){$firstmissing++;}}  
      for($i=$ARGV[3];$i<=$ARGV[4];$i++){if($values[$i]=~m/\.\/\./){$secondmissing++;}}  
 if((($secondmissing/$secondall)<0.5)&&(($firstmissing/$firstall)<0.5)){print "$z\n";}  
      }#end of else #  
 }#end of file while  
 close IN;  

Monday, May 20, 2013

Combining SNP data tables with perl

Continuing with our previous exploits Analyzing in vitro toxicity and expression data, its time to combine SNP data tables using perl.

#!/usr/bin/perl

open CANFAM2, $ARGV[0] or die $!;
open CANFAM3, $ARGV[1] or die $!;

%can2=();
%can3=();

while($line = <CANFAM2>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];

if(exists $can2{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can2{$key}=$value;}

}#end of file while loop

while($line = <CANFAM3>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];


if(exists $can3{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can3{$key}=$value;}

}#end of file while loop


print "SNP_ID\tbases\tchr_canfam2\tpos_canfam2\tchr_canfam3\tpos_canfam3\tdistance_between_versions\n";
$transitions=0;
$transversions=0;

foreach $mykey(sort keys %can2){
@can2tabs=split(/\_/,$can2{$mykey});

if($can2tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
elsif($can2tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}

    if(exists $can3{$mykey}){
        @can3tabs=split(/\_/,$can3{$mykey});
            if($can2tabs[0]=~m/$can3tabs[0]/){$distance=abs($can2tabs[1]-$can3tabs[1]);}else{$distance="NA";}
        print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\t$can3tabs[0]\t$can3tabs[1]\t$distance\n";
    }
    else {
    print "Error:SNP_ID $mykey missing in canfam3.\n";
    print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\tNA\tNA\tNA\n";
    }
}

foreach $mykey(sort keys %can3){
    if(!(exists $can2{$mykey})){
    print "Error:SNP_ID $mykey missing in canfam2.\n";
    @can3tabs=split(/\_/,$can3{$mykey});
    print "$mykey\t$can3tabs[2]\tNA\tNA\t$can3tabs[0]\t$can3tabs[1]\tNA\n";

    if($can3tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
    elsif($can3tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}
    }
}
$titv=$transitions/$transversions;
print "Info: $transitions Transistions\n";
print "Info: $transversions Transversions\n";
print "Info: $titv Ti/Tv ratio\n";

close CANFAM2;
close CANFAM3;


To do something bonus, lets calculate transitions and transversions to infer the Ti/Tv ratio.


Info: 1852527 Transistions
Info: 504318 Transversions
Info: 3.67333111251234 Ti/Tv ratio

Apart from this, 39,352 SNP's were found in only one of the two versions of the assembly. These could be attributed to gaps in the assembly that were filled at a later stage by Illumina sequencing. 

Sunday, May 19, 2013

Toxicity and expression data project

One possible solution to Analyzing in vitro toxicity and expression data perl project. Example table that is used as input to the perl script is given in below table.


"source_name_sid""casrn""chemical_name""assay_component_name""conc""value""value_type""plate_id""rep""row""col""target""time_hr""reference""media_reference""donor"
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0041.08"fold_change"11310"E"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0040.96"fold_change"11310"F"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0040.94"fold_change"11310"G"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0040.93"fold_change"11310"H"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.040.94"fold_change"1138"G"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.041.13"fold_change"1138"E"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.040.96"fold_change"1138"H"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.041.07"fold_change"1138"F"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.41.02"fold_change"1136"F"3"ABCB1"6300.5345.75778

Actual perl code:
#!/usr/bin/perl

open TOXTABLE, $ARGV[0] or die $!;

%gcct=();

while($line = <TOXTABLE>){
chomp $line;
@tabs=split(/\t/,$line);

$key=$tabs[2] . "_" . $tabs[4] . "_" . $tabs[11] . "_" . $tabs[12];
$foldchange=$tabs[5];

if(exists $gcct{$key}){if($gcct{$key}<$foldchange){$gcct{$key}=$foldchange;}}
else{$gcct{$key}=$foldchange;}

}#end of file while loop

print "chemical_name\tconctarget\ttime_hr\tvalue\n";

foreach $mykey(sort keys %gcct){
@mytabs=split(/\_/,$mykey);
print "$mytabs[0]\t$mytabs[1]\t$mytabs[2]\t$mytabs[3]\t$gcct{$mykey}\n";
}


Just run the perl script with the filename of the table as first argument and it will print "which chemical compound and at what concentration causes the highest fold change of transcription in each gene at each timepoint".

The same can be done with one line in R using :



 aggregate(M$value,by=list(M$chemical_name,M$conc,M$target,M$time_hr),max)->N  

Sunday, May 12, 2013

Defined by the unknown

How would one explain one's linear existence to a being that has no concept of time? Deep Space Nine  has a situation in which this transpires. The very philosophical nature of the entire scenario makes it a rather interesting situation.

"It is the unknown that defines our existence. We are constantly searching, not just for answers to our questions, but for new questions. We are explorers. We explore our lives, day by day, and we explore the galaxy, trying to expand the boundaries of our knowledge."

While this might be true of the utopian society that the federation succeeds in building, its far from the truth today. The search is driven more by profit, that even the Firangi would be proud of us. 

Friday, April 5, 2013

Ecological Speciation

Ecological Speciation by Patrick Nosil is a book with a red cover with a picture of a fish on its cover. An entire book dedicated to explaining "ecological speciation", its components, outstanding problems and current developments should definitely be comprehensive.  The book was written during Nosil's yearlong stay at the institute of advanced study at Berlin, Germany. It has been praised for its clarity and utility to even graduate students who are just entering the field.

While the book was published in 2012, this 2005 paper in Ecology letters gives a good premise to the book. The idea behind ecological speciation is the action of divergent selection that is due to ecological reasons which leads to the evolution of barriers to gene flow leading to the formation of new species. This process is broken down into its 3 constituents, ecologically divergent selection, barrier to gene flow or reproductive isolation and a "genetic" mechanism which links these two.

Why should the mechanism linking divergent selection and reproductive isolation be genetic? Depending on the comprehensiveness of this "genetic" mechanism, could one think of links other than pleiotropy or linkage?. For example, would a scenario that involves speciation (formation of barriers to gene flow) due to some other mechanism expedited by divergent selection be part of ecological speciation? It might be worth thinking about it while reading the chapter thats devoted to the mechanisms linking selection and isolation.

Concepts such as immigrant inviability and isolation by adaptation are also covered. Adaptive speciation is considered a special case of ecological speciation which requires frequency-dependent disruptive selection in sympatry or parapatry. Disruptive selection involves selection for opposite phenotypes within a single population while divergent selection involves selection for opposite phenotypes in different populations.

The book is made up of 9 chapters divided into 3 parts covering 280 pages. So this should definitely be a interesting read.




Saturday, March 16, 2013

Genomes are dynamic entities

Sometime ago, i was wondering about genomes that are impossible to assemble. Differences in the DNA content of Germline and somatic cells is one more oddity that can make genomes much more complex than they might appear. Roundworm genomes of the somatic and germline show a clear change from germline to somatic cells. The loss of genes expressed in the Germline suggests a role for adaptation which might have been taken over by other mechanisms such as methylation, non-coding rna and other silencers of gene expression.

The recently published genome of the lamprey is another species known to have a similar mechanism of genome change. Further studies into the mechanism of such changes, their functional role and evolutionary significance will probably provide some insight into the evolution of tissue specific expression.

The loss of genes expressed in the germline followed by addition of telomeres and location of lost genes near ends of chromosomes also introduces the possibility that loss of germline DNA could be mediated by a telomerase inhibitor such as PINX1. The idea is that PINX1 is expressed in the germline leading to DNA being chewed from the ends of the chromosome. Once, the sequence containing PINX1 is lost, it will lead to expression of telomerase and stop further loss of DNA. To test this hypothesis PINX1 should be available in germline DNA but lost in the somatic genome.

Having obtained the somatic and germline genomes from the ascaris genome website, the PINX1 gene is found to be located on scaffold "AS02549" in the somatic genome and "AG00021" in the germline genome. Hence, it seems unlikely to be the germline DNA loss mechanism. However, its still possible that a trans regulatory element might be lost or some other telomerase inhibitor might be involved. The availability of gene expression data allows one to test this hypothesis. The fact that neither telomerase nor PINX1 is seen in the list of differentialy expressed genes suggests that a different mechanism might be in action.


Thursday, February 28, 2013

unix sort on multiple columns

For the nth time today, i had to think how to do unix sort on multiple columns either based on numbers or alphabets. Fortunately, i had a coffee before, so i figured out that for sorting a file that looks like this:


chr01 1
chr01 10
chr01 2
chr01 5
chr01 8
chr02 9
chr02 9
chr02 19
chr02 98
chr02 92

you can run the command "sort -k1,1 -k2n,2 sorttest" from GNU coreutilities to get something that looks like this:


chr01 1
chr01 2
chr01 5
chr01 8
chr01 10
chr02 9
chr02 9
chr02 19
chr02 92
chr02 98

and run the command "sort -k1,1 -k2,2 sorttest" to get something that looks like this:

chr01 1
chr01 10
chr01 2
chr01 5
chr01 8
chr02 19
chr02 9
chr02 9
chr02 92
chr02 98

All commands were run using "GNU coreutils 8.12.197-032bb" compiled on September 2011 as part of the Ubuntu OS. 







Saturday, February 16, 2013

The process - inspired by Eugene Myers

Heard the LinnĂ© lecture by Eugene Myers about "Molecular Cell Biology via Bio-Image Informatics" and was so inspired by it, that had to blog about it. Apart being on the author list of the "blast" paper which has over 40,000 citations, he was part of the Human, mouse and fly genomes. 

Well, it seems the amounts of data generated by these projects is not enough for him! So he has decided to work with bio-image informatics, which involves interpretation of images generated by staining different features of species. Location of proteins within cells, location of cells within organs and movement of mouse whiskers in response to stimuli are some of the things he talked about. 

"Technology--->Experiment----> Analysis--->Model" is the process that he talked about. His main interest being in the experiment. Be it an experiment that uses Sanger sequencing technology to generate a human genome which is used build models of human genetics or an experiment that uses powerful microscopy technologies to analyse the relative location of cells in a organism to build a model of development and function. The clarity of his thought and expression was amazing, it is probably how a genius works.




Tuesday, February 12, 2013

Pleiotropy - mutations tend to affect more than 1 phenotypic character

As a continuation of the previous post on the Fundamental concepts in genetics series here is the next topic : "The pleiotropic structure of the genotype–phenotype map: the evolvability of complex organisms". The paper starts of with a description of the goals of genetics and how the Genotype-Phenotype map (GMP) is very useful in understanding genetics.

While pleiotropy has been sometimes defined as changes in one gene affecting traits that are seemingly unrelated, the part about "seemingly unrelated" is dropped probably as it is difficult to define what seems to be related or not.

The cost of complexity hypothesis postulates that complex organisms are fundamentally less evolvable compared to simpler organisms as complex organisms are more pleiotropic. Fisher's geometric model and the microscope analogy are described. While theoretical predictions are available, empirical data has only become available recently. Measurement of pleiotropy and the extent and patterns of pleiotropy are being studies with the datasets that have been generated. 

Measurement of pleiotropy is complicated by at least 2 different cases:
  1. Closely linked genes affecting 2 different traits tend to be co-inherited and can be appear to be due to pleiotropic effects of a single gene.
  2. Shared cis-regulatory element of 2 genes will affect the traits controlled by both genes. This has been called artefactual pleiotropy as pleiotropy can be defined as a character of a genes rather than that of a mutation.
 Apart from the technical difficulties involved is measuring pleiotropic effects, conceptual problems like the definition of a "phenotypic trait or character" make it difficult to have an objective measure of pleiotropy.
  • QTL data tend to overestimate pleiotropic effects due to biases caused due to linked genes
  • Gene knockout and knockdown experiments avoid the problem of closely linked genes but is only able it only measures mutations that lead to complete loss of gene function
  • Mesures of pleiotropy depend on the number and type of traits that are measured in an experiment
  • Traits that are beyond the detection limits of methods used to measure traits also tend to lead to a distorted picture of pleiotropy
  • Theoretical methods to estimate pleiotropy have used the relationship between the genetic load, effective population size and effective dimensionality of the phenotype (average pleiotropy) 
  • Distribution of fitness effects have also been used to predict phenotypic complexity based on predictions of FGM
Correlation among the traits also introduces many issues in estimation of pleiotropy.However, while universal pleiotropy was considered to be the norm, its being increasingly argued that the extent of pleiotropy is very minimal(variational modularity or restricted pleiotropy). With the datasets from Yeast, C.Elegans etc.. it is being seen that the degree of pleiotropy even with the upward biased estimates is rather low. 

While the molecular basis of Pleiotropy is not well studied, two types of pleiotropy were suggested by Gruneberg way back in 1938. Type I (initially called genuine) pleiotropy refers to multiple molecular functions of a single gene product. Type II (initially called spurious) pleiotropy refers to multiple morphological & physiological effects of a single molecular function. While, recent studies have shown type II pleiotropy to be most prevalent. Hence, the pleiotropy seen today could be a result of new biological functions being assigned to the same genes which still have the same molecular functions. However, the extent of pleiotropy is not something that is settled and will probably require a lot of work that would require solving the various issues involved in measuring and analyzing pleiotropy.


Unix for foreach loop with range and vector

To loop through values in unix or linux bash shell one can make use of the "for" loop.

Example: Specify range
 
for i in {1..24};
do
echo $i
done

One can also use C style for loops like

for ((i=1;i<=25;i++));
do
echo $i
done

It can be used like the perl's foreach loop by specifying an array. Note that range can also be descending. 


for i in {1..24} 25 50 {3..4} 100 150 {5..2} qw er st {a..z};
do
echo $i
done

You can also have strings and string range not just numbers.

Sunday, February 3, 2013

Epistasis - interaction between genes


Why are crows black? It seems the answer has two components, the first being the mechanistic and easier to answer while the second is evolutionary. While it might not be the ultimate answer (which is of course 42), it does go much further towards the "why" than just the "how".

Epistasis is simply defined as "interaction between genes". However, in his earlier paper "The Language of Gene Interaction" he explains the 2 slightly different ways the term has been used.

William Bateson is credited with "inventing" the term in 1908-09 to explain the disagreement between the segregation ratios expected based on the action of separate genes and the actual results of a dihybrid cross. The action of one locus(epistatic) masking the effects of alleles at another locus (hypostatic) gives the effect of the epistatic locus "standing upon" the hypostatic locus. However, the term has expanded to cover, 
  1. Functional relationship between genes--Functional epistasis--protein-protein interactions
  2. Genetic ordering of pathways--Compositional epistasis--"measures the effects of allele substitution against a particular fixed genetic background"
  3. The quantitative differences of allele-specific effects--Statistical epistasis--"measures the average effect of allele substitution against the population average genetic background"

Epistasis as a Tool- Flower color in sweet peas

Non-Mendelian  segregation ratio of 9:7 in the cross of two white flowers to produce violet flowers has been attributed to mutations in 2 different genes in the anthocyanin pathway. This framework has been extended to elucidate the order of genes in pathways by using knock-out mutations.

High-throughput approaches that test the effects of all possible combinations of genes in an organism are being done using comprehensive deletion and knockdown libraries along with high-throughput maintenance and screening methods. Both qualitative and quantitative experiments have tested different number of gene knockouts with various genetic backgrounds to understand the interactions between genes. However, the actual number of possible interactions is still a limiting factor. Moreover, interactions need not be a simple presence-absence effect, different expression levels, mutations at various positions in a gene etc can produce other possible combinations of interactions. While knockout libraries have been widely used in Yeast, it is not as easily applicable to many other systems. RNAi knockdown libraries have been used to overcome these limitations. 

Presence of epistatic interactions has been an obstacle to the mapping of many traits to their genes. Recently, few traits with epistatic effects have been identified in human disease genetics. Evolutionary basis of the advent of epistatic effects has been explained by atleast 3 different models. Further analysis of cases involving epistasis is required to understand the role played by evolution in generating or driving evolution.




Tuesday, January 8, 2013

Let It Be

While we have to "Imagine" , we also have to "Let It Be". Much more popular than other songs of the Beatles, this song says "Mother Mary comes to me Speaking words of wisdom". One could argue, that songs popularity comes from its religious connotation, but could also be due to its "wisdom" coming from a mother(Paul's). Another, interpretation i have heard is "Mary" being a herb!! This also makes sense, as "And in my hour of darkness she is standing right in front of me".

This song offers a solution, "And when the broken hearted people living in the world agree There will be an answer". So we can wait for the "broken hearted" to agree and "Let it be".


Sunday, January 6, 2013

Imagine there is no heaven

"Imagine" as its so often called refers to a Song that has been interpreted in various ways, criticized,  revered and even worshipped. The deceptively simple to understand lyrics and soothing music have probably contributed to its popularity in the 20th Century.

While many people have equated it to a communist recruitment song, it may indeed be a much bigger idea. In fact, you can imagine,whatever you want the song to mean. 

The very idea of a world with "Nothing to kill or die for" is so appealing to any human being irrespective of her religion, political ideology, country etc... its surprising that we still live in a world torn apart by war and strife. 

This probably is due to the human need to have something to kill or die for, a purpose for life even. Unless we have a different more attractive reason to "live for", we may just have to be content with dreams.

Only part of the song that baffles me is a "A brotherhood of man", why not a sisterhood of women? or better still a "Fellowship of Humanity"(this is actually the name of a church). One can only hope that, Lenon used the word brotherhood to mean "a group of people or all people"

A world with "no possessions" can be "Imagine"d today only as part of communism. Human pursuit of materialism can just be that "a pursuit". The irony of it was "probably" not lost on Thomas Jefferson who put "Life, liberty and the pursuit of happiness" in the US declaration of Independence. As the human race moves ahead it may "someday" come to realise that as long as we have greed we will have hunger "And the world will live as one".

Tuesday, October 16, 2012

Plant MicroRNA validation

Identification and validation of plant MicroRNA has been summarized by Meyers et.al., in the article Criteria for Annotation of Plant MicroRNAs. They build upon the previous article by Ambros et. al., which had expression and biogenesis criteria.

The expression criteria were:

  1. Hybridization of a "specific" RNA probe to a size selected RNA sample, using a method like Northern blotting.
  2. Identification of the "specific" RNA sequence in a size selected cDNA library. It is expected that this sequence matches the genomic sequence of the organism from which they were cloned.Various sequencing technologies have been used to generate size selected cDNA libraries.
 The Biogenesis criteria were:
  1. Structure prediction should support a fold-back precursor that contains the "specific" miRNA sequence within one arm of the hair-pin. Morever, this hairpin should have the lowest free energy as per a RNA-folding program like mfold "and must include at least 16 bp involving the first 22 nt of the miRNA and the other arm of the hairpin". Apart from meeting these criteria, the hairpin should also be free of any internal loops or large asymmetric bulges.
  2. The "specific" sequence and its predicted precursor fold-back secondary structure should be conserved.
  3. Increased accumulation of precursor in organisms with reduced Dicer function. 
Meeting any of these criteria individually is not sufficient to be considered a miRNA as even siRNA's meet the expression criteria and biogenesis criteria are not specific to miRNA's. Hence, both expression and biogenesis criteria are required for proper validation of miRNA as per Ambrose et.al., The more recent (2008) set of guidelines by Meyers et. al., tries to utilize the knowledge gained by studies in the 5 years since the initial criteria by Ambrose et.al in 2003.

The criteria set up by Meyers et. al., are grouped under primary and ancillary criteria with various precautions to be taken. It also has information about assigning miRNA's to families.

Primary criteria include presence of miRNA sequence in cDNA library and its validation by hybridization experiments and adherence to the expected stem-loop structure. Datasets with low coverage are to be treated with caution as they have chance of mistaking a siRNA as a miRNA.

While ancillary criteria can support a miRNA candidate that meets the primary criteria, it is considered neither necessary nor sufficient for a prediction. However, "clear" conservation is generally sufficient to annotate an miRNA, as long as it has satisfied the primary criteria in the organism in which it has an homolog.  Other criteria such as miRNA target prediction, DCL1 dependence, RDR and PolIV, PolV independence are useful for obtaining biological function information but not enough to validate a sequence as miRNA.

Recently, many new automated pipelines have been published for automating the process of miRNA identification, validation, annotation and target prediction.
  1. miRTour has a easy to use web interface to upload the EST/contigs but it has a 50Mb limit on the size of the dataset that can be uploaded.
  2. PIPmiR (Pipeline for the Identification of Plant miRNAs) provides an executable that can be downloaded and used.It has been used to predict both known and novel miRNAs in Arabidopsis.
  3. shortran: a pipeline for small RNA-seq data analysis is also available for download. 
  4. miRDeep-P is a version of miRDeep modified for use with plant transcriptomes.
  5. sRNA toolkit is a more general solution that identifies not only miRNA but other small RNA's.
The availability of numerous automated pipelines for plant miRNAs can be useful, but at the same time introduce its own set of artifactual problems.

Monday, October 15, 2012

Impossible genomes?

Few genomes are difficult (with current state of technology) to assemble due to their bizzare characteristics. The high cost of sanger sequencing, construction of FOSMID or BAC libraries, flow sorting of chromosomes and other wonderful methods makes the assembly of genomes with NGS methods difficult. However, even Sanger based methods find it difficult to sequence some genomes that are almost impossible to assemble to "completion". Genomes can be difficult due to reasons such as:

  1. Large size of genome: A genome that is very large (has many many bases) are difficult to sequence, mainly due to the higher costs involved in generating sufficient coverage. Largest known vertebrate genome is that of the Lungfish with a size of 133 Gb and canopy plant being the largest known plant genome with a size of 150 Gb. An amoeboid, Polychaos dubium might have the largest genome with a size of 670 Gb. Larger amounts of data are difficult to handle bioinformatically. Infact most assemblers would be unable to handle large amounts of data associated with these genomes. Moreover, these genomes are thought to be filled with repeats and genome duplication events making their assembly even more complicated.
  2. Repeat content of genome: Certain genomes are known to have very high transposon activity making them rich with repetitive content. These genomes need not be large, but can still be difficult to assemble due to the almost identical copies of DNA prevalent in the genome.
  3. Extremes in GC content: Certain genomes are known to have very high or very low GC content. This makes them difficult to sequence due to the bias involved in NGS methods. Although GC content extremes are constrained by the requirements imposed by the genetic code, Streptomyces coelicolor manages to have a GC-content of 72% while Plasmodium falciparum has just 20%. Apart from extremes in genome wide GC content, parts of the genome can have extremes in GC content making them difficult to sequence.
  4. Rarity of sample: Some organisms are so rare, that its almost as if they were extinct. Being able to find such species and obtaining enough DNA from them can be almost impossible. The situation is made more complicated by various legal, ethical and technical issues. Rarity of sample, could also be a result of the amazingly tiny amounts of DNA available in certain species. Cultivation of many microbial species in the lab is not yet possible and obtaining enough DNA from such species has driven research in the field of metagenomics and more recently single cell sequencing. DNA from extinct species is of lower quality and filled with many artifacts making correct assembly of genomes a daunting task. However, many of these problems have been overcome by novel methods and extinct species such as the mammoth, neanderthals, denisovan.... have been sequenced and assembled to a quality comparable to that of other NGS genome assemblies.
  5. Genome definition inconsistencies: To be able to assemble the genome of a certain species, it should be possible to define a species and what constitutes its genome. Symbiotic organisms can be difficult to delineate into distinct species, due to the high degree of inter-dependence of these species. The definition of species is a controversial subject and different interpretations of these definitions makes it controversial to claim sequencing of a particular "species".
  6. Dynamic nature of genomes: Genomes are generally though of as stable inherited genetic material which remain exactly the same over short periods of time. However, the genomes have many dynamic features. Telomeres change in length with age in almost all species. Similarly, small viral genomes with very high mutation rates can change drastically within the span of a few hours making them completely immune to a drug or conferring a new phenotype. Such changes will require re-sequencing of the genome to identify the changes to the genome.Species with different numbers of chromosomes along a cline are another type of dynamism.
Many other genome can be difficult to assemble due to other reasons?

Wednesday, October 10, 2012

5 rules of pollex about genomics

Ewan Birney wrote a post on 10 rules of thumb in genomics almost a year ago. It is very true and will probably remain like that for "some" more time. Few more rules of pollex in no particular order

  1. Trust the box, but know that the box is not perfect. Many steps in genomics require us to trust a black box, be it genome assemblers (that has way too much code to inspect) or simple read trimming programs. Moreover, manual verification of all data is impossible due to the shear size of the data.
  2. Default values are not always the best values.All most all bioinformatics tool have half a dozen parameters which have some default value set in them. However, these default values might not be suitable for your data or analysis. Not being aware of the various options can have unpredictable results.
  3. New programs keep coming up everyday.Unless the robustness of these programs has been tested by independent users, using these can be tricky.
  4. Changes in technology are very fast. Methods keep getting better all the time and can add more to an analysis. A bit contradictory to the previous point.
  5. Validate your results by various other methods. Artifacts are not unheard of in NGS data, and require validation from other methods.


Tuesday, September 25, 2012

The story of missing genes

The recent explosion in the number of genomes being sequenced has lead to interesting experiments that use genome assemblies to test hypothesis ranging from gene loss and gain, pathway enrichment analysis, gene repertoire evolution etc. However, many reviews that have analysed NGS assemblies have shown that genome assembly quality and annotation methodology determines gene content.

Human genome assemblies generated with the NGS data are completely missing more than 83 genes available in the Sanger genome assemblies. While no single functional category of genes are missing from the assemblies, genes located in regions difficult to assemble are lost often. MHC genes are notorious for being difficult to assemble. Analysis of MHC region using draft assemblies can lead to underestimating or predicting the loss genes in a lineage due to poor assembly quality.

Various programs have been developed to find these "missing genes" that are either lost in gaps in the assembly or too fragmented to be readily recognizable or even miss-assembled. For example, IMAGE genome assembler tries to find the parts that are lost in gaps. SOAP Gap Closer also closes gaps in the assembly. GAP Filler from the makers of SSPACE claims to be superior to both IMAGE and SOAP Gap closer in its ability to better predict gap sizes and also use multiple libraries without a corresponding increase in memory usage. Being able to replace N's in genome assemblies with the "correct bases" is very useful in being able to recover various features of the genome which were essentially lost in gaps.

Longer pacbio read based assembly of budgerigar genome is probably a way out of these poor assemblies. Genes like FOXP2 were found to be fragmented in assemblies that did not use the PBcR (PacBio) reads. These fragmented genes could be recovered in the assemblies that used PBcR reads. However, few regions needed a combination of different technologies to be able to recover certain regions. Its not surprising that NGS assemblies have so many missing bits when even the "completed" human genome is still being updated 13 years after being published. Advances not only in sequencing technology but also our understanding of the diversity in the genome and its representation will probably change the way we think of a genome assembly in the future.

Monday, September 24, 2012

Christian analogies in biology

Its amusing at times to find Christian analogies in biology (CAB) being misinterpreted and misused to the amusement of some and frustration of others. Hope many more such analogies are used in the future leading to much more bemusement. The inter-related nature of religion means that these words are claimed by multiple religious groups.

The purpose of this post is just to list these analogies to serve as some sort of a mnemonic. Think of them as bible analogies in biology (BAB) or Christian analogies in biology (CAB) or whatever else you fancy.
  1. Mitochondrial Eve and Y-chromosomal Adam: Used to denote the MRCA based on the mitochondria and Y-chromosome. Based on Adam and Eve, the first humans in the Bible.
  2. Lazarus taxon:Used to denote species that are missing from fossil records for some time and then re-appear at a later stage. Based on Lazarus who is restored to life by Jesus after being dead in the Gospel of John.
Any more?

RNA-seq challenges and strategies

RNA-seq is being used by a diverse group of users for investigating very diverse problems. De novo assembly has been seen as easy to use and effective way to utilize the power of RNA-sequencing in non-model organisms. While genome assembly programs have been investigated by various reviews and others such as ASEMBLETHON and GAGE, de novo transcriptome assemblers have not been investigated in great detail. We looked at de novo transcriptome assemblies generated by Illumina assemblers from many perspectives. Apart from the obvious structural error characterization, other factors such as sequencing error, polymorphism (pi), paralogs to name a few. 

Using real data would be ideal to test programs, however, in this case it becomes difficult to distinguish artifacts from novel biology, so simulation is a good idea to perform some sort of validation and quantification of errors of various types.

The simulation paper on RNA-seq for the special issue on Next-gen sequencing is finally available online : Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments.


Tuesday, September 11, 2012

Why god is so fond of beetles?

Beetles form insect order Coleoptera, which is very rich in diversity. JBS Haldane's quote "that god has an inordinate fondness for beetles" is used to portray this rather astonishing fact. Various experiments have been done to identify the reason for the larger diversity found in beetles. 

Reasons such as host specificity driven speciation, specifically herbivory and angiosperm evolution in particular is suspected.While this study has contradicted this line of reasoning based on large scale phylogenies.

Higher survival capabilities of the beetle lineages has been attributed to their pre-Cretaceous origins. Was this diversity a result of an adaptive radiation that has been sustained by further diversification into a variety of niches? Does the beetle have a better ability to adapt? What makes it more adaptable?

Do the reasons for the diversification come from predators and defense mechanisms that beetles had to develop to escape these predators. In fact beetles use mimicry, chemical defenses, strong mandibles or horns or spines to deter predators.

Sex chromosomes of beetles have interesting biology and could play a role in the generating diversity.Frequent changes in sex determination mechanisms as has been seen in many insect groups could be important.

Could the genome shed some light on the reasons? The genome of the red flour beetle (Tribolium castaneum) has been sequenced. Presence of genes involved in detoxification, development, cell-cell communication and various other genes involved in ability to interact with diverse environments might explain the extraordinary ability of the beetles to adapt.

On the contrary Haldane was a self professed atheist who thought that "My practice as a scientist is atheistic. That is to say, when I set up an experiment I assume no god, angel or devil is going to interfere with its course... I should therefore be intellectually dishonest if I were not also atheistic in the affairs of the world.". So may be he was making a more complicated point which we mere mortals have failed to comprehend?

Tuesday, September 4, 2012

Cold dog Stockholm Sweden

The lion statue in Stockholm although not famous is known to people who visit the palace. The cold dog with rags drapped around it is almost not seen squatting in a corner.

Cold dog Stockholm, Sweden
Even though its not as majestic as the lion or one of the kings who ruled Sweden, its amazing in its own way. The rags which look so realistic are made out of metal, but on the first look very realistic. It is at home in the cold winters but even in the summer it is not out of place.

Get duplicate fasta sequences

ALLPATHS-LG had the DEDUP option turned off by default prior to release 42726. So this bit of code identifies exact duplicates either on the same or negative strand. Many of my runs had 30 to 60 Kb of sequence that was duplicated with almost equal amounts on both strands.

 #!/usr/bin/perl  
 use warnings;  
 # Input parameters  
 open FASTA, $ARGV[0] or die $!;  
 my $seqst_temp="";  
 my $seqs = {GENENAME =>my $genename,LEN =>my $qcom};  
 while($line = <FASTA>){  
 if($line=~ /^>/){  
 if($header){  
 if(exists $seqs{$seqst_temp}{GENENAME}){print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n";}  
 $rseqst_temp = $seqst_temp;  
 $rseqst_temp=revcomp($rseqst_temp);  
 if(exists $seqs{$rseqst_temp}{GENENAME}){print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n";}  
 $seqs{$seqst_temp}{GENENAME}=$header;  
 $seqs{$seqst_temp}{LEN}=length $seqst_temp;  
 }  
 chomp $line;  
 $header="";  
 $header=$line;  
 $seqst_temp="";  
 $rseqst_temp="";  
 }  
 else{  
 $line =~ s/[\n\t\f\r_0-9\s]//g;  
 $seqst_temp .= $line;  
 }  
 }#end of while loop  
 if($header){  
 if(exists $seqs{$seqst_temp}{GENENAME}){print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n";}  
 $rseqst_temp = $seqst_temp;  
 $rseqst_temp=revcomp($rseqst_temp);  
 if(exists $seqs{$rseqst_temp}{GENENAME}){print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n";}  
 $seqs{$seqst_temp}{GENENAME}=$header;  
 $seqs{$seqst_temp}{LEN}=length $seqst_temp;  
 }  
 close FASTA;  
 sub revcomp{  
     my $input = shift;  
     my $revcomp = reverse($input);  
     $revcomp =~ tr/ACGTacgt/TGCAtgca/;  
     return $revcomp;  
 }  

Although this does the job, its ugly in having the same bit of code used twice, once within the loop and once after the while loop. May be the end of file can be handled more elegantly.

Wednesday, March 28, 2012

MD5 file list checker

This program reads a file with filenames with correct path and MD5 values separated by a tab character and checks if the MD5 is correct or not.

#!/usr/bin/perl
use strict;
use Digest::MD5 qw(md5_base64);

open MD5, $ARGV[0] or die $!;
my $line="";
while($line = ){
chomp $line;
my @files=split(/[ \t]+/,$line);
open(FILE, $files[1]) or die "Can't find file $files[1]\n";
my $digobj = Digest::MD5->new;
$digobj->addfile(*FILE);
$digest = $digobj->hexdigest;
close(FILE);
if($digest!~m/$files[0]/){print "Md5 does not match for file:" . $files[1];}
else{
print "Md5 match for file:$files[1]\n";
print $digest ."\n" . $files[0] . "\n";
}
}

It would make sense to have MD5 checks integrated into the OS and have a MD5 list in each folder. May be it will get added into the code of various programs like ftp, sftp or even ordinary copy and mv.

Sort fasta file

Many programs like GATK require the fasta files to be sorted before use. Here is a rather simple script for the job:

 #!/usr/bin/perl  
 open FASTA, $ARGV[0] or die $!;  
 my $temp="";  
 my $seqs = {SEQ =>my $fheader};  
 my $sortemp="";  
 while($line = <FASTA> ){  
 if($line=~ /^>/){  
 if($header){$seqs{$header}{SEQ}=$temp;}  
 chomp $line;  
 $header="";  
 $line =~ s/[\s]/_/g;  
 $header=$line;  
 $temp="";  
 }  
 else{$line =~ s/[\n\t\f\r_0-9\s]//g;$temp .= $line;}  
 }#end of while loop  
 if($header){$seqs{$header}{SEQ}=$temp;}  
 close FASTA;  
 foreach $sortemp (sort keys %seqs) {  
 print "$sortemp\n";  
 print "$seqs{$sortemp}{SEQ}\n";  
 }  

However, you can find more elegant solutions that use Bioperl at Wolf/Takebayashi lab.

Saturday, August 20, 2011

Why you cannot give Feedback to the LS?

The recent Anna Hazare movement has been opposed by many. Intellectuals like Nandan Nilekani among many others wanted the use better ways to comment or provide feedback to the parliament.

I wonder if Mr Nilekani has ever tried using the feedback form on the parliament of India website. The website which is designed and hosted by National Informatics Centre has a very interesting feature indeed :)

It has excellent code which makes sure that feedback can never be submitted. If you don't believe me you can try it yourself. As soon as you type anything in the feedback form the javascript onkey event calls the validate function. This function is supposed to replace all special characters. However, the brilliant person who designed this included the ^ (caret) character in the "myC" array that lists all special characters. This results in all possible characters being matched by the Regular expression. Hence, anything thats typed into the feedback form gets deleted.

If this is the situation in a website that can be easily verified by anybody with access to the internet. We can only imagine what happens to the feedback which actually manages to reach the Politicians after manging to escape the caret.

Monday, July 4, 2011

English? Inglish? Or watever this is!!


Recently, i found this(See picture) stamped on a bill from westside. It says, "No Exchange on Sale Merchandise". My first thought was it meant "No Exchange on Sold Merchandise". Would this qualify for Inglish? May be the BBC would find it interesting enough to publish? A similar collection of images about chinglish was published few years ago.

The English may find that their language is being ruined. But is that the price that must be paid to be an international language? How much can a language be stretched and modified before it stops being itself?

After further thought, it seemed that it actually meant "No Exchange on merchandise which is on 'Sale(Being sold at a lower price)' ". May be it is self evident as a measure to save on print space.

Wednesday, June 1, 2011

Hyperboloids of wondrous Light

" Hyperboloids of wondrous Light

Rolling for aye through Space and Time

Harbour those Waves which somehow Might

Play out God's holy pantomime”

This is what it says on the gravestone of Alan Turing, the code breaker who was involved in breaking the Enigma's cryptic codes. I like to think of these hyperboloids as the very essence of the universe. An explanation for the existence of our universe and development of intelligent life.

Many would agree that the "genome" is indeed God's holy pantomime. It will not be surprising that it plays out not just in the living world but also in most things in the universe. A mathematical explanation for the way things happen. How probable is the development of intelligent life? Is the universe bound by rules that direct the formation of intelligent life? These are some questions that could shed light on the oldest questions that man has sought answers to.

Turing does feature on the list of atheists, although its not clear when and what changed his thoughts. However, its probably worth noting that the "God" here would most likely not refer to a god associated with any religious denomination. A spiritual god cannot be ruled out, but most likely refers to a mathematical god, one of  physical principles.

Wednesday, April 20, 2011

The problem with walls

It is very common to have walls surrounding most buildings in tropical countries. However, walls are rare and fences are more common in the colder(LOL) countries. Is it just a question of style or does it have anything to do with the climate?

Walls would make it easier for the snow to pileup around a building making it inaccessible. Removal of snow would also be complicated by the presence of solid structures hidden below layers of snow. Such obstructions would have to be properly marked, if heavy machinery is to be used for clearing the snow. Moreover, as the snow melts, walls without proper drainage could cause the formation of puddles and stagnation of water. These seem reasonable enough to have fences instead of walls.

Forts had walls to act as a barrier to invading armies. With the advent of modern warfare, walls have lost their significance as defenses against human armies. Cattle and wild animals can be contained within or outside walls.

Thursday, March 3, 2011

Break scaffolds into Mate pairs

Could not find any programs that can merge a set of scaffolds. Hence, decided to break the scaffolds into mate pairs and use those for scaffolding. Below is a simple perl script to break scaffolds into all possible mate pairs in a format suitable for SSPACE (a hierarchical scaffolder). It can easily be modified to work for BAMBUS or any other scaffolder or assembler.


#!/usr/bin/perl

#AUTHOR
# Nagarjun Vijay (c) 2011
# mate maker.pl - program to break scaffolds into mate pairs
#

#DATE
# 3rd March 2011
#

use strict;

my $scafile=$ARGV[0];

#reading the scaffold sequences
open SCAF, "<",$scafile or die $!;

#open contigs file
open CONTIG, ">>",$scafile . ".contigs.fa" or die $!;


#open contigs file
open LIB, ">>",$scafile . ".lib" or die $!;

my ($scaffoldcount,$seqst_temp,$contigcount,$notscaffoldcount,$z,$head,$i,$j,$libcount);
$scaffoldcount=0,$seqst_temp="",$contigcount=0,$notscaffoldcount=0,$libcount=0;
my $header = ;#reading the first header

while($z = ){
if($z=~ />/)
{
#split scaffold into contigs
if($seqst_temp=~ m/[N]+/){#is really a scaffold
$scaffoldcount++;
$contigcount +=SplitScaffold($seqst_temp);
}#end of scaffold check if
else{#not a scaffold - so not writing into breaker file
$head=">unbreak";
$notscaffoldcount++;
print CONTIG $head . "" . $notscaffoldcount ."\n";
print CONTIG "$seqst_temp\n";
}#end of scaffold check else

#next sequence
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($seqst_temp=~ m/[N]+/){#is really a scaffold
$scaffoldcount++;
$contigcount +=SplitScaffold($seqst_temp);
}#end of scaffold check if
else{#not a scaffold - so not writing into breaker file
$head=">unbreak";
$notscaffoldcount++;
print CONTIG $head . "" . $notscaffoldcount ."\n";
print CONTIG "$seqst_temp\n";
}#end of scaffold check else
close SCAF;
close CONTIG;

#Split scaffolds, write contigs and breaker file
sub SplitScaffold{
my $seq = shift;
my ($head,$num,$contigs,$n);
$head=">breaker$scaffoldcount";
$num=1,$contigs=1;
my @values=split(/([N]+)/,$seq);
my $vallen= scalar (@values);
for($i=0;$i<$vallen;$i++) {
my $val=$values[$i];
if(($num % 2) == 0){
$n = length $val;
}
else{
print CONTIG $head . "" . $contigs ."\n";
print CONTIG "$val\n";
$contigs++;
my $matedist=(length $values[$i])+(length $values[$i+1])+(length $values[$i+2]);
for($j=$i+2;$j<$vallen;$j=$j+2){
print LIB "lib".$libcount." l".$libcount."s".$scaffoldcount."b".$i.".fasta l".$libcount."s".$scaffoldcount."b".$j.".fasta ".$matedist." 0.75 0\n";
$matedist=$matedist+(length $values[$j+1])+(length $values[$j+2]);
open READ1, ">>","l".$libcount."s".$scaffoldcount."b".$i.".fasta" or die $!;
open READ2, ">>","l".$libcount."s".$scaffoldcount."b".$j.".fasta" or die $!;
print READ1 ">l".$libcount."s".$scaffoldcount."b".$i."\n";
print READ1 $values[$i]."\n";
print READ2 ">l".$libcount."s".$scaffoldcount."b".$j."\n";
print READ2 reverseComplement($values[$j])."\n";
close READ1;
close READ2;
$libcount++;
}

}
$num++;
}

return $contigs;
}
#reverse complement a sequence
sub reverseComplement{
$_ = shift;
tr/ATGC/TACG/;
return (reverse());
}
#end of Breaker

Tuesday, February 22, 2011

convert Fasta files into TIGR .contig file

During the process of converting Fasta files with gaps to AFG format, one step required generation of TIGR .contig format. Given below is a simple script to generate a dummy .contig file which has just one read for each contig which is exactly the same as the contig. This is done due to the lack of read tracking in SOAP, SSPACE etc. and also to avoid the huge file size if all the reads are retained.

 #!/usr/bin/perl  
 open FASTA, "< $ARGV[0]" or die "Can't open $ARGV[0] ($!)\n";  
 while($z=<FASTA>){  
  if($z=~ />/)  
  {  
  chomp $z;  
  $header=(split(/>/,$z))[1];  
  $readid=(split(/_/,$header))[1];  
  $seq=;  
  chomp $seq;  
  $seqlen=length $seq;  
  print "##$header 1 $seqlen bases, 00000000 checksum.\n";  
  print "$seq\n";  
  print "#$readid(0) [] $seqlen bases, 00000000 checksum. {$seqlen 0} <1>\n";  
  print "$seq\n";  
  }  
 }  

save it as fasta2contig.pl and use as "perl fasta2contig.pl contig.fa"

Thursday, January 20, 2011

Minimus2 larger datasets with Nucmer

While Blat is much more faster than Nucmer, its results can be very different from the results from Nucmer. A version of Minimus2 which splits the data before feeding it to Nucmer and merging it back before processing further is able to give a result which is almost always identical to the results from Nucmer. The changes required to be done are given below:

NUCMER = nucmer
DELTAFILTER = delta-filter
SHOWCOORDS = show-coords
RUNNUCMER = perl runnucmer.pl
#------------------------------------------------------------------------------#

## Building AMOS bank & Dumping reads
10: rm -fr $(BANK)
11: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)
12: $(BINDIR)/dumpreads $(BANK) -M $(REFCOUNT) > $(REFSEQ)
13: $(BINDIR)/dumpreads $(BANK) -m $(REFCOUNT) > $(QRYSEQ)

## Getting overlaps
20: $(RUNNUCMER) $(OVERLAP) $(REFSEQ) $(QRYSEQ) $(PREFIX)
21: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | $(BINDIR)/nucmerAnnotate | egrep 'BEGIN|END|CONTAIN|IDENTITY' > $(COORDS)
22: $(BINDIR)/nucmer2ovl -ignore $(MAXTRIM) -tab $(COORDS) | $(BINDIR)/sort2 > $(OVLTAB)


The code for runnucmer.pl is given below:

#!/usr/bin/perl

$allowedsize=100000000;

$overlap=$ARGV[0];
$mainref=$ARGV[1];
$mainqry=$ARGV[2];
$mainpre=$ARGV[3];
$reflen=0;
$refcount=1;

$qrylen=0;
$qrycount=1;

#open first ref file
open FILE1, ">>",$mainpre."ref".$refcount or die $!;

#reading the refrence sequences
open FILE2, "<",$mainref or die $!;


$seqst_temp="";
$header = ;#reading the first header

while($z = ){
if($z=~ />/)
{#next sequence
$reflen=$reflen+length $seqst_temp;
if($reflen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$refcount++;
$reflen=0;
close FILE1;
open FILE1, ">>",$mainpre."ref".$refcount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($reflen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$refcount++;
$reflen=0;
close FILE1;
open FILE1, ">>",$mainpre."ref".$refcount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";



close FILE1;
close FILE2;

#open first qry file
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;

#reading the refrence sequences
open FILE2, "<",$mainqry or die $!;


$seqst_temp="";
$header = ;#reading the first header

while($z = ){
if($z=~ />/)
{#next sequence
$qrylen=$qrylen+length $seqst_temp;
if($qrylen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$qrycount++;
close FILE1;
$qrylen=0;
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($qrylen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$qrycount++;
close FILE1;
$qrylen=0;
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";

close FILE1;
close FILE2;

for($i=1;$i<=$qrycount;$i++){
for($j=1;$j<=$refcount;$j++){
$refseq=$mainpre."ref".$j;
$qryseq=$mainpre."qry".$i;
$prefix=$mainpre."nuc".$j."".$i;
system("/bubo/home/h7/nagarjun/glob/tools/MUMmer3.22/nucmer -maxmatch -c $overlap $refseq $qryseq -p $prefix");
}
}


#merging delta files
open FILE1, ">>",$mainpre.".delta" or die $!;
for($i=1;$i<=$qrycount;$i++){
for($j=1;$j<=$refcount;$j++){
open FILE2, "<",$mainpre."nuc".$j."".$i.".delta" or die $!;
if(($i==1)&&($j==1)){while($a=){print FILE1 $a;}}
else{$a=;$a=;while($a=){print FILE1 $a;}}
}
}

Tuesday, January 18, 2011

Blat instead of Nucmer in Minimus2

Minimus2 is a good utility to merge assemblies based on the AMOS architecture. However, it has a limitation on the size of the input datasets. This is due to the Nucmer program is its pipline. Apart from Nucmer other steps are able to handle much bigger chunks of data. So instead of using Nucmer i tried using Blat. Although this gave me assemblies, they were not always correct and complete.

May be someday somebody will use it.In the meantime i try to use Nucmer iteratively on smaller chunks and then merge the output before the next step. This is possible as Nucmer is run with the option -maxmatch in Minimus, making it independent of the other sequences in the dataset.

1)The changes to Minimu2.acf in amos/src/Pipeline/:

BINDIR = /usr/local/bin
NUCMER = /usr/local/nucmer
DELTAFILTER = /usr/local/delta-filter
SHOWCOORDS = /usr/local/show-coords
BLAT = blat
BTON = perl psltoNucmer.pl

#-----------------------------
-------------------------------------------------#

## Building AMOS bank & Dumping reads
10: rm -fr $(BANK)
11: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)
12: $(BINDIR)/dumpreads $(BANK) -M $(REFCOUNT) > $(REFSEQ)
13: $(BINDIR)/dumpreads $(BANK) -m $(REFCOUNT) > $(QRYSEQ)

## Getting overlaps
20: $(BLAT) $(REFSEQ) $(QRYSEQ) $(BOUT) -fastMap -noHead -minIdentity=98
21: $(BTON) $(BOUT) | $(BINDIR)/nucmerAnnotate | egrep 'BEGIN|END|CONTAIN|IDENTITY' > $(COORDS)
22: $(BINDIR)/nucmer2ovl -ignore $(MAXTRIM) -tab $(COORDS) | $(BINDIR)/sort2 > $(OVLTAB)


2) The code for psltoNucmer.pl is given below:

#!/usr/bin/perl

open FILE2, "<",$ARGV[0] or die $!;

while($z = ){
@values=split(/\t/,$z);
print $values[15]."\t".$values[16]."\t|\t".$values[11]."\t".$values[12]."\t|\t".abs($values[16]-$values[15])."\t".abs($values[12]-$values[11])."\t|\t".(sprintf "%.2f",(($values[0]+$values[2]+$values[3])*100)/abs($values[12]-$values[11]))."\t|\t".$values[14]."\t".$values[10]."\t|\t".(sprintf "%.2f",(abs($values[16]-$values[15])*100/$values[14]))."\t".(sprintf "%.2f",(abs($values[12]-$values[11])*100/$values[10]))."\t|\t".$values[13]."\t".$values[9]."\n";
}#end of File while loop

3) blat output format details can be found here : http://genome.ucsc.edu/goldenPath/help/blatSpec.html