Thursday, January 20, 2011

Minimus2 larger datasets with Nucmer

While Blat is much more faster than Nucmer, its results can be very different from the results from Nucmer. A version of Minimus2 which splits the data before feeding it to Nucmer and merging it back before processing further is able to give a result which is almost always identical to the results from Nucmer. The changes required to be done are given below:

NUCMER = nucmer
DELTAFILTER = delta-filter
SHOWCOORDS = show-coords
RUNNUCMER = perl runnucmer.pl
#------------------------------------------------------------------------------#

## Building AMOS bank & Dumping reads
10: rm -fr $(BANK)
11: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)
12: $(BINDIR)/dumpreads $(BANK) -M $(REFCOUNT) > $(REFSEQ)
13: $(BINDIR)/dumpreads $(BANK) -m $(REFCOUNT) > $(QRYSEQ)

## Getting overlaps
20: $(RUNNUCMER) $(OVERLAP) $(REFSEQ) $(QRYSEQ) $(PREFIX)
21: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | $(BINDIR)/nucmerAnnotate | egrep 'BEGIN|END|CONTAIN|IDENTITY' > $(COORDS)
22: $(BINDIR)/nucmer2ovl -ignore $(MAXTRIM) -tab $(COORDS) | $(BINDIR)/sort2 > $(OVLTAB)


The code for runnucmer.pl is given below:

#!/usr/bin/perl

$allowedsize=100000000;

$overlap=$ARGV[0];
$mainref=$ARGV[1];
$mainqry=$ARGV[2];
$mainpre=$ARGV[3];
$reflen=0;
$refcount=1;

$qrylen=0;
$qrycount=1;

#open first ref file
open FILE1, ">>",$mainpre."ref".$refcount or die $!;

#reading the refrence sequences
open FILE2, "<",$mainref or die $!;


$seqst_temp="";
$header = ;#reading the first header

while($z = ){
if($z=~ />/)
{#next sequence
$reflen=$reflen+length $seqst_temp;
if($reflen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$refcount++;
$reflen=0;
close FILE1;
open FILE1, ">>",$mainpre."ref".$refcount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($reflen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$refcount++;
$reflen=0;
close FILE1;
open FILE1, ">>",$mainpre."ref".$refcount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";



close FILE1;
close FILE2;

#open first qry file
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;

#reading the refrence sequences
open FILE2, "<",$mainqry or die $!;


$seqst_temp="";
$header = ;#reading the first header

while($z = ){
if($z=~ />/)
{#next sequence
$qrylen=$qrylen+length $seqst_temp;
if($qrylen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$qrycount++;
close FILE1;
$qrylen=0;
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($qrylen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$qrycount++;
close FILE1;
$qrylen=0;
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";

close FILE1;
close FILE2;

for($i=1;$i<=$qrycount;$i++){
for($j=1;$j<=$refcount;$j++){
$refseq=$mainpre."ref".$j;
$qryseq=$mainpre."qry".$i;
$prefix=$mainpre."nuc".$j."".$i;
system("/bubo/home/h7/nagarjun/glob/tools/MUMmer3.22/nucmer -maxmatch -c $overlap $refseq $qryseq -p $prefix");
}
}


#merging delta files
open FILE1, ">>",$mainpre.".delta" or die $!;
for($i=1;$i<=$qrycount;$i++){
for($j=1;$j<=$refcount;$j++){
open FILE2, "<",$mainpre."nuc".$j."".$i.".delta" or die $!;
if(($i==1)&&($j==1)){while($a=){print FILE1 $a;}}
else{$a=;$a=;while($a=){print FILE1 $a;}}
}
}

Tuesday, January 18, 2011

Blat instead of Nucmer in Minimus2

Minimus2 is a good utility to merge assemblies based on the AMOS architecture. However, it has a limitation on the size of the input datasets. This is due to the Nucmer program is its pipline. Apart from Nucmer other steps are able to handle much bigger chunks of data. So instead of using Nucmer i tried using Blat. Although this gave me assemblies, they were not always correct and complete.

May be someday somebody will use it.In the meantime i try to use Nucmer iteratively on smaller chunks and then merge the output before the next step. This is possible as Nucmer is run with the option -maxmatch in Minimus, making it independent of the other sequences in the dataset.

1)The changes to Minimu2.acf in amos/src/Pipeline/:

BINDIR = /usr/local/bin
NUCMER = /usr/local/nucmer
DELTAFILTER = /usr/local/delta-filter
SHOWCOORDS = /usr/local/show-coords
BLAT = blat
BTON = perl psltoNucmer.pl

#-----------------------------
-------------------------------------------------#

## Building AMOS bank & Dumping reads
10: rm -fr $(BANK)
11: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)
12: $(BINDIR)/dumpreads $(BANK) -M $(REFCOUNT) > $(REFSEQ)
13: $(BINDIR)/dumpreads $(BANK) -m $(REFCOUNT) > $(QRYSEQ)

## Getting overlaps
20: $(BLAT) $(REFSEQ) $(QRYSEQ) $(BOUT) -fastMap -noHead -minIdentity=98
21: $(BTON) $(BOUT) | $(BINDIR)/nucmerAnnotate | egrep 'BEGIN|END|CONTAIN|IDENTITY' > $(COORDS)
22: $(BINDIR)/nucmer2ovl -ignore $(MAXTRIM) -tab $(COORDS) | $(BINDIR)/sort2 > $(OVLTAB)


2) The code for psltoNucmer.pl is given below:

#!/usr/bin/perl

open FILE2, "<",$ARGV[0] or die $!;

while($z = ){
@values=split(/\t/,$z);
print $values[15]."\t".$values[16]."\t|\t".$values[11]."\t".$values[12]."\t|\t".abs($values[16]-$values[15])."\t".abs($values[12]-$values[11])."\t|\t".(sprintf "%.2f",(($values[0]+$values[2]+$values[3])*100)/abs($values[12]-$values[11]))."\t|\t".$values[14]."\t".$values[10]."\t|\t".(sprintf "%.2f",(abs($values[16]-$values[15])*100/$values[14]))."\t".(sprintf "%.2f",(abs($values[12]-$values[11])*100/$values[10]))."\t|\t".$values[13]."\t".$values[9]."\n";
}#end of File while loop

3) blat output format details can be found here : http://genome.ucsc.edu/goldenPath/help/blatSpec.html