Nagarjun's blog: 2011

Saturday, August 20, 2011

Why you cannot give Feedback to the LS?

The recent Anna Hazare movement has been opposed by many. Intellectuals like Nandan Nilekani among many others wanted the use better ways to comment or provide feedback to the parliament.

I wonder if Mr Nilekani has ever tried using the feedback form on the parliament of India website. The website which is designed and hosted by National Informatics Centre has a very interesting feature indeed :)

It has excellent code which makes sure that feedback can never be submitted. If you don't believe me you can try it yourself. As soon as you type anything in the feedback form the javascript onkey event calls the validate function. This function is supposed to replace all special characters. However, the brilliant person who designed this included the ^ (caret) character in the "myC" array that lists all special characters. This results in all possible characters being matched by the Regular expression. Hence, anything thats typed into the feedback form gets deleted.

If this is the situation in a website that can be easily verified by anybody with access to the internet. We can only imagine what happens to the feedback which actually manages to reach the Politicians after manging to escape the caret.

Monday, July 4, 2011

English? Inglish? Or watever this is!!

Recently, i found this(See picture) stamped on a bill from westside. It says, "No Exchange on Sale Merchandise". My first thought was it meant "No Exchange on Sold Merchandise". Would this qualify for Inglish? May be the BBC would find it interesting enough to publish? A similar collection of images about chinglish was published few years ago.

The English may find that their language is being ruined. But is that the price that must be paid to be an international language? How much can a language be stretched and modified before it stops being itself?

After further thought, it seemed that it actually meant "No Exchange on merchandise which is on 'Sale(Being sold at a lower price)' ". May be it is self evident as a measure to save on print space.

Wednesday, June 1, 2011

Hyperboloids of wondrous Light

" Hyperboloids of wondrous Light

Rolling for aye through Space and Time

Harbour those Waves which somehow Might

Play out God's holy pantomime”

This is what it says on the gravestone of Alan Turing, the code breaker who was involved in breaking the Enigma's cryptic codes. I like to think of these hyperboloids as the very essence of the universe. An explanation for the existence of our universe and development of intelligent life.

Many would agree that the "genome" is indeed God's holy pantomime. It will not be surprising that it plays out not just in the living world but also in most things in the universe. A mathematical explanation for the way things happen. How probable is the development of intelligent life? Is the universe bound by rules that direct the formation of intelligent life? These are some questions that could shed light on the oldest questions that man has sought answers to.

Turing does feature on the list of atheists, although its not clear when and what changed his thoughts. However, its probably worth noting that the "God" here would most likely not refer to a god associated with any religious denomination. A spiritual god cannot be ruled out, but most likely refers to a mathematical god, one of physical principles.

Wednesday, April 20, 2011

The problem with walls

It is very common to have walls surrounding most buildings in tropical countries. However, walls are rare and fences are more common in the colder(LOL) countries. Is it just a question of style or does it have anything to do with the climate?

Walls would make it easier for the snow to pileup around a building making it inaccessible. Removal of snow would also be complicated by the presence of solid structures hidden below layers of snow. Such obstructions would have to be properly marked, if heavy machinery is to be used for clearing the snow. Moreover, as the snow melts, walls without proper drainage could cause the formation of puddles and stagnation of water. These seem reasonable enough to have fences instead of walls.

Forts had walls to act as a barrier to invading armies. With the advent of modern warfare, walls have lost their significance as defenses against human armies. Cattle and wild animals can be contained within or outside walls.

Thursday, March 3, 2011

Break scaffolds into Mate pairs

Could not find any programs that can merge a set of scaffolds. Hence, decided to break the scaffolds into mate pairs and use those for scaffolding. Below is a simple perl script to break scaffolds into all possible mate pairs in a format suitable for SSPACE (a hierarchical scaffolder). It can easily be modified to work for BAMBUS or any other scaffolder or assembler.

#!/usr/bin/perl

#AUTHOR
# Nagarjun Vijay (c) 2011
# mate maker.pl - program to break scaffolds into mate pairs
#

#DATE
# 3rd March 2011
#

use strict;

my $scafile=$ARGV[0];

#reading the scaffold sequences
open SCAF, "<",$scafile or die $!;

#open contigs file
open CONTIG, ">>",$scafile . ".contigs.fa" or die $!;

#open contigs file
open LIB, ">>",$scafile . ".lib" or die $!;

my ($scaffoldcount,$seqst_temp,$contigcount,$notscaffoldcount,$z,$head,$i,$j,$libcount);
$scaffoldcount=0,$seqst_temp="",$contigcount=0,$notscaffoldcount=0,$libcount=0;
my $header = ;#reading the first header

while($z = ){
if($z=~ />/)
{
#split scaffold into contigs
if($seqst_temp=~ m/[N]+/){#is really a scaffold
$scaffoldcount++;
$contigcount +=SplitScaffold($seqst_temp);
}#end of scaffold check if
else{#not a scaffold - so not writing into breaker file
$head=">unbreak";
$notscaffoldcount++;
print CONTIG $head . "" . $notscaffoldcount ."\n";
print CONTIG "$seqst_temp\n";
}#end of scaffold check else

#next sequence
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($seqst_temp=~ m/[N]+/){#is really a scaffold
$scaffoldcount++;
$contigcount +=SplitScaffold($seqst_temp);
}#end of scaffold check if
else{#not a scaffold - so not writing into breaker file
$head=">unbreak";
$notscaffoldcount++;
print CONTIG $head . "" . $notscaffoldcount ."\n";
print CONTIG "$seqst_temp\n";
}#end of scaffold check else
close SCAF;
close CONTIG;

#Split scaffolds, write contigs and breaker file
sub SplitScaffold{
my $seq = shift;
my ($head,$num,$contigs,$n);
$head=">breaker$scaffoldcount";
$num=1,$contigs=1;
my @values=split(/([N]+)/,$seq);
my $vallen= scalar (@values);
for($i=0;$i<$vallen;$i++) {
my $val=$values[$i];
if(($num % 2) == 0){
$n = length $val;
}
else{
print CONTIG $head . "" . $contigs ."\n";
print CONTIG "$val\n";
$contigs++;
my $matedist=(length $values[$i])+(length $values[$i+1])+(length $values[$i+2]);
for($j=$i+2;$j<$vallen;$j=$j+2){
print LIB "lib".$libcount." l".$libcount."s".$scaffoldcount."b".$i.".fasta l".$libcount."s".$scaffoldcount."b".$j.".fasta ".$matedist." 0.75 0\n";
$matedist=$matedist+(length $values[$j+1])+(length $values[$j+2]);
open READ1, ">>","l".$libcount."s".$scaffoldcount."b".$i.".fasta" or die $!;
open READ2, ">>","l".$libcount."s".$scaffoldcount."b".$j.".fasta" or die $!;
print READ1 ">l".$libcount."s".$scaffoldcount."b".$i."\n";
print READ1 $values[$i]."\n";
print READ2 ">l".$libcount."s".$scaffoldcount."b".$j."\n";
print READ2 reverseComplement($values[$j])."\n";
close READ1;
close READ2;
$libcount++;
}

}
$num++;
}

return $contigs;
}
#reverse complement a sequence
sub reverseComplement{
$_ = shift;
tr/ATGC/TACG/;
return (reverse());
}
#end of Breaker

Tuesday, February 22, 2011

convert Fasta files into TIGR .contig file

During the process of converting Fasta files with gaps to AFG format, one step required generation of TIGR .contig format. Given below is a simple script to generate a dummy .contig file which has just one read for each contig which is exactly the same as the contig. This is done due to the lack of read tracking in SOAP, SSPACE etc. and also to avoid the huge file size if all the reads are retained.

 #!/usr/bin/perl  
 open FASTA, "< $ARGV[0]" or die "Can't open $ARGV[0] ($!)\n";  
 while($z=<FASTA>){  
  if($z=~ />/)  
  {  
  chomp $z;  
  $header=(split(/>/,$z))[1];  
  $readid=(split(/_/,$header))[1];  
  $seq=;  
  chomp $seq;  
  $seqlen=length $seq;  
  print "##$header 1 $seqlen bases, 00000000 checksum.\n";  
  print "$seq\n";  
  print "#$readid(0) [] $seqlen bases, 00000000 checksum. {$seqlen 0} <1>\n";  
  print "$seq\n";  
  }  
 }

save it as fasta2contig.pl and use as "perl fasta2contig.pl contig.fa"

Thursday, January 20, 2011

Minimus2 larger datasets with Nucmer

While Blat is much more faster than Nucmer, its results can be very different from the results from Nucmer. A version of Minimus2 which splits the data before feeding it to Nucmer and merging it back before processing further is able to give a result which is almost always identical to the results from Nucmer. The changes required to be done are given below:

NUCMER = nucmer
DELTAFILTER = delta-filter
SHOWCOORDS = show-coords
RUNNUCMER = perl runnucmer.pl
#------------------------------------------------------------------------------#

## Building AMOS bank & Dumping reads
10: rm -fr $(BANK)
11: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)
12: $(BINDIR)/dumpreads $(BANK) -M $(REFCOUNT) > $(REFSEQ)
13: $(BINDIR)/dumpreads $(BANK) -m $(REFCOUNT) > $(QRYSEQ)

## Getting overlaps
20: $(RUNNUCMER) $(OVERLAP) $(REFSEQ) $(QRYSEQ) $(PREFIX)
21: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | $(BINDIR)/nucmerAnnotate | egrep 'BEGIN|END|CONTAIN|IDENTITY' > $(COORDS)
22: $(BINDIR)/nucmer2ovl -ignore $(MAXTRIM) -tab $(COORDS) | $(BINDIR)/sort2 > $(OVLTAB)

The code for runnucmer.pl is given below:

#!/usr/bin/perl

$allowedsize=100000000;

$overlap=$ARGV[0];
$mainref=$ARGV[1];
$mainqry=$ARGV[2];
$mainpre=$ARGV[3];
$reflen=0;
$refcount=1;

$qrylen=0;
$qrycount=1;

#open first ref file
open FILE1, ">>",$mainpre."ref".$refcount or die $!;

#reading the refrence sequences
open FILE2, "<",$mainref or die $!;

$seqst_temp="";
$header = ;#reading the first header

while($z = ){
if($z=~ />/)
{#next sequence
$reflen=$reflen+length $seqst_temp;
if($reflen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$refcount++;
$reflen=0;
close FILE1;
open FILE1, ">>",$mainpre."ref".$refcount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($reflen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$refcount++;
$reflen=0;
close FILE1;
open FILE1, ">>",$mainpre."ref".$refcount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";

close FILE1;
close FILE2;

#open first qry file
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;

#reading the refrence sequences
open FILE2, "<",$mainqry or die $!;

$seqst_temp="";
$header = ;#reading the first header

while($z = ){
if($z=~ />/)
{#next sequence
$qrylen=$qrylen+length $seqst_temp;
if($qrylen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$qrycount++;
close FILE1;
$qrylen=0;
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";
}
else
{
$z =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $z;
}
}#end of File while loop

if($qrylen < $allowedsize)
{
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
else
{
$qrycount++;
close FILE1;
$qrylen=0;
open FILE1, ">>",$mainpre."qry".$qrycount or die $!;
print FILE1 $header;
print FILE1 $seqst_temp;
print FILE1 "\n";
}
$header=$z;
$seqst_temp="";

close FILE1;
close FILE2;

for($i=1;$i<=$qrycount;$i++){
for($j=1;$j<=$refcount;$j++){
$refseq=$mainpre."ref".$j;
$qryseq=$mainpre."qry".$i;
$prefix=$mainpre."nuc".$j."".$i;
system("/bubo/home/h7/nagarjun/glob/tools/MUMmer3.22/nucmer -maxmatch -c $overlap $refseq $qryseq -p $prefix");
}
}

#merging delta files
open FILE1, ">>",$mainpre.".delta" or die $!;
for($i=1;$i<=$qrycount;$i++){
for($j=1;$j<=$refcount;$j++){
open FILE2, "<",$mainpre."nuc".$j."".$i.".delta" or die $!;
if(($i==1)&&($j==1)){while($a=){print FILE1 $a;}}
else{$a=;$a=;while($a=){print FILE1 $a;}}
}
}

Tuesday, January 18, 2011

Blat instead of Nucmer in Minimus2

Minimus2 is a good utility to merge assemblies based on the AMOS architecture. However, it has a limitation on the size of the input datasets. This is due to the Nucmer program is its pipline. Apart from Nucmer other steps are able to handle much bigger chunks of data. So instead of using Nucmer i tried using Blat. Although this gave me assemblies, they were not always correct and complete.

May be someday somebody will use it.In the meantime i try to use Nucmer iteratively on smaller chunks and then merge the output before the next step. This is possible as Nucmer is run with the option -maxmatch in Minimus, making it independent of the other sequences in the dataset.

1)The changes to Minimu2.acf in amos/src/Pipeline/:

BINDIR = /usr/local/bin
NUCMER = /usr/local/nucmer
DELTAFILTER = /usr/local/delta-filter
SHOWCOORDS = /usr/local/show-coords
BLAT = blat
BTON = perl psltoNucmer.pl

#-----------------------------

-------------------------------------------------#

## Building AMOS bank & Dumping reads
10: rm -fr $(BANK)
11: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)
12: $(BINDIR)/dumpreads $(BANK) -M $(REFCOUNT) > $(REFSEQ)
13: $(BINDIR)/dumpreads $(BANK) -m $(REFCOUNT) > $(QRYSEQ)

## Getting overlaps
20: $(BLAT) $(REFSEQ) $(QRYSEQ) $(BOUT) -fastMap -noHead -minIdentity=98
21: $(BTON) $(BOUT) | $(BINDIR)/nucmerAnnotate | egrep 'BEGIN|END|CONTAIN|IDENTITY' > $(COORDS)
22: $(BINDIR)/nucmer2ovl -ignore $(MAXTRIM) -tab $(COORDS) | $(BINDIR)/sort2 > $(OVLTAB)

2) The code for psltoNucmer.pl is given below:

#!/usr/bin/perl

open FILE2, "<",$ARGV[0] or die $!;

while($z = ){
@values=split(/\t/,$z);
print $values[15]."\t".$values[16]."\t|\t".$values[11]."\t".$values[12]."\t|\t".abs($values[16]-$values[15])."\t".abs($values[12]-$values[11])."\t|\t".(sprintf "%.2f",(($values[0]+$values[2]+$values[3])*100)/abs($values[12]-$values[11]))."\t|\t".$values[14]."\t".$values[10]."\t|\t".(sprintf "%.2f",(abs($values[16]-$values[15])*100/$values[14]))."\t".(sprintf "%.2f",(abs($values[12]-$values[11])*100/$values[10]))."\t|\t".$values[13]."\t".$values[9]."\n";
}#end of File while loop

3) blat output format details can be found here : http://genome.ucsc.edu/goldenPath/help/blatSpec.html