Friday, August 22, 2014

Extract sequences from multi fasta file in a particular order


 #!/usr/bin/perl  
 use warnings;  
 #perl extractInOrder.pl test.fasta test.list  
 # Input parameters  
 open FASTA, $ARGV[0] or die $!;  
 my $seqst_temp="";  
 my $seqs = ();  
 while($line = <FASTA>){  
 if($line=~ /^>/){  
 if($header){  
 $header=~s/>//;  
 $seqs{$header}=$seqst_temp;  
 }  
 chomp $line;  
 $header="";  
 $header=$line;  
 $seqst_temp="";  
 }  
 else{  
 $line =~ s/[\n\t\f\r_0-9\s]//g;  
 $seqst_temp .= $line;  
 }  
 }#end of while loop  
 if($header){  
 $header=~s/>//;  
 $seqs{$header}=$seqst_temp;  
 }  
 close FASTA;  
 open FASTA, $ARGV[1] or die $!;  
 while($line = <FASTA>){  
 chomp $line;  
 $line=~s/>//;  
 if(exists $seqs{$line}){print ">$line\n$seqs{$line}\n";}  
 else {  
 print "Sequence header $line does not exist in fasta\n";  
 exit();  
 }  
 }  
 close FASTA;  

Merge multiple kmer count hashes into one

A previous attempt at merging two kmer count hashes was neither memory efficient nor capable of merging multiple kmer count hashes. Here, we use the age old trick of sorting to write a more memory efficient script that can handle as N number of hashes.


 cat *_"$kmer"_counts.fa|sort > sorted_"$kmer"_all.fa  

The above command will concatenate all the hashes and sort it. This sorted file can then be used by the below perl script to merge the hashes. Since, all kmers that need to be merged are in adjacent lines, the memory needed for merging is drastically reduced compared to the previous script.


 #!/usr/bin/perl  
 use warnings;  
 # Input parameters  
 open FASTA1, $ARGV[0] or die $!;  
 my $previous="Kmer";  
 my $previousCount="Kmercount"; 
 my @jelly;
  while($line = <FASTA1>){  
 chomp $line;  
 @jelly=split(/\s+/,$line);  
      if($previous=~/$jelly[0]/){  
      $previousCount=$previousCount+$jelly[1];  
      }  
      else{  
      print "$previous\t$previousCount\n";  
      $previous=$jelly[0];$previousCount=$jelly[1];  
      }  
 }  
      #printing last line if it needed merging  
      if($previous=~/$jelly[0]/){  
      print "$previous\t$previousCount\n";  
      }  
 close FASTA1;  

Tuesday, August 12, 2014

Merge two kmer count hashes into one

 #!/usr/bin/perl  
 use warnings;  
 # Input parameters  
 open FASTA1, $ARGV[0] or die $!;  
 open FASTA2, $ARGV[1] or die $!;  
 my %seqs = ();  
 while($line = <FASTA1>){  
 chomp $line;  
 my @jelly=split(/\s+/,$line);  
 $seqs{$jelly[0]}=$jelly[1];  
 }  
 close FASTA1;  
 while($line = <FASTA2>){  
 chomp $line;  
 my @jelly=split(/\s+/,$line);  
 if(exists $seqs{$jelly[0]}){$seqs{$jelly[0]}=$seqs{$jelly[0]}+$jelly[1];}  
 else{$seqs{$jelly[0]}=$jelly[1];}  
 }  
 close FASTA2;  
 foreach $iso (sort keys %seqs) {  
 print "$iso\t$seqs{$iso}\n";  
 }  

Will take two lists of kmer counts and merge them into one. A kmer count list consists of two columns. The first column being the kmer itself and the second being its count.

This might be useful in merging the output of a program like jellyfish after running it on each chromosome separately. While jellyfish has a merge function, it requires the hashes to be of equal size.

May be this can be re-written as a perl one liner...

Ka ka ka ka ka ka ka ka

Previous crow stuff has been covered by Jerry Coyne himself. This in turn stimulates lot of debate on what constitutes a species. Fortunately, most people agree that understanding the genetics is more important and relevant than fighting over the definition of species. 

Like the title of this post, the call of a crow can be written as "ka ka ka ka ka" or "kaw kaw kaw kaw" or "Caw caw caw caw". They are all trying to capture the same thing. As long as the underlying genetics is well understood, it should be ok to not worry about the definition for now. Hopefully, over time our understanding of genetics will be so good that we may even come up with a single definition that most of us can agree on.

Monday, August 11, 2014

Get overlapping sequences from multifasta file

 #!/usr/bin/perl  
 #use strict;  
 use warnings;  
 # Input parameters  
 open FASTA, $ARGV[0] or die $!;  
 my $seqst_temp="";  
 my %seqs = ();  
 my $iso="";  
 my $maxlen=0;  
 my $maxval="";  
 while($line = <FASTA>){  
 if($line=~ /^>/){  
 if($header){  
 $seqs{$header}=$seqst_temp;  
 }  
 chomp $line;  
 $header="";  
 $header=$line;  
 $seqst_temp="";  
 }  
 else{  
 $line =~ s/[\n\t\f\r_0-9\s]//g;  
 $seqst_temp .= $line;  
 }  
 }#end of while loop  
 if($header){  
 $seqs{$header}=$seqst_temp;  
 }  
 close FASTA;  
 $maxlen=0;  
 foreach $iso (sort keys %seqs) {  
 my $line1=$iso;  
 my $line2=$seqs{$iso};  
 my $flag=0,$overlap=500,$length=1000;  
 $seqlen=length $line2;  
 while(($seqlen-$flag)>$length){  
 if(($seqlen-($flag+$overlap))<$length){  
 $length=$seqlen-$flag;  
 }  
 $nextseq=substr $line2,$flag,$length;  
 print $line1.":".$flag."-".($flag+$length)."\n";  
 print $nextseq."\n";  
 $flag=$flag+$overlap;  
 }#end of seqlen while loop  
 }