Wednesday, March 28, 2012

Sort fasta file

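Many programs, such as GATK, require the sequences in a FASTA file to be sorted before use. Here is a rather simple Perl script for the job. It takes the FASTA file name as its only argument, joins each sequence onto a single line, and prints the records to standard output sorted alphabetically by header: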

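 #!/usr/bin/perl
 use strict;
 use warnings;

 # Input file is the first command-line argument; output goes to STDOUT.
 open(my $fasta, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
 my %seqs;          # header line => { SEQ => concatenated sequence }
 my $header = "";   # header of the record currently being read
 my $temp   = "";   # sequence collected so far for that record
 while (my $line = <$fasta>) {
     if ($line =~ /^>/) {
         # A new record starts: store the previous one first.
         if ($header) { $seqs{$header}{SEQ} = $temp; }
         chomp $line;
         $line =~ s/\s/_/g;   # replace whitespace in the header with underscores
         $header = $line;
         $temp = "";
     }
     else {
         # Strip whitespace, digits and underscores from sequence lines.
         $line =~ s/[\s_0-9]//g;
         $temp .= $line;
     }
 } # end of while loop
 # Store the last record.
 if ($header) { $seqs{$header}{SEQ} = $temp; }
 close $fasta;
 # Print the records sorted alphabetically by header.
 foreach my $sortemp (sort keys %seqs) {
     print "$sortemp\n";
     print "$seqs{$sortemp}{SEQ}\n";
 }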

However, you can find more elegant solutions that use BioPerl at the Wolf/Takebayashi lab.
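
For a rough idea of what a BioPerl-based version can look like, here is a minimal sketch. It is not the Wolf/Takebayashi lab's code, just an illustration that assumes BioPerl is installed. The helper name sort_fasta is made up for this example. Note that it sorts by sequence ID (the first word of the header) rather than by the whole header line, and that BioPerl formats the output itself, so sequence lines may be wrapped rather than joined onto one line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;   # part of the BioPerl distribution

# Hypothetical helper: read FASTA records from the file $path and write
# them, sorted by sequence ID, to the filehandle $out_fh.
# Returns the number of records written.
sub sort_fasta {
    my ($path, $out_fh) = @_;
    my $in  = Bio::SeqIO->new(-file => $path,   -format => 'fasta');
    my $out = Bio::SeqIO->new(-fh   => $out_fh, -format => 'fasta');

    # Load all records into memory, then write them in sorted order.
    my @records;
    while (my $seq = $in->next_seq) {
        push @records, $seq;
    }
    $out->write_seq($_)
        for sort { $a->display_id cmp $b->display_id } @records;
    return scalar @records;
}

# Run only when executed as a script with a file name argument.
if (!caller && @ARGV) {
    sort_fasta($ARGV[0], \*STDOUT);
}
```

Saved under a name of your choice, it would be run the same way as the script above: pass the FASTA file as the only argument and redirect standard output to the sorted file.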
