Wednesday, March 28, 2012

MD5 file list checker

This program reads a list file in which each line holds an MD5 value and a filename with its correct path, separated by a tab character (or spaces), and checks whether each file's MD5 matches the listed value.

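#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# The list file is given as the first command-line argument.
open(my $list, '<', $ARGV[0]) or die "Can't open list $ARGV[0]: $!\n";
while (my $line = <$list>) {
    chomp $line;
    next if $line =~ /^\s*$/;    # skip blank lines
    # $files[0] is the expected MD5, $files[1] is the file path
    my @files = split(/[ \t]+/, $line);
    open(my $fh, '<', $files[1]) or die "Can't find file $files[1]\n";
    binmode($fh);    # hash the raw bytes
    my $digobj = Digest::MD5->new;
    $digobj->addfile($fh);
    my $digest = $digobj->hexdigest;
    close($fh);
    # compare the whole string; a regex match would also accept substrings
    if (lc($files[0]) ne $digest) {
        print "Md5 does not match for file:$files[1]\n";
    }
    else {
        print "Md5 match for file:$files[1]\n";
        print $digest . "\n" . $files[0] . "\n";
    }
}
close($list);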

It would make sense to have MD5 checks integrated into the OS, with an MD5 list in each folder. Maybe such checks will be added to programs like ftp and sftp, or even to ordinary copy and mv.
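Until something like that exists, a per-folder MD5 list can be produced by hand. The following is only a minimal sketch under a few assumptions: the sub names md5_of_file and md5_list_for_dir and the list name MD5LIST are made up for illustration, only regular files directly inside the folder are covered (no recursion), and the output uses the "MD5, tab, path" layout that the checker above expects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of one file, read in binary mode.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Build "MD5<TAB>path" lines for every regular file directly inside $dir,
# skipping the list file itself ($list_name is a hypothetical name).
sub md5_list_for_dir {
    my ($dir, $list_name) = @_;
    opendir(my $dh, $dir) or die "Can't open directory $dir: $!\n";
    my @names = sort grep { $_ ne $list_name && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return map { md5_of_file("$dir/$_") . "\t$dir/$_\n" } @names;
}

# When a folder is given on the command line, print its list to STDOUT.
print md5_list_for_dir($ARGV[0], 'MD5LIST') if @ARGV;
```

Redirecting the output into a file named MD5LIST inside the folder gives a list that the checker above can read. Paths containing spaces would not survive the checker's whitespace split, so this sketch assumes there are none.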

Sort fasta file

Many programs, such as GATK, require FASTA files to be sorted before use. Here is a rather simple script for the job; it sorts the records alphabetically by header line:

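#!/usr/bin/perl
use strict;
use warnings;

open(my $fasta, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
my %seqs;           # header => { SEQ => sequence }
my $header = "";
my $temp   = "";    # sequence collected for the current header
while (my $line = <$fasta>) {
    if ($line =~ /^>/) {
        # a new record starts: store the previous one first
        if ($header) { $seqs{$header}{SEQ} = $temp; }
        chomp $line;
        $line =~ s/[\s]/_/g;    # whitespace in the header becomes "_"
        $header = $line;
        $temp = "";
    }
    else {
        # drop whitespace, digits and underscores from sequence lines
        $line =~ s/[\s_0-9]//g;
        $temp .= $line;
    }
} # end of while loop
if ($header) { $seqs{$header}{SEQ} = $temp; }    # store the last record
close($fasta);
foreach my $sortemp (sort keys %seqs) {
    print "$sortemp\n";
    print "$seqs{$sortemp}{SEQ}\n";
}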

However, you can find more elegant solutions that use BioPerl at the Wolf/Takebayashi lab.