ALLPATHS-LG had the DEDUP option turned off by default prior to release 42726. So this bit of code identifies exact duplicates either on the same or negative strand. Many of my runs had 30 to 60 Kb of sequence that was duplicated with almost equal amounts on both strands.
Although this does the job, its ugly in having the same bit of code used twice, once within the loop and once after the while loop. May be the end of file can be handled more elegantly.
#!/usr/bin/perl
use warnings;
# Input parameters
open FASTA, $ARGV[0] or die $!;
my $seqst_temp="";
my $seqs = {GENENAME =>my $genename,LEN =>my $qcom};
while($line = <FASTA>){
if($line=~ /^>/){
if($header){
if(exists $seqs{$seqst_temp}{GENENAME}){print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n";}
$rseqst_temp = $seqst_temp;
$rseqst_temp=revcomp($rseqst_temp);
if(exists $seqs{$rseqst_temp}{GENENAME}){print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n";}
$seqs{$seqst_temp}{GENENAME}=$header;
$seqs{$seqst_temp}{LEN}=length $seqst_temp;
}
chomp $line;
$header="";
$header=$line;
$seqst_temp="";
$rseqst_temp="";
}
else{
$line =~ s/[\n\t\f\r_0-9\s]//g;
$seqst_temp .= $line;
}
}#end of while loop
if($header){
if(exists $seqs{$seqst_temp}{GENENAME}){print "$seqs{$seqst_temp}{GENENAME}\t$header\t$seqs{$seqst_temp}{LEN}\n";}
$rseqst_temp = $seqst_temp;
$rseqst_temp=revcomp($rseqst_temp);
if(exists $seqs{$rseqst_temp}{GENENAME}){print "$seqs{$rseqst_temp}{GENENAME}\t$header\t$seqs{$rseqst_temp}{LEN}\treverse\n";}
$seqs{$seqst_temp}{GENENAME}=$header;
$seqs{$seqst_temp}{LEN}=length $seqst_temp;
}
close FASTA;
sub revcomp{
my $input = shift;
my $revcomp = reverse($input);
$revcomp =~ tr/ACGTacgt/TGCAtgca/;
return $revcomp;
}
Although this does the job, its ugly in having the same bit of code used twice, once within the loop and once after the while loop. May be the end of file can be handled more elegantly.
No comments:
Post a Comment