Wednesday, December 1, 2010

Program to shred contigs into reads with overlap

Most assembly programs seem to have some limit on the length of the reads that they take as input for assembling them. Assembling contigs can be a difficult problem in such cases. Shredding the contigs into smaller reads with a 50% or more of overlap should retain the benefit of having assembled the contigs and use it for further assembly.

Here is a small program to shred contigs into 1kb reads with 500 base pair overlap:

#!/usr/bin/perl

open FILE1, $ARGV[0] or die $!;

while($line1 = ){
my $flag=0,$overlap=500,$length=1000;

chomp($line1);

$line2 = ;
$seqlen=length $line2;
chomp($line2);

while(($seqlen-$flag)>$length){
if(($seqlen-($flag+$overlap))<$length){
$length=$seqlen-$flag;
}
$nextseq=substr $line2,$flag,$length;
print $line1.":".$flag."-".($flag+$length)."\n";
print $nextseq."\n";
$flag=$flag+$overlap;

}#end of seqlen while loop


}#end of file while loop

This script works on a multifasta file with header in the first line and the sequence in the second line.