Nagarjun's blog: Program to shred contigs into reads with overlap

Wednesday, December 1, 2010

Program to shred contigs into reads with overlap

Most assembly programs seem to have some limit on the length of the reads that they take as input for assembling them. Assembling contigs can be a difficult problem in such cases. Shredding the contigs into smaller reads with a 50% or more of overlap should retain the benefit of having assembled the contigs and use it for further assembly.

Here is a small program to shred contigs into 1kb reads with 500 base pair overlap:

#!/usr/bin/perl

open FILE1, $ARGV[0] or die $!;

while($line1 = ){
my $flag=0,$overlap=500,$length=1000;

chomp($line1);

$line2 = ;
$seqlen=length $line2;
chomp($line2);

while(($seqlen-$flag)>$length){
if(($seqlen-($flag+$overlap))<$length){
$length=$seqlen-$flag;
}
$nextseq=substr $line2,$flag,$length;
print $line1.":".$flag."-".($flag+$length)."\n";
print $nextseq."\n";
$flag=$flag+$overlap;

}#end of seqlen while loop

}#end of file while loop

This script works on a multifasta file with header in the first line and the sequence in the second line.

6 comments:

Anonymous said...: Nice post, kind of drawn out though. Really good subject matter though.; December 9, 2010 at 12:58 AM
Anonymous said...: Wow, that's crazy man. They should really try to do something to fix that.; December 10, 2010 at 8:44 AM
Anonymous said...: Nice post, kind of drawn out though. Really good subject matter though.; December 13, 2010 at 6:22 AM
Anonymous said...: there are some error at
$line1 =
$line2 =
can you suggest something; April 18, 2011 at 2:48 AM
Author said...: the filehandle symbols has been eaten by html. It needs to written with css to be visible.

$line1 =
$line2 =; April 20, 2011 at 3:20 AM
Anonymous said...: To see whats there in the two lines:

right-click on page >> view source >> look for $line1; November 25, 2011 at 4:42 PM