Wednesday, December 1, 2010

Program to shred contigs into reads with overlap

Most assembly programs seem to have some limit on the length of the reads that they take as input for assembling them. Assembling contigs can be a difficult problem in such cases. Shredding the contigs into smaller reads with a 50% or more of overlap should retain the benefit of having assembled the contigs and use it for further assembly.

Here is a small program to shred contigs into 1kb reads with 500 base pair overlap:

#!/usr/bin/perl

open FILE1, $ARGV[0] or die $!;

while($line1 = ){
my $flag=0,$overlap=500,$length=1000;

chomp($line1);

$line2 = ;
$seqlen=length $line2;
chomp($line2);

while(($seqlen-$flag)>$length){
if(($seqlen-($flag+$overlap))<$length){
$length=$seqlen-$flag;
}
$nextseq=substr $line2,$flag,$length;
print $line1.":".$flag."-".($flag+$length)."\n";
print $nextseq."\n";
$flag=$flag+$overlap;

}#end of seqlen while loop


}#end of file while loop

This script works on a multifasta file with header in the first line and the sequence in the second line.

6 comments:

Anonymous said...

Nice post, kind of drawn out though. Really good subject matter though.

Anonymous said...

Wow, that's crazy man. They should really try to do something to fix that.

Anonymous said...

Nice post, kind of drawn out though. Really good subject matter though.

Anonymous said...

there are some error at
$line1 =
$line2 =
can you suggest something

Author said...

the filehandle symbols has been eaten by html. It needs to written with css to be visible.

$line1 =
$line2 =

Anonymous said...

To see whats there in the two lines:

right-click on page >> view source >> look for $line1