Wednesday, December 1, 2010

Program to shred contigs into reads with overlap

Most assembly programs seem to have some limit on the length of the reads that they take as input for assembling them. Assembling contigs can be a difficult problem in such cases. Shredding the contigs into smaller reads with a 50% or more of overlap should retain the benefit of having assembled the contigs and use it for further assembly.

Here is a small program to shred contigs into 1kb reads with 500 base pair overlap:

#!/usr/bin/perl

open FILE1, $ARGV[0] or die $!;

while($line1 = ){
my $flag=0,$overlap=500,$length=1000;

chomp($line1);

$line2 = ;
$seqlen=length $line2;
chomp($line2);

while(($seqlen-$flag)>$length){
if(($seqlen-($flag+$overlap))<$length){
$length=$seqlen-$flag;
}
$nextseq=substr $line2,$flag,$length;
print $line1.":".$flag."-".($flag+$length)."\n";
print $nextseq."\n";
$flag=$flag+$overlap;

}#end of seqlen while loop


}#end of file while loop

This script works on a multifasta file with header in the first line and the sequence in the second line.

6 comments:

  1. Nice post, kind of drawn out though. Really good subject matter though.

    ReplyDelete
  2. Wow, that's crazy man. They should really try to do something to fix that.

    ReplyDelete
  3. Nice post, kind of drawn out though. Really good subject matter though.

    ReplyDelete
  4. there are some error at
    $line1 =
    $line2 =
    can you suggest something

    ReplyDelete
  5. the filehandle symbols has been eaten by html. It needs to written with css to be visible.

    $line1 =
    $line2 =

    ReplyDelete
  6. To see whats there in the two lines:

    right-click on page >> view source >> look for $line1

    ReplyDelete