Tuesday, October 19, 2010

Handling duplicate accession numbers created by flowsim with sfffile command

Having simulated loads of 454 data with Flowsim, I wanted to assemble everything. Unfortunately, Flowsim produces one output .sff file per input fasta file. In my case there were some 100,000 fasta sequences and hence 100,000 .sff files.

Newbler (the assembler for 454 data) simply crashed when asked to assemble data spread across that many files. It did work with up to 500 .sff files, though.

When all the files were given as input to sfffile to be combined into a single file, it failed with a cryptic "segmentation fault" error. It seemed as though all the data generated by Flowsim would be wasted. As a last resort, I wrote a script to combine the .sff files two at a time.

The script crashed after running for about 100 files. The culprit was Flowsim: it had generated sequences with the same accession number across different runs. sfffile was failing with "Error: Duplicate accession chr1:56036789..56039961_1148- found. Unable to write SFF file.". On further investigation it became clear that duplicate accession numbers had occurred rather frequently. Later versions of Flowsim may have fixed this issue (I used version 0.1).
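In hindsight, the duplicates could have been spotted before combining anything: dump every accession number to one list and look for repeats. A minimal sketch of the check below, with made-up sample accessions standing in for the real dump (with actual .sff files the list would come from sffinfo's accession-only output, one accession per line):

```shell
# Pool all accession numbers into one file, then report any repeats.
# The file name and the sample accessions here are invented for the demo.
printf 'chr1:100..200_1+\nchr1:300..400_2-\n' >  accessions.txt
printf 'chr1:500..600_3+\nchr1:300..400_2-\n' >> accessions.txt

# uniq -d prints only lines that occur more than once in sorted input
sort accessions.txt | uniq -d   # -> chr1:300..400_2-
```

A non-empty result means sfffile is guaranteed to choke on that accession sooner or later.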

Since the data had already been generated, it had to be "handled". sfffile does not have a flag to ignore duplicate accession numbers, so a simple Perl script that checks whether sfffile failed in the previous step takes care of them. Perl script below:

 use strict; use warnings;
 my $startfile = "combined1.sff";
 for my $count (2 .. 100000) {
     my $pfilename = "combined" . ($count - 1) . ".sff";
     my $filename  = "combined$count.sff";
     # fall back to the last successfully combined file if the
     # previous sfffile run failed on a duplicate accession
     my $base = -e $pfilename ? $pfilename : $startfile;
     print "sfffile -o $filename $base myreads.fasta.$count.sff\n";
     system("sfffile -o $filename $base myreads.fasta.$count.sff");
     $startfile = $filename if -e $filename;
 }