Sunday, February 8, 2015

Yet another RAD-seq processing tool when genome is available

This is just a simple set of commands to create a RAD-seq processing tool when the genome is available. Since, it accepts reads mapped to genome in a bam file as input, it is capable of handling RAD-seq, exome sequencing as well as low coverage re-sequencing data. Paired-end as well as single end reads are accepted.

 module load samtools  
 module load BEDTools  

 #calculate per site coverage for each individual bam file  
 samtools depth *.sorted.bam > all.coverage  
 #sum the persite coverage across all individuals  
 awk '{for(i=3;i<=NF;i++) t+=$i; print t; t=0}'|sed 's/ /\t/g' all.coverage > persite.coverage  
 #make a histogram of persite coverage across all individuals  
 cut -f 3 persite.coverage |awk '{counts[$1] = counts[$1] + 1; } END { for (val in counts) print val, counts[val]; }'|sort -k1n,1 > persite.coverage.hist  
 #remove sites with more than 200 reads(as inferred from histogram from previous step) per site  
 awk '{for(i=3;i<=NF;i++) t+=$i; if(t<200)print $0; t=0}'|sed 's/ /\t/g' all.coverage > non.repeat.sites  
 #find sites that have atleast 5 reads in 10 individuals  
 awk '{for(i=3;i<=NF;i++) if($i > 5)t++; if(t>10)print $1,$2,$2,t; t=0}' non.repeat.sites|sed 's/ /\t/g' > wellcovered.sites  
 #collapse sites into loci  
 bedtools merge -i wellcovered.sites > wellcovered.sites.bed  
 #call SNP's for selected loci  
 samtools mpileup -l wellcovered.sites.bed -uf ref.fa *.sorted.bam | bcftools view -bvcg - > var.raw.bcf   
 bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf   

No comments: