Monday, November 10, 2014

Human DAF - Derived Allele Frequency plots

A map of recent positive selection in Human genome calculated DAF's apart from various other selection scan statistics. Here we calculate mean DAF's in 50Kb windows for each chromosome and see how it  correlates with CNV density.

Each step and the commands used are documented here:

Use mysql to directly query the UCSC genome browser tables to download the ancestral state of Human SNP's in Chimp, Orangutan and Macaque along with the allele frequencies.

 for i in chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr22 chrX  
 echo "SELECT t2.chrom,t2.chromStart,t2.chromEnd,,t2.humanObserved,t2.humanAllele,t2.chimpAllele,t2.orangAllele,t2.macaqueAllele,t1.alleles,t1.alleleFreqs FROM hg19.snp141 AS t1, hg19.snp141OrthoPt4Pa2Rm3 AS t2 WHERE = AND t2.chrom='$i'"|mysql --user=genome -B -A > human."$i".csv  

Next, we decide on the ancestral state of each site using parsimony. Sites in which Chimp, Orangutan and Macaque agree are easy to resolve. Similarly, when either the first two (Chimp and Orangutan) or the last two (Orangutan and Macaque) agree, the ancestral state can be decided. In all other cases, we discard the site altogether. Other methods that use Maximum likelihood to infer ancestral state have also been used.
 ## Input parameters  
 my $ancfreq = $ARGV[0];  
 open(FILE1, $ancfreq);  
 $header=<FILE1>;#read header  
 while($header=<FILE1>) {  
      my @parts=split(/\t/,$header);  
 my $chrom=$parts[0];  
 my $start=$parts[1];  
 my $end=$parts[2];  
 my $chimpA=$parts[6];  
 my $OrgangA=$parts[7];  
 my $MacA=$parts[8];  
 my $anctype= $chimpA. $OrgangA. $MacA;$anctype=~s/(.)(?=.*?\1)//g;  
  if((length $anctype)==1){$found3++;}  
             $anctype= $chimpA. $OrgangA;$anctype=~s/(.)(?=.*?\1)//g;  
             if((length $anctype)==1){$foundfirst2++;}  
                 $anctype= $OrgangA. $MacA;$anctype=~s/(.)(?=.*?\1)//g;  
                 if((length $anctype)==1){$foundlast2++;}  
                 elsif((length $anctype)>2){$trial++;$anctype="N";}  
         }#end of else  
 my $a1=$parts[9];  
 my $a2=$parts[10];  
 my $a1freq=$parts[12];  
 my $a2freq=$parts[13];  
 my $derivedfreq=0;  
 if($derivedfreq>0){print "$chrom\t$start\t$end\t$derivedfreq\t$chimpA\t$OrgangA\t$MacA\t$anctype\n";}  
 close FILE1;  
 #hg38.snp141OrthoPt4Pa2Rm3.chrom    hg38.snp141OrthoPt4Pa2Rm3.chromStart  hg38.snp141OrthoPt4Pa2Rm3.chromEnd hg38.snp141OrthoPt4Pa2Rm3.humanObserved hg38.snp141OrthoPt4Pa2Rm3.humanAllele hg38.snp141OrthoPt4Pa2Rm3.chimpAllele  hg38.snp141OrthoPt4Pa2Rm3.orangAllele  hg38.snp141OrthoPt4Pa2Rm3.macaqueAllele hg38.snp141.alleles   hg38.snp141.alleleFreqs  

After deciding on the ancestral state, the derived allele frequency is printed for each of the sites that is retained. The mean DAF can then be easily calculated using the map command in bedtools.

 for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX  
 bedtools makewindows -g Homo_sapiens.GRCh37.57.dna.concat.fa.fai -w 50000|awk -v c="$i" '$1==c{print $0}' > "$i".50k.bed  
 perl human."$i".csv > h."$i".bed  
 bedtools map -a "$i".50k.bed -b h."$i".bed -c 4 -o mean > h."$i".mean.bed  

The mysql interface to UCSC genome browser makes it easy to download all know structural variants from numerous publications.

 mysql --user=genome -B -A -e "SELECT * FROM hg19.dgvMerged;" > human_dgvMerged.csv    
Just retaining the co-ordinates (along with length) of the CNV's we create a bed file with below command:

 grep -v "^bin" human_dgvMerged.csv|awk '{print $2,$3,$4,($4-$3)}'|sed 's/ /\t/g' >dgv.bed  

Now that we have everything in the bed format, use the "map" command from bedtools to calculate count of CNV's and plot it for chromosome 2.

 for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX  
 bedtools map -a "$i".50k.bed -b dgv.bed -c 3,4 -o count,sum > h."$i".countdgv.bed  

 chr <- "chr2";  
 plot(as.numeric(M$V2),as.numeric(M$V4),xlab="Position along chromosome",ylab="Mean derived allele Frequency",main=chr)  
 text(92326171, 0.65,labels="Centromere",col="red")  
 text(243189373, 0.55,labels="Telomere",col="blue")  
 text(0, 0.55,labels="Telomere",col="blue")  
 text(90545103, 0.75,labels="heterochromatin",col="green")  
 text(136545415, 0.65,labels="LCT gene",col="brown")  
 plot(C$V2,C$V4,xlab="Position along chromosome",ylab="Count of known structural variants",main=chr)  

We see a positive correlation between mean derived allele frequency and number of known CNV's per 50Kb window with a rho of 0.2496694.

No comments: