A recent map of positive selection in the human genome used derived allele frequencies (DAFs) alongside various other selection-scan statistics. Here we calculate the mean DAF in 50 kb windows for each chromosome and examine how it correlates with CNV density.
Each step and the commands used are documented here:
Use mysql to query the UCSC Genome Browser tables directly and download the ancestral state of human SNPs in chimp, orangutan, and macaque, along with the allele frequencies.
for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX
do
echo "SELECT t2.chrom,t2.chromStart,t2.chromEnd,t2.name,t2.humanObserved,t2.humanAllele,t2.chimpAllele,t2.orangAllele,t2.macaqueAllele,t1.alleles,t1.alleleFreqs FROM hg19.snp141 AS t1, hg19.snp141OrthoPt4Pa2Rm3 AS t2 WHERE t1.name = t2.name AND t2.chrom='$i'"|mysql --user=genome --host=genome-mysql.cse.ucsc.edu -B -A > human."$i".csv
done
Next, we decide on the ancestral state of each site using parsimony. Sites where chimp, orangutan, and macaque all agree are easy to resolve. Similarly, when either the first two (chimp and orangutan) or the last two (orangutan and macaque) agree, the ancestral state can be decided. In all other cases, we discard the site altogether. Methods that use maximum likelihood to infer the ancestral state have also been published.
#!/usr/bin/perl
use strict;
use warnings;

## Input: the per-chromosome CSV downloaded from UCSC
my $ancfreq = $ARGV[0];
open(my $fh, '<', $ancfreq) or die "Cannot open $ancfreq: $!";
my $header = <$fh>;    # skip the header line

my ($found3, $foundfirst2, $foundlast2, $notfound) = (0, 0, 0, 0);

while (my $line = <$fh>) {
    # The alleles and alleleFreqs columns are comma-separated with a
    # trailing comma, so converting commas to tabs shifts the two
    # frequencies to fields 12 and 13 after the split.
    $line =~ s/,/\t/g;
    my @parts  = split(/\t/, $line);
    my $chrom  = $parts[0];
    my $start  = $parts[1];
    my $end    = $parts[2];
    my $chimpA = $parts[6];
    my $orangA = $parts[7];
    my $macA   = $parts[8];

    # Collapse repeated characters; a single remaining character means
    # the species agree on the ancestral allele.
    my $anctype = $chimpA . $orangA . $macA;
    $anctype =~ s/(.)(?=.*?\1)//g;
    if (length($anctype) == 1) { $found3++; }    # all three agree
    else {
        $anctype = $chimpA . $orangA;
        $anctype =~ s/(.)(?=.*?\1)//g;
        if (length($anctype) == 1) { $foundfirst2++; }    # chimp and orangutan agree
        else {
            $anctype = $orangA . $macA;
            $anctype =~ s/(.)(?=.*?\1)//g;
            if (length($anctype) == 1) { $foundlast2++; }    # orangutan and macaque agree
            else { $anctype = "N"; $notfound++; }            # unresolved: discard below
        }
    }

    my $a1     = $parts[9];
    my $a2     = $parts[10];
    my $a1freq = $parts[12];
    my $a2freq = $parts[13];

    # The derived allele is the one that does not match the ancestral state.
    my $derivedfreq = 0;
    if    ($anctype =~ m/$a2/) { $derivedfreq = $a1freq; }
    elsif ($anctype =~ m/$a1/) { $derivedfreq = $a2freq; }
    if ($derivedfreq > 0) {
        print "$chrom\t$start\t$end\t$derivedfreq\t$chimpA\t$orangA\t$macA\t$anctype\n";
    }
}
close($fh);
#hg19.snp141OrthoPt4Pa2Rm3.chrom hg19.snp141OrthoPt4Pa2Rm3.chromStart hg19.snp141OrthoPt4Pa2Rm3.chromEnd hg19.snp141OrthoPt4Pa2Rm3.name hg19.snp141OrthoPt4Pa2Rm3.humanObserved hg19.snp141OrthoPt4Pa2Rm3.humanAllele hg19.snp141OrthoPt4Pa2Rm3.chimpAllele hg19.snp141OrthoPt4Pa2Rm3.orangAllele hg19.snp141OrthoPt4Pa2Rm3.macaqueAllele hg19.snp141.alleles hg19.snp141.alleleFreqs
After deciding on the ancestral state, the script prints the derived allele frequency for each site that is retained. The mean DAF per window can then be calculated with the bedtools map command.
for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX
do
bedtools makewindows -g Homo_sapiens.GRCh37.57.dna.concat.fa.fai -w 50000|awk -v c="$i" '$1==c{print $0}' > "$i".50k.bed
perl Afreqbed.pl human."$i".csv > h."$i".bed
bedtools map -a "$i".50k.bed -b h."$i".bed -c 4 -o mean > h."$i".mean.bed
done
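As a quick sanity check on what `bedtools map -o mean` reports, the per-window mean can be reproduced with plain awk on a few hypothetical DAF records (the positions and frequencies below are made up for illustration):

```shell
# Three hypothetical DAF records that would all fall in one 50 kb window;
# column 4 is the derived allele frequency printed by Afreqbed.pl.
# Summing column 4 and dividing by the record count gives the window mean.
printf 'chr2\t100\t101\t0.2\nchr2\t200\t201\t0.4\nchr2\t300\t301\t0.6\n' |
  awk '{s += $4; n++} END {printf "%.1f\n", s/n}'
# prints 0.4 -- the value bedtools map -o mean would emit for that window
```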
The mysql interface to the UCSC Genome Browser makes it easy to download all known structural variants from numerous publications.
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -B -A -e "SELECT * FROM hg19.dgvMerged;" > human_dgvMerged.csv
Retaining just the coordinates (along with the length) of each CNV, we create a BED file with the command below:
grep -v "^bin" human_dgvMerged.csv|awk '{print $2,$3,$4,($4-$3)}'|sed 's/ /\t/g' > dgv.bed
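To see what that pipeline does, here it is applied to a single fabricated dgvMerged row (the variant ID and coordinates are invented for illustration):

```shell
# One fabricated row of the dgvMerged dump: bin, chrom, start, end, name.
# awk drops the bin column and appends the variant length (end - start);
# sed then converts awk's space separators back to tabs for BED.
printf '585\tchr1\t10000\t20000\tnsv1\n' |
  awk '{print $2,$3,$4,($4-$3)}' | sed 's/ /\t/g'
# prints (tab-separated): chr1  10000  20000  10000
```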
Now that everything is in BED format, use the "map" command from bedtools to calculate the count of CNVs per window and plot it for chromosome 2.
for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX
do
bedtools map -a "$i".50k.bed -b dgv.bed -c 3,4 -o count,sum > h."$i".countdgv.bed
done
chr <- "chr2"
jpeg(paste("DAF.", chr, ".jpeg", sep=""), width=1420)
par(mfrow=c(2,1))

# Mean DAF per 50 kb window (column V4 of the bedtools map output)
M <- read.table(file=paste("h.", chr, ".mean.bed", sep=""), header=FALSE, stringsAsFactors=FALSE)
plot(as.numeric(M$V2), as.numeric(M$V4), xlab="Position along chromosome",
     ylab="Mean derived allele frequency", main=chr)

# Annotate notable regions of chromosome 2
lines(c(92326171,95326171), c(0.6,0.6), col="red", lwd=5)
text(92326171, 0.65, labels="Centromere", col="red")
lines(c(243189373,243199373), c(0.5,0.5), col="blue", lwd=5)
text(243189373, 0.55, labels="Telomere", col="blue")
lines(c(0,10000), c(0.5,0.5), col="blue", lwd=5)
text(0, 0.55, labels="Telomere", col="blue")
lines(c(90545103,91545103), c(0.7,0.7), col="green", lwd=5)
text(90545103, 0.75, labels="Heterochromatin", col="green")
lines(c(136545415,136594750), c(0.6,0.6), col="brown", lwd=5)
text(136545415, 0.65, labels="LCT gene", col="brown")

# Count of known structural variants per window (column V4)
C <- read.table(file=paste("h.", chr, ".countdgv.bed", sep=""), header=FALSE)
plot(C$V2, C$V4, xlab="Position along chromosome",
     ylab="Count of known structural variants", main=chr)

# Spearman correlation between mean DAF and CNV count
cor.test(as.numeric(M$V4), C$V4, method="spearman")
dev.off()
We see a positive correlation between the mean derived allele frequency and the number of known CNVs per 50 kb window, with a Spearman's rho of 0.2496694.