Thanks to tips for dealing with the GWAS catalog, it would be easy to see how many of the GWAS SNP's are in repeats. However, the GWAS catalog has its very own quirks such as naming "chrX" as "23". Hence, downloading a bed file of the GWAS catalog from the UCSC genome browser makes things much simpler.
Running below commands shows 21,824 GWAS SNP's of which 16,763 are unique. All coordinates are in GRCh38/hg38. The "rmsk" table was downloaded from the UCSC genome browser with 5,520,017 records spanning 1,588,381,100 base pairs. This table comes with annotation for repeats with details such as repeat name, repeat class and repeat family. The table had 15,474 unique repeat names.
cut -f 1-2 UCSC_GWAS.bed |grep -v "^#"|wc -l cut -f 1-2 UCSC_GWAS.bed |grep -v "^#"|sort -u|wc -l
sort -k1,1 -k2,2n UCSC_GWAS.bed > sorted.UCSC_GWAS.bed
sort -k1,1 -k2,2n UCSC_rmsk.bed > sorted.UCSC_GWAS.bed
Both of the UCSC table based bed files are sorted for ease of use. Out of the 21,824 GWAS SNPs 6,618 (5222 unique) overlapped annotated repeats.
intersectBed -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed -wa -wb|cut -f 1,2,3,23,24,25,28,29,30 > repeats.gwas.hits
One can perform a fishers exact test (implemented in bedtools) to check if this overlap/non-overlap is significant, given a particular genome size. One would expect that SNP's associated with traits are less likely to be in repeat regions than expected by chance.
wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
samtools faidx GCA_000001405.15_GRCh38_full_analysis_set.fna
cut -f 1-2 GCA_000001405.15_GRCh38_full_analysis_set.fna.fai|sort -k1,1 -k2,2n > genome.file
bedtools fisher -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed -g genome.file
The left tail of the test is significant. This is in line with our expectation that the probability of a site located in a repeat to be not associated with a trait is higher than a site not located in a repeat region.
# Contingency Table
#_________________________________________
# | not in -b | in -b |
# not in -a | 1623119853 | 11545 |
# in -a | 1586321308 | 5222 |
#_________________________________________
# p-values for fisher's exact test
left right two-tail ratio
0.00000 1.00000 0.00000 0.463
Apart from the fisher's test, bedtools has a utility called "jaccard" that provides a single statistic measuring the similarity of the two sets. This statistic ranges from 0 to 1, with 1 being complete identity between the two sets. In the comparison of repeats and GWAS SNP's the score is very small, showing that the sets are not similar.
bedtools jaccard -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed
intersection union-intersection jaccard n_intersections
5222 1586338075 3.29186e-06 5221
Comparing the prevalence of GWAS SNP's in repeats with different levels of divergence from the repeat consensus sequence can tell us some useful things. For example, one would expect the GWAS SNP's to be in regions with higher divergence from the consensus sequence, both due to the ability of these regions to have acquired a function as well as our ability to detect associations in these regions. However, it might be harder to assemble and annotate the younger repeat classes compared to older more diverged repeats. Hence, a comparison of the repeat classes might not be ideal.
intersectBed -b sorted.UCSC_GWAS.bed -a repeat.div.bed -wa > repeat.div.intersect
We created a bed file with the sequence divergence from the repeat consensus using the data available on the repeatmasker website. This was intersected with the GWAS SNP's. We calculated the lengths of the genome that each
level of sequence divergence occupies. We then compared the number of
GWAS SNP's located in each of these levels as a fraction.
read.table(file="repeat.div.bed",header=FALSE)->M
abs(M$V3-M$V2)->M$V5
round(M$V4)->M$V6
aggregate(M$V5,by=list(M$V6),sum)->C
read.table(file="repeat.div.intersect",header=FALSE)->N
abs(N$V3-N$V2)->N$V5
round(N$V4)->N$V6
aggregate(N$V5,by=list(N$V6),sum)->R
merge(C,R,by="Group.1",all=TRUE)->CR
CR$x.y/CR$x.x->CRR
barplot(t(data.matrix(CRR)),beside=TRUE,col=c("darkblue"))
quantile(CRR,na.rm=TRUE)
0% 25% 50% 75% 100%
2.221916e-05 1.114180e-03 2.097250e-03 2.967695e-03 4.996991e-03
From the above quantiles, it appears that the third and second quantile have most of the GWAS SNP's.
> shapiro.test(CRR)
Shapiro-Wilk normality test
data: CRR
W = 0.971, p-value = 0.302
Doing the shapiro test, we are not able to reject the null hypothesis that the data are normally distributed.Based on this, one might say that irrespective of the "age" of the repeat, we are able to detect GWAS SNP's located in them.
The actual number of GWAS SNP's in each of the repeat classes is given below:
The actual number of GWAS SNP's in each of the repeat classes is given below:
DNA DNA? LINE Low_complexity LTR
667 1 3200 13 1428
LTR? RC rRNA Satellite scRNA
8 2 1 7 2
Simple_repeat SINE snRNA Unknown
100 1175 2 12
Number of GWAS SNP's by repeat family looks like below:
5S-Deu-L2 Alu centr CR1 DNA
3 322 5 103 4
DNA? Dong-R4 ERV1 ERV1? ERVK
1 6 417 1 22
ERVL ERVL-MaLR ERVL? Gypsy Gypsy?
401 532 6 35 7
hAT hAT-Ac hAT-Blackjack hAT-Charlie hAT-Tip100
2 2 17 318 67
hAT-Tip100? Helitron L1 L2 Low_complexity
1 2 1976 1066 13
LTR LTR? MIR PiggyBac PiggyBac?
7 8 841 1 2
rRNA RTE-BovB RTE-X Satellite scRNA
1 15 34 2 2
Simple_repeat snRNA TcMar-Mariner TcMar-Tc2 TcMar-Tigger
100 2 3 4 246
tRNA tRNA-Deu tRNA-RTE Unknown
3 2 4 12