Friday, October 31, 2014

Single nucleotide polymorphisms that are associated with traits based on genome wide studies in human populations are under-represented in known repeat regions

Thanks to tips for dealing with the GWAS catalog, it would be easy to see how many of the GWAS SNP's are in repeats. However, the GWAS catalog has its very own quirks such as naming "chrX" as "23". Hence, downloading a bed file of the GWAS catalog from the UCSC genome browser makes things much simpler.

Running below commands shows 21,824 GWAS SNP's of which 16,763 are unique. All coordinates are in GRCh38/hg38. The "rmsk" table was downloaded from the UCSC genome browser with 5,520,017 records spanning 1,588,381,100 base pairs. This table comes with annotation for repeats with details such as repeat name, repeat class and repeat family. The table had 15,474 unique repeat names.

 cut -f 1-2 UCSC_GWAS.bed |grep -v "^#"|wc -l  
 cut -f 1-2 UCSC_GWAS.bed |grep -v "^#"|sort -u|wc -l  
sort -k1,1 -k2,2n UCSC_GWAS.bed > sorted.UCSC_GWAS.bed
sort -k1,1 -k2,2n UCSC_rmsk.bed > sorted.UCSC_GWAS.bed

Both of the UCSC table based bed files are sorted for ease of use. Out of the 21,824 GWAS SNPs 6,618 (5222 unique) overlapped annotated repeats.

 intersectBed -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed -wa -wb|cut -f 1,2,3,23,24,25,28,29,30 > repeats.gwas.hits  

One can perform a fishers exact test (implemented in bedtools) to check if this overlap/non-overlap is significant, given a particular genome size. One would expect that SNP's associated with traits are less likely to be in repeat regions than expected by chance.

 gunzip GCA_000001405.15_GRCh38_full_analysis_set.fna.gz  
 samtools faidx GCA_000001405.15_GRCh38_full_analysis_set.fna  
 cut -f 1-2 GCA_000001405.15_GRCh38_full_analysis_set.fna.fai|sort -k1,1 -k2,2n > genome.file  
 bedtools fisher -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed -g genome.file  

The left tail of the test is significant. This is in line with our expectation that the probability of a site located in a repeat to be not associated with a trait is higher than a site not located in a repeat region.

 # Contingency Table  
 #      | not in -b  | in -b    |  
 # not in -a | 1623119853  | 11545    |  
 #   in -a | 1586321308  | 5222     |  
 # p-values for fisher's exact test  
 left  right  two-tail    ratio  
 0.00000 1.00000 0.00000 0.463  

Apart from the fisher's test, bedtools has a utility called "jaccard" that provides a single statistic measuring the similarity of the two sets. This statistic ranges from 0 to 1, with 1 being complete identity between the two sets. In the comparison of repeats and GWAS SNP's the score is very small, showing that the sets are not similar.

 bedtools jaccard -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed   
 intersection     union-intersection     jaccard     n_intersections  
 5222     1586338075     3.29186e-06     5221  

Comparing the prevalence of GWAS SNP's  in repeats with different levels of divergence from the repeat consensus sequence can tell us some useful things. For example, one would expect the GWAS SNP's to be in regions with higher divergence from the consensus sequence, both due to the ability of these regions to have acquired a function as well as our ability to detect associations in these regions.  However, it might be harder to assemble and annotate the younger repeat classes compared to older more diverged repeats. Hence, a comparison of the repeat classes might not be ideal.

 intersectBed -b sorted.UCSC_GWAS.bed -a repeat.div.bed -wa > repeat.div.intersect  

We created a bed file with the sequence divergence from the repeat consensus using the data available on the repeatmasker website. This was intersected with the GWAS SNP's. We calculated the lengths of the genome that each level of sequence divergence occupies. We then compared the number of GWAS SNP's located in each of these levels as a fraction.

      0%     25%     50%     75%     100%   
 2.221916e-05 1.114180e-03 2.097250e-03 2.967695e-03 4.996991e-03   

From the above quantiles, it appears that the third and second quantile have most of the GWAS SNP's.

 > shapiro.test(CRR)  
      Shapiro-Wilk normality test  
 data: CRR  
 W = 0.971, p-value = 0.302  

Doing the shapiro test, we are not able to reject the null hypothesis that the data are normally distributed.Based on this, one might say that irrespective of the "age" of the repeat, we are able to detect GWAS SNP's located in them.

The actual number of GWAS SNP's in each of the repeat classes is given below:
       DNA      DNA?      LINE Low_complexity      LTR   
       667       1      3200       13      1428   
      LTR?       RC      rRNA   Satellite     scRNA   
        8       2       1       7       2   
  Simple_repeat      SINE     snRNA    Unknown   
       100      1175       2       12   

Number of GWAS SNP's by repeat family looks like below:

    5S-Deu-L2      Alu     centr      CR1      DNA   
        3      322       5      103       4   
      DNA?    Dong-R4      ERV1     ERV1?      ERVK   
        1       6      417       1       22   
      ERVL   ERVL-MaLR     ERVL?     Gypsy     Gypsy?   
       401      532       6       35       7   
       hAT     hAT-Ac hAT-Blackjack  hAT-Charlie   hAT-Tip100   
        2       2       17      318       67   
   hAT-Tip100?    Helitron       L1       L2 Low_complexity   
        1       2      1976      1066       13   
       LTR      LTR?      MIR    PiggyBac   PiggyBac?   
        7       8      841       1       2   
      rRNA    RTE-BovB     RTE-X   Satellite     scRNA   
        1       15       34       2       2   
  Simple_repeat     snRNA TcMar-Mariner   TcMar-Tc2  TcMar-Tigger   
       100       2       3       4      246   
      tRNA    tRNA-Deu    tRNA-RTE    Unknown   
        3       2       4       12   

Wednesday, October 1, 2014

Drosophila transcriptome reveals interesting patterns in splicing diversity

The Drosophila transcriptome has been analysed in different conditions to understand its diversity. Here, we re-analyse the supplementary materials(only from supplementary table-3) and find "interesting" patterns in splicing diversity.

 #calculate percentage of genes with more than 1 isoform  
 length(S$transcripts[S$transcripts>1])*100/ length(S$transcripts)  
 #calculate percentage of genes with more than 1 protein  
 length(S$proteins[S$proteins>1])*100/ length(S$proteins)  

Over half of the 17,564 annotated genes (10,136; 57%) have more than one transcript isoform. However, only 37% (6,584) of the genes have more than one protein, suggesting that most of the transcript diversity is outside the protein coding regions. While the genes have on average 17.39 transcripts, only 2.59 proteins are found on average per gene. 

This contrast between the number of transcripts and proteins is very pronounced in genes with very large number of transcripts. Infact, the gene "gish" which has 18,972 transcripts has only 142 proteins. See below plot that shows the number of records with the words "exon" and "CDS" from the "gish" gene in the new annotation.

The above figure clearly shows that most "exon" records occur at the beginning and end of the gene. The "CDS" records are restricted to the core of the gene. The same pattern is better captured by the authors of the study using the new metric `per cent spliced in’ (Ψ) index.