Tuesday, December 16, 2014

More than 10% of structural variants in the human genome are in potentially problematic regions

535,461,456 bases, which correspond to 16.22% of the genome, are considered "problematic regions in the genome assembly". Looking at the structural variants identified by previous studies and available in the UCSC tables, we find that 26,934 (13.3% of reported structural variants) overlap these problematic regions.

                                   Study Overlapping Total         Percent_overlapping_problematic_regions
1        1000 Genomes Consortium Phase 1   2067  20554  10.0564367
2  1000 Genomes Consortium Pilot Project    322   7967   4.0416719
3                         Ahn et al 2009    825   3704  22.2732181
4                       Alkan et al 2009    139    177  78.5310734
5                   Altshuler et al 2010    135    844  15.9952607
6                        Arlt et al 2011    347   2522  13.7589215
7                    Banerjee et al 2011      4    733   0.5457026
8                     Bentley et al 2008    405   5646   7.1732200
9                    Campbell et al 2011    281   1180  23.8135593
10                     Conrad et al 2006     89    672  13.2440476
11                     Conrad et al 2009   1144   8489  13.4762634
12                     Cooper et al 2008    135    311  43.4083601
13                   de Smith et al 2007    303   1492  20.3083110
14                       Feuk et al 2005      0      3   0.0000000
15                   Forsberg et al 2012      0      2   0.0000000
16                     Giglio et al 2002      0      1   0.0000000
17                      Gusev et al 2009     15    211   7.1090047
18                      Hinds et al 2006      2     87   2.2988506
19                    Iafrate et al 2004     56    187  29.9465241
20                     Itsara et al 2009    495   4806  10.2996255
21                  Jakobsson et al 2008    165   1424  11.5870787
22                         Ju et al 2010    321   1289  24.9030256
23                       Kidd et al 2008    934   7380  12.6558266
24                       Kidd et al 2010      1     20   5.0000000
25                      Kidd et al 2010b    179    739  24.2219215
26                        Kim et al 2009    126   1300   9.6923077
27                     Korbel et al 2007    411    974  42.1971253
28                       Levy et al 2007    839  10146   8.2692687
29                      Locke et al 2006    233    353  66.0056657
30                  McCarroll et al 2006    100    531  18.8323917
31                  McCarroll et al 2008    282   1313  21.4775324
32                   McKernan et al 2009    581   6925   8.3898917
33                      Mills et al 2006    332   5321   6.2394287
34                       Pang et al 2010    652   6096  10.6955381
35                       Park et al 2010    844   5747  14.6859231
36                      Perry et al 2008    991   2889  34.3025268
37                     Perry et al 2008b    175    341  51.3196481
38                      Pinto et al 2007    304   1029  29.5432459
39                      Redon et al 2006   1372   3308  41.4752116
40                   Schuster et al 2010     90    186  48.3870968
41                      Sebat et al 2004     36     77  46.7532468
42                     Shaikh et al 2009   1133  12844   8.8212395
43                      Sharp et al 2005     81    105  77.1428571
44              Simon-Sanchez et al 2007     64    232  27.5862069
45                 Stefansson et al 2005      0      1   0.0000000
46                     Teague et al 2010    732   4144  17.6640927
47                      Tuzun et al 2005    132    286  46.1538462
48                       Wang et al 2007    213   1291  16.4988381
49                       Wang et al 2008    235   2644   8.8880484
50                    Wheeler et al 2008     17     23  73.9130435
51                       Wong et al 2007    643   5011  12.8317701
52                      Wong et al 2012b   3347  33069  10.1212616
53                         Xu et al 2011   4599  25794  17.8297278
54                      Young et al 2008      3      7  42.8571429
55                        Zhu et al 2011      3      3 100.0000000
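
One way to get per-study counts like those in the table is sketched below in base R. This is only an illustration: the two data frames, their column names and the helper `overlaps_any` are made up for the example, and the intervals are toy values, not the UCSC data.

```r
# Toy data only: these intervals are invented for illustration.
sv <- data.frame(chrom = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 500, 50, 900),
                 end   = c(200, 600, 150, 950),
                 study = c("A", "A", "B", "B"),
                 stringsAsFactors = FALSE)
problem <- data.frame(chrom = c("chr1", "chr2"),
                      start = c(150, 1000),
                      end   = c(300, 2000),
                      stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when the half-open interval [start, end)
# overlaps at least one region on the same chromosome.
overlaps_any <- function(chrom, start, end, regions) {
  mapply(function(ch, s, e) {
    r <- regions[regions$chrom == ch, ]
    any(s < r$end & e > r$start)
  }, chrom, start, end, USE.NAMES = FALSE)
}

sv$overlapping <- overlaps_any(sv$chrom, sv$start, sv$end, problem)

# Per-study summary with the same columns as the table above.
overlapping <- tapply(sv$overlapping, sv$study, sum)
total <- tapply(sv$overlapping, sv$study, length)
result <- data.frame(Study = names(total),
                     Overlapping = as.vector(overlapping),
                     Total = as.vector(total),
                     Percent = 100 * as.vector(overlapping) / as.vector(total))
print(result)
```

The row-by-row `mapply` is fine for a toy example but slow for hundreds of thousands of variants; a dedicated interval tool would be the practical choice at that scale.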

Should these structural variants be removed from the UCSC table? Or, at least, do they require further validation?

Friday, December 12, 2014

Sequence divergence between human and chimp

Using the ancestral-state data from UCSC, we calculate the fraction of sites that differ between human and chimp. The table below shows that, on average, about 14.5% of the sites differ. ChrX shows a much higher divergence at 23%, which is expected given its lower effective population size (Ne).

Chromosome  Sites_different  Total_sites  Percent_difference
chr1                 574730      4097311               14.03
chr2                 571968      4337056               13.19
chr3                 460826      3601819               12.79
chr4                 483915      3522581               13.74
chr5                 432399      3222413               13.42
chr6                 430879      3147588               13.69
chr7                 426619      2928175               14.57
chr8                 380886      2786423               13.67
chr9                 333171      2283198               14.59
chr10                351451      2471725               14.22
chr11                354062      2492733               14.20
chr12                326365      2396839               13.62
chr13                234420      1761333               13.31
chr14                228533      1645955               13.88
chr15                207061      1516600               13.65
chr16                244166      1680348               14.53
chr17                215074      1476349               14.57
chr18                190200      1397643               13.61
chr19                207813      1157902               17.95
chr20                164209      1207849               13.60
chr21                106186       719102               14.77
chr22                113969       721194               15.80
chrX                 439833      1906519               23.07
Mean                                                   14.54
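
The percent column is simply the number of differing sites divided by the total number of sites. A small R sketch of that arithmetic, using three rows copied from the table (the data frame and its column names are my own choice for the example):

```r
# Three rows copied from the table above; the other chromosomes are
# left out for brevity.
div <- data.frame(
  chrom     = c("chr1", "chr2", "chrX"),
  different = c(574730, 571968, 439833),
  total     = c(4097311, 4337056, 1906519)
)

# Percent of sites that differ, per chromosome.
div$percent <- 100 * div$different / div$total
print(round(div$percent, 2))

# Two possible "averages": the unweighted mean of the per-chromosome
# percentages, and the pooled percentage over all sites.
unweighted_mean <- mean(div$percent)
pooled <- 100 * sum(div$different) / sum(div$total)
```

The 14.54 at the bottom of the table is the unweighted mean of the 23 per-chromosome percentages. A pooled percentage would weight large chromosomes more heavily and gives a slightly different number.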

When we look at divergence in windows across chromosome 2 and chromosome X, we see higher divergence at the centromere and telomeres (see figures below and note the difference in the scale of the y-axis).



The fraction of sites that are different between human and chimp in a window is correlated with the DAFs (derived allele frequencies) we looked at earlier. The correlation coefficient for chr2 is 0.4 and for chrX it is 0.2.
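
A rough sketch of the windowing step in R, under assumptions of my own: a per-site table with a position and a differs/does-not-differ flag, fixed non-overlapping windows, and a made-up per-window mean DAF vector. None of the values are real data, and the window size here is arbitrary.

```r
# Hypothetical per-site table: position and whether human and chimp differ.
sites <- data.frame(
  pos     = c(10, 40, 90, 120, 160, 199, 210, 250),
  differs = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Assign each site to a fixed-size window, labelled by window start.
window_size <- 100
sites$window <- floor(sites$pos / window_size) * window_size

# Fraction of differing sites per window.
frac_diff <- tapply(sites$differs, sites$window, mean)
print(frac_diff)

# Invented per-window mean DAF values, one per window, in the same order.
mean_daf <- c(0.10, 0.30, 0.05)

# Correlation between windowed divergence and windowed mean DAF.
r <- cor(as.vector(frac_diff), mean_daf)
```

With real data, the two vectors have to be built over exactly the same windows before they are correlated.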

Thursday, December 4, 2014

openSNP MDS plot from plink

After downloading all the genotype data from the openSNP website, the first thing to do is to look at population structure, of course!

Using AIMs (Ancestry Informative Markers) is a rather quick way to determine ancestry with a minimal set of markers. Various groups have used slightly different methods to come up with such markers. Instead of looking at these markers, we look at the lactose tolerance phenotype and the population structure in the genotypes with this trait.


The MDS plot (generated using plink after converting using opengwas) has one weird outlier that does not cluster with the rest. It is curious, but I hope they make the data available in an easier to use, standardised format so that I can dig into this.
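
For reference, a sketch of how an MDS plot and its most distant sample could be pulled out in R. It assumes a plink `.mds` file with the usual FID, IID, SOL, C1, C2 columns; the file written here is a toy stand-in with invented values, and "farthest from the median" is just one simple way to flag an outlier, not necessarily how the plot above was inspected.

```r
# Toy stand-in for a plink .mds file (values invented for illustration).
mds_file <- tempfile(fileext = ".mds")
writeLines(c("FID IID SOL C1 C2",
             "f1 i1 0 0.01 0.02",
             "f2 i2 0 0.02 0.01",
             "f3 i3 0 -0.01 0.00",
             "f4 i4 0 0.00 -0.02",
             "f5 i5 0 0.90 0.85"), mds_file)

mds <- read.table(mds_file, header = TRUE, stringsAsFactors = FALSE)

# Distance of every sample from the median position in the C1/C2 plane.
d <- sqrt((mds$C1 - median(mds$C1))^2 + (mds$C2 - median(mds$C2))^2)
outlier <- mds$IID[which.max(d)]

# Plot the first two MDS dimensions and label the most distant sample.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(mds$C1, mds$C2, xlab = "C1", ylab = "C2", main = "MDS plot")
text(mds$C1[which.max(d)], mds$C2[which.max(d)], labels = outlier,
     pos = 2, col = "red")
dev.off()
```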

The number of males who decided to reveal their gender [290/1379 (21.03%)] is roughly double that of females [149/1379 (10.8%)]. Date of birth (DOB) was shared slightly more reluctantly, with 403/1379 (29.22%) sharing their DOB compared to the 439/1379 (31.83%) who shared their gender. Not surprisingly, most people who shared their gender also shared their DOB (116/149 females and 253/290 males). See the barplot below for the distribution of DOBs, with a mean age of ~40 (minimum is 19 and maximum is 114).
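
The percentages above are plain proportions of the 1379 users. A quick R check of the arithmetic, with the counts copied from the paragraph (the variable and function names are just for this example):

```r
# Counts taken from the paragraph above.
n_total  <- 1379
n_male   <- 290
n_female <- 149
n_dob    <- 403

# Percentage of a count out of n, rounded to two decimals.
pct <- function(x, n = n_total) round(100 * x / n, 2)

shares <- c(male   = pct(n_male),
            female = pct(n_female),
            gender = pct(n_male + n_female),
            dob    = pct(n_dob))
print(shares)

# Ratio of males to females among those who shared their gender.
male_to_female <- n_male / n_female

# Share of gender-sharers who also shared their DOB (116 females, 253 males).
both <- 116 + 253
pct_both <- pct(both, n_male + n_female)
```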


Monday, December 1, 2014

Human chr15 SLC24A5 pigmentation gene

Coming back to being human, we look at chr15 DAFs. The most prominent feature is the short_arm (information not available). Having come across the database of recent positive selection across human populations (dbPSHP), I was hoping to see a clear DAF signal near the SLC24A5 gene. It does not seem to be one of the more prominent DAF peaks. In fact, adding the SV2B gene to the plot does not bring it near any of the prominent DAF peaks either. I will have to see how many of my DAF peaks actually match up with those in dbPSHP.

The Spearman correlation coefficient of 0.3277578 between the mean derived allele frequency and the number of CNVs is in line with the results from other chromosomes.

The code is not different from before:

 # Plot mean DAF (top panel) and counts of known structural variants (bottom panel)
 chr <- "chr15"
 jpeg(paste("DAF.", chr, ".jpeg", sep=""), width=1420)
 par(mfrow=c(2,1))
 # Windowed mean DAF: V2 is the window start, V4 the mean DAF
 M <- read.table(file=paste("h.", chr, ".mean.bed", sep=""), header=FALSE, stringsAsFactors=FALSE)
 plot(as.numeric(M$V2), as.numeric(M$V4), xlab="Position along chromosome", ylab="Mean derived allele frequency", main=chr)
 # Mark the short arm (no information available)
 lines(c(10000,17000000), c(0.2,0.2), col="red", lwd=5)
 text(10000, 0.25, labels="short_arm", col="red")
 # Mark the two telomeres
 lines(c(101981189,101991189), c(0.3,0.3), col="blue", lwd=5)
 text(101981189, 0.35, labels="Telomere", col="blue")
 lines(c(0,10000), c(0.3,0.3), col="blue", lwd=5)
 text(0, 0.35, labels="Telomere", col="blue")
 # Mark the genes of interest
 lines(c(48413169,48434589), c(0.25,0.25), col="brown", lwd=5)
 text(48413169, 0.3, labels="SLC24A5 gene", col="brown")
 lines(c(91643182,91844539), c(0.25,0.25), col="brown", lwd=5)
 text(91643182, 0.3, labels="SV2B gene", col="brown")
 # Windowed counts of known structural variants: V2 is the window start, V4 the count
 C <- read.table(file=paste("h.", chr, ".countdgv.bed", sep=""), header=FALSE)
 plot(C$V2, C$V4, xlab="Position along chromosome", ylab="Count of known structural variants", main=chr)
 # Spearman correlation between mean DAF and structural variant count per window
 cor.test(as.numeric(M$V4), C$V4, method="spearman")
 dev.off()