Monday, April 6, 2015

How common is positive selection "inference" in different parts of the human genome?

The database of positive selection provides a handy resource for analyzing population-specific trends in positive selection inference and their prevalence in different parts of the human genome. Converting the database into a bed file (losing some records during conversion) makes it easier to run a bunch of analyses on the dataset.

We next divide the genome into 50 Kb windows and count the number of distinct populations in which positive selection was inferred in each window. In this way we can identify "hot-spots" of positive selection in the human genome. The plot below shows the number of populations in which positive selection has been inferred in each of these windows. Only one window, on chromosome 2 stretching from 72500000 to 72550000, has been implicated in more than 20 populations. While the Y chromosome and Mt are empty :), the X chromosome has ~2271 windows with at least one population.
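The windowing step can be sketched in a few lines of Python. This is only a minimal illustration, not the script used for the plot: it assumes the bed records have already been parsed into (chromosome, start, end, population) tuples with 0-based, half-open coordinates, and the function name and toy records are made up for the example. An interval that crosses a window boundary is counted in every window it touches, which is also an assumption.

```python
from collections import defaultdict

WINDOW_SIZE = 50_000  # 50 Kb windows, as in the text


def populations_per_window(records, window_size=WINDOW_SIZE):
    """Map (chrom, window_start) to the set of populations seen in that window.

    records: iterable of (chrom, start, end, population) tuples using
    BED-style 0-based, half-open coordinates (an assumption of this sketch).
    """
    windows = defaultdict(set)
    for chrom, start, end, population in records:
        if end <= start:
            continue  # skip empty or malformed intervals
        first = start // window_size
        last = (end - 1) // window_size  # end is exclusive
        for index in range(first, last + 1):
            windows[(chrom, index * window_size)].add(population)
    return windows


# Toy records, invented purely for illustration.
records = [
    ("chr2", 72_510_000, 72_520_000, "CEU"),
    ("chr2", 72_530_000, 72_560_000, "YRI"),  # spans two windows
    ("chr2", 72_540_000, 72_545_000, "CEU"),  # same population, same window
    ("chrX", 100, 200, "CHB"),
]

windows = populations_per_window(records)
counts = {key: len(pops) for key, pops in windows.items()}
for (chrom, window_start), n in sorted(counts.items()):
    print(chrom, window_start, window_start + WINDOW_SIZE, n)
```

Using a set per window means a population reported several times in the same window is still counted once, which matches the "distinct populations" idea above.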

The above plot shows that some regions of the genome have been implicated in multiple populations more often than others. Such regions could be termed "hot-spots" of positive selection. However, this pattern reflects the bias in our ability to identify or infer such events rather than actual differences in how often they occur.

Are positive selection inferences spread more widely over the genome in some populations than in others? Is it even possible to make such a comparison given the heterogeneity of the dataset? The methods used differ, as does the resolution of the scans. Still, the figure below shows how such inferences are distributed for populations that cover more than 1000 of the ~62,000 windows (50 Kb each) spanning the human genome.
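Counting windows per population and applying the 1000-window cutoff could look roughly like the sketch below. It assumes a mapping from each window to the set of populations inferred in it; the helper names and the toy data are hypothetical, and the toy example uses a cutoff of 1 instead of 1000 so that the filter does something visible.

```python
from collections import Counter


def windows_per_population(windows):
    """Count, for each population, how many windows it appears in."""
    totals = Counter()
    for populations in windows.values():
        totals.update(populations)  # each population counted once per window
    return totals


def populations_above(totals, min_windows):
    """Return (population, n_windows) pairs with more than min_windows windows,
    sorted from most to fewest windows."""
    return [(pop, n) for pop, n in totals.most_common() if n > min_windows]


# Toy mapping of (chrom, window_start) to populations, invented for illustration.
example_windows = {
    ("chr1", 0): {"MKK", "CHB"},
    ("chr1", 50_000): {"MKK"},
    ("chr2", 0): {"MKK", "CEU"},
    ("chr2", 50_000): {"CHB"},
}

totals = windows_per_population(example_windows)
selected = populations_above(totals, min_windows=1)  # the post uses 1000
print(selected)
```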


MKK has the largest number of windows, followed by CHB, CEU and YRI. Since not all observations are independent, it is hard to draw any broad conclusions from these numbers.