Thursday, February 28, 2013

unix sort on multiple columns

For the nth time today, I had to think about how to do a unix sort on multiple columns, with some columns compared as numbers and others as text. Fortunately, I had had a coffee beforehand, so I figured out that for sorting a file that looks like this:


chr01 1
chr01 10
chr01 2
chr01 5
chr01 8
chr02 9
chr02 9
chr02 19
chr02 98
chr02 92

you can run the command "sort -k1,1 -k2n,2 sorttest" from GNU coreutils, which compares the first column as text and the second column as a number, to get something that looks like this:


chr01 1
chr01 2
chr01 5
chr01 8
chr01 10
chr02 9
chr02 9
chr02 19
chr02 92
chr02 98

and run the command "sort -k1,1 -k2,2 sorttest", which compares both columns as text, to get something that looks like this:

chr01 1
chr01 10
chr01 2
chr01 5
chr01 8
chr02 19
chr02 9
chr02 9
chr02 92
chr02 98

All commands were run using "GNU coreutils 8.12.197-032bb", compiled in September 2011, as part of the Ubuntu OS.
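
The two commands above can be tried side by side with a small sketch like the one below. The temporary file stands in for "sorttest", and the LC_ALL=C prefix is an assumption added here (it is not in the original commands) so that the text comparison does not depend on the locale.

```shell
# Recreate the example input in a temporary file (stand-in for "sorttest").
sorttest=$(mktemp)
printf '%s\n' 'chr01 1' 'chr01 10' 'chr01 2' 'chr01 5' 'chr01 8' \
              'chr02 9' 'chr02 9' 'chr02 19' 'chr02 98' 'chr02 92' > "$sorttest"

# LC_ALL=C (an assumption, see above) forces plain byte-order comparison.
# Key 1 is compared as text, key 2 as a number.
numeric_sorted=$(LC_ALL=C sort -k1,1 -k2n,2 "$sorttest")

# Same keys, but key 2 is now compared as text, so "10" sorts before "2".
text_sorted=$(LC_ALL=C sort -k1,1 -k2,2 "$sorttest")

echo "second column as number:"
echo "$numeric_sorted"
echo "second column as text:"
echo "$text_sorted"

rm -f "$sorttest"
```

The only difference between the two calls is the "n" modifier on the second key, which is what switches that column from text to numeric comparison.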







Saturday, February 16, 2013

The process - inspired by Eugene Myers

Heard the Linné lecture by Eugene Myers about "Molecular Cell Biology via Bio-Image Informatics" and was so inspired by it that I had to blog about it. Apart from being on the author list of the "BLAST" paper, which has over 40,000 citations, he was part of the human, mouse and fly genome projects.

Well, it seems the amount of data generated by these projects is not enough for him! So he has decided to work on bio-image informatics, which involves the interpretation of images generated by staining different features of species. The location of proteins within cells, the location of cells within organs and the movement of mouse whiskers in response to stimuli are some of the things he talked about.

"Technology--->Experiment--->Analysis--->Model" is the process that he talked about, his main interest being in the experiment: be it an experiment that uses Sanger sequencing technology to generate a human genome, which is used to build models of human genetics, or an experiment that uses powerful microscopy technologies to analyse the relative location of cells in an organism to build a model of development and function. The clarity of his thought and expression was amazing; it is probably how a genius works.




Tuesday, February 12, 2013

Pleiotropy - mutations tend to affect more than 1 phenotypic character

As a continuation of the previous post in the Fundamental concepts in genetics series, here is the next topic: "The pleiotropic structure of the genotype–phenotype map: the evolvability of complex organisms". The paper starts off with a description of the goals of genetics and how the genotype-phenotype map (GPM) is very useful in understanding genetics.

While pleiotropy has sometimes been defined as changes in one gene affecting traits that are seemingly unrelated, the part about "seemingly unrelated" is dropped, probably because it is difficult to define what seems to be related or not.

The cost of complexity hypothesis postulates that complex organisms are fundamentally less evolvable than simpler organisms, as complex organisms are more pleiotropic. Fisher's geometric model and the microscope analogy are described. While theoretical predictions have been available for some time, empirical data has only become available recently. The measurement of pleiotropy and the extent and patterns of pleiotropy are being studied with the datasets that have been generated.

Measurement of pleiotropy is complicated by at least 2 different cases:
  1. Closely linked genes affecting 2 different traits tend to be co-inherited, and their effects can appear to be due to pleiotropic effects of a single gene.
  2. A shared cis-regulatory element of 2 genes will affect the traits controlled by both genes. This has been called artefactual pleiotropy, as pleiotropy can be defined as a character of a gene rather than that of a mutation.
Apart from the technical difficulties involved in measuring pleiotropic effects, conceptual problems like the definition of a "phenotypic trait or character" make it difficult to have an objective measure of pleiotropy.
  • QTL data tend to overestimate pleiotropic effects due to biases caused by linked genes
  • Gene knockout and knockdown experiments avoid the problem of closely linked genes but only measure the effects of mutations that lead to complete loss of gene function
  • Measures of pleiotropy depend on the number and type of traits that are measured in an experiment
  • Traits that are beyond the detection limits of the methods used to measure them also tend to lead to a distorted picture of pleiotropy
  • Theoretical methods to estimate pleiotropy have used the relationship between the genetic load, effective population size and effective dimensionality of the phenotype (average pleiotropy)
  • Distributions of fitness effects have also been used to predict phenotypic complexity based on predictions of Fisher's geometric model (FGM)
Correlation among the traits also introduces many issues in the estimation of pleiotropy. However, while universal pleiotropy was considered to be the norm, it is being increasingly argued that the extent of pleiotropy is very limited (variational modularity or restricted pleiotropy). With the datasets from yeast, C. elegans etc., it is being seen that the degree of pleiotropy, even with the upward-biased estimates, is rather low.

While the molecular basis of pleiotropy is not well studied, two types of pleiotropy were suggested by Gruneberg as far back as 1938. Type I (initially called genuine) pleiotropy refers to multiple molecular functions of a single gene product. Type II (initially called spurious) pleiotropy refers to multiple morphological and physiological effects of a single molecular function. Recent studies have shown type II pleiotropy to be the more prevalent. Hence, the pleiotropy seen today could be a result of new biological functions being assigned to the same genes, which still have the same molecular functions. However, the extent of pleiotropy is not settled, and settling it will probably require a lot of work on the various issues involved in measuring and analyzing pleiotropy.


Unix for foreach loop with range and vector

To loop through values in the bash shell on unix or linux, one can make use of the "for" loop.

Example: Specify range
 
for i in {1..24};
do
echo $i
done

One can also use C-style for loops, like:

for ((i=1;i<=25;i++));
do
echo $i
done
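
Two details about the loops above are worth a rough sketch: a brace range can take a step (in bash 4 and later), and a brace range is not expanded when its bound is a variable, which is where the C-style loop helps. The variable names and the bound n=5 below are made up for illustration.

```shell
# Brace ranges accept an optional step in bash 4 and later.
odd=""
for i in {1..9..2}; do
    odd+="$i "
done
echo "$odd"

# Brace expansion happens before variable expansion, so {1..$n} is NOT
# turned into a range. Use the C-style loop when the bound is a variable.
n=5   # hypothetical upper bound
counted=""
for ((i = 1; i <= n; i++)); do
    counted+="$i "
done
echo "$counted"
```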

It can be used like Perl's foreach loop by specifying a list of values. Ranges can be mixed with individual values in the list, and a range can also be descending.


for i in {1..24} 25 50 {3..4} 100 150 {5..2} qw er st {a..z};
do
echo $i
done

The list can also contain strings and character ranges such as {a..z}, not just numbers.
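
The list in the loop above is written out inline. As a minimal sketch, the same kind of loop also works over a real bash array, which is closer to Perl's foreach over an array. The array name and its values are hypothetical.

```shell
# A real bash array; the element values are made up for illustration.
chromosomes=(chr01 chr02 "chr X")

# Quoting "${chromosomes[@]}" keeps each element intact, even with spaces.
seen=0
for chrom in "${chromosomes[@]}"; do
    echo "processing $chrom"
    seen=$((seen + 1))
done
echo "$seen elements"

# Brace ranges can be inspected with echo before they are used in a loop.
desc=$(echo {5..2})      # descending numeric range
letters=$(echo {a..e})   # character range
echo "$desc"
echo "$letters"
```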

Sunday, February 3, 2013

Epistasis - interaction between genes


Why are crows black? It seems the answer has two components, the first being mechanistic and easier to answer, while the second is evolutionary. While it might not be the ultimate answer (which is of course 42), it does go much further towards the "why" than just the "how".

Epistasis is simply defined as "interaction between genes". However, in the earlier paper "The Language of Gene Interaction", the author explains the 2 slightly different ways the term has been used.

William Bateson is credited with "inventing" the term in 1908-09 to explain the disagreement between the segregation ratios expected based on the action of separate genes and the actual results of a dihybrid cross. The action of one locus (epistatic) masking the effects of alleles at another locus (hypostatic) gives the effect of the epistatic locus "standing upon" the hypostatic locus. However, the term has expanded to cover:
  1. Functional relationship between genes--Functional epistasis--protein-protein interactions
  2. Genetic ordering of pathways--Compositional epistasis--"measures the effects of allele substitution against a particular fixed genetic background"
  3. The quantitative differences of allele-specific effects--Statistical epistasis--"measures the average effect of allele substitution against the population average genetic background"

Epistasis as a Tool - Flower color in sweet peas

A non-Mendelian segregation ratio of 9:7, seen when two white-flowered plants are crossed and produce violet flowers, has been attributed to mutations in 2 different genes in the anthocyanin pathway. This framework has been extended to elucidate the order of genes in pathways by using knock-out mutations.
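
The 9:7 ratio can be worked out in a rough sketch, under assumptions that are not spelled out above: the two genes (called A and B here purely for illustration) assort independently, the ratio is counted among the offspring of violet plants that are heterozygous at both genes, and pigment is made only when at least one functional (dominant) allele is present at each gene.

```latex
% A_ and B_ denote "at least one dominant allele" at the hypothetical genes A and B.
\begin{align*}
P(\text{violet}) &= P(A\_)\,P(B\_) = \tfrac{3}{4}\cdot\tfrac{3}{4} = \tfrac{9}{16},\\
P(\text{white})  &= 1 - \tfrac{9}{16}
                  = \underbrace{\tfrac{3}{16}}_{A\_\,bb}
                  + \underbrace{\tfrac{3}{16}}_{aa\,B\_}
                  + \underbrace{\tfrac{1}{16}}_{aa\,bb}
                  = \tfrac{7}{16}.
\end{align*}
```

So the usual 9:3:3:1 dihybrid classes collapse into 9 violet to 7 white, because losing either gene gives the same white phenotype.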

High-throughput approaches that test the effects of all possible combinations of genes in an organism are being carried out using comprehensive deletion and knockdown libraries, along with high-throughput maintenance and screening methods. Both qualitative and quantitative experiments have tested different numbers of gene knockouts in various genetic backgrounds to understand the interactions between genes. However, the sheer number of possible interactions is still a limiting factor. Moreover, interactions need not be a simple presence-absence effect: different expression levels, mutations at various positions in a gene etc. can produce other possible combinations of interactions. While knockout libraries have been widely used in yeast, they are not as easily applicable to many other systems. RNAi knockdown libraries have been used to overcome these limitations.

The presence of epistatic interactions has been an obstacle to the mapping of many traits to their genes. Recently, a few traits with epistatic effects have been identified in human disease genetics. The evolutionary basis of the advent of epistatic effects has been explained by at least 3 different models. Further analysis of cases involving epistasis is required to understand the role played by epistasis in shaping or driving evolution.