Nagarjun's blog: 2013

Sunday, December 22, 2013

Pseudogene distribution across the human genome

Pseudogene's are those genes which have lost their ability to code proteins or are not expressed due to other changes in the genome. The Ensemble genes 74 annotation of the human genome hg19 has 15,605 annotated pseudogenes. Based on extensive manual curation and automated predictions, the number of known pseudogenes has increased in the human genome over time.

Distribution of pseudogenes across the human chromosomes (hg19)

Above figure (Ideograph generated by Idiographica ) shows the distribution of the pseudogenes annotated in the latest build of Ensemble. The complete lack of pseudogenes on the small arms of chromsome 13, 14, 15 and 22 is rather striking. A more comprehensive annotation dedicated to the identification and analysis of pseudo genes can be found at pseudogene.org. The latest build consists of 17172 records. The same pattern can be seen even in this more extensive annotation.

While the pattern is striking, it might be due to changes to the chromosome builds affecting the short arms of these chromosomes. However, the possibility of this being biological is of course very interesting. Could it correspond to chromatin type or some other genomic feature? Apparently it does correspond to the hetero-chromatic region of the genome that has not been sequenced.

Monday, November 11, 2013

White hair and assortative mating

Assortative mating is defined as non-random within genotype or phenotype mating. Such non-random patterns of mating are of significance as they lead to population divergence and promote speciation. Due to its significance to population genetics as well as ecology it has been studied in a wide range of species for a traits such as age, behavior, phenology, genotype, ecotype, visual (color) differences among others. While most assortative mating is considered to be positive, few cases of negative assortative mating or diss-assortative mating have been recorded.

Various phenotypes like bird song, body shape, size or color have been seen to be the trait based on which assortment of mating preferences happen. More than one trait can also define mating preferences, as seen in the Northern Cardinalis. Depending on the nature of the trait that leads to assortative mating, consequences for the evolution of that trait are profound.

While ecologists, have no trouble defining assortative mating around phenotypes, geneticists seem to be very picky. Even the third edition of the book "Evolution" by Ridley defines assortative mating as "Tendency of like to mate with like. It can be for a certain genotype (e.g., individuals with genotype AA tend to mate with other individuals of genotype AA) or phenotype (e.g., tall individuals mate with other tall individuals)."

On the other hand in the book "An Introduction to Population Genetics" by Rasmus Nielsen and Montgomery Slatkin define assortative mating as "A mating structure in which pairs of individuals that are (genetically) similar to each other mate with higher probability than expected under random mating."

Is the "genetic" basis of the non-randomness implied or even necessary? Does the lack of a genetic basis make the phenomenon less interesting? The recent review/meta-analysis by Kirkpatrick's group seems to suggest that these and many more questions could help explain trends in the process of species formation.

Monday, June 10, 2013

Randomise order of lines in a file

Different programming languages are good at different things. R has many powerful statistical functions, while perl is good at data handling.

N random numbers from a certain range of numbers without re-sampling can be easily done in R with the "sample" function. To do the same thing in Perl, looping has (or atleast some form of iteration) to be used along with storing the results and checking to avoid re-sampling.

 args<-commandArgs(TRUE)  
 totalsnps<-as.integer(args[1])  
 runumber<-as.integer(args[2])  
 sample(1:totalsnps,totalsnps,replace=F)->N  
 write.table(file=paste("rands.out",runumber,sep="."),N,col.names=F,row.names=F,quote=F)

This Rscript can be run using the below line

 Rscript sampleit.r $linecount $iterationcount

Once, the file with the new order of lines has been generated, it can be used by the perl script to write the file in the new order. We also keep the first two columns of the file unchanged and just randomise the remaining parts of the file.

 #!/usr/bin/perl  
 open RANDS, $ARGV[1] or die $!;  
 my %rhash = ();  
 my $count=1;

#read in the file created by the R script in previous step
 while($lines = <RANDS>){  
 chomp $lines;  
 $rhash{$count}=$lines;  
 $count++;  
 }  
 close RANDS;

#read the file that needs to be randomised and store it in hash with new order
 open STATS, $ARGV[0] or die $!;  
 my $hash = {CHR =>my $genename,POS =>my $pid,RESTATS =>my $pco1};  
 $mycount=1;  
 while($line = <STATS>){  
 chomp $line;  
 @tabs=split(/[ \t]+/,$line);  
 $hash{$mycount}{CHR}=$tabs[0];  
 $hash{$mycount}{POS}=$tabs[1];  
 $line =~ m/\w*\t\w*\t(.*)$/;  
 $hash{$rhash{$mycount}}{RESTATS}=$1;  
 $mycount++;  
 }#end of file while loop

#check if number of lines match in old and new file
 if($mycount!=$count){print "mismatch in counts\n";}

#print the file out in new order
 foreach $contigs (sort { $a <=> $b } keys %hash) {   
 print "$hash{$contigs}{CHR}\t$hash{$contigs}{POS}\t$hash{$contigs}{RESTATS}\n";   
 }

This perl script just reads in the input file, stores it in a hash with new line order and then prints it out.

Friday, June 7, 2013

Distribution of the six most sighted birds

The folks over at the molecular Ecologist have a instructive tutorial on map making using R. The data from the birders in India, provides a novel dataset to try out these new map making skills. To start of with something simple, we look at the distribution of the 6 most sighted birds across India.

Only 6 species of birds have more than 700 recorded observations. The "Rosy Starling" leads the record with 864 observations followed by "Grey Wagtail" (857), "Common Sandpiper"(787), "Barn Swallow" (752),"Pied Cuckoo" (750) and "Greenish Warbler" with 710 observations.

 read.csv("migrantwatch_reports.csv",row.names=NULL)->M  
 colnames(M)<-c("Species","Location name","City","State","Reporter","Date","Sighting type","Observation frequency","Start date","On behalf of","Latitude","Longitude")  
 numberofclusters<-10  
 N<-data.frame(M$Latitude,M$Longitude)  
 kmeans(N, numberofclusters)->P  
 M$cluster<-P$cluster  
 as.data.frame(P$centers)$M.Latitude->lat  
 as.data.frame(P$centers)$M.Longitude->lon  
 BSwallow<-as.data.frame(table(M$cluster[M$Species=="Barn Swallow"]))$Freq  
 CSandpiper<-as.data.frame(table(M$cluster[M$Species=="Common Sandpiper"]))$Freq  
 GWarbler<-as.data.frame(table(M$cluster[M$Species=="Greenish Warbler"]))$Freq  
 GWagtail<-as.data.frame(table(M$cluster[M$Species=="Grey Wagtail"]))$Freq  
 PCuckoo<-as.data.frame(table(M$cluster[M$Species=="Pied Cuckoo"]))$Freq  
 RStarling<-as.data.frame(table(M$cluster[M$Species=="Rosy Starling"]))$Freq  
 library(maps)  
 library(mapdata)  
 library(mapplots)  
 jpeg("top6.map.jpeg")  
 map("worldHires","India",col="gray90", fill=TRUE)  
 for (i in 1:numberofclusters) {  
 add.pie(z=c(BSwallow[i],CSandpiper[i],GWarbler[i],GWagtail[i],PCuckoo[i],RStarling[i]), x=lon[i], y=lat[i], col=c("blue","brown","Green","Grey","black","yellow"), labels="")  
 }  
 legend("topright",c("Barn Swallow","Common Sandpiper","Greenish Warbler","Grey Wagtail","Pied Cuckoo","Rosy Starling"),col=c("blue","brown","Green","Grey","black","yellow"),pch="*")

Running the above code gives a nice(take up the politics with the guys who made the R package) map with pie-charts that looks below image:

The pie-charts are generated for the mean locations that are identified by k-means clustering on all observations with a K of 10. So, this also gives us an idea of where most of the observations come from.

Rosy Starling's are definitely observed more often on the west coast while the Grey Wagtail is seen more often to the south. A large number (337) of the Rosy Starling observations come from Shantilal Varu, whose favorite birds are waders.

Thursday, June 6, 2013

Importance of recording observation frequency

With some feedback from MigrantWatch regarding the post about weekend bias in bird observations, it seems that the "Observation frequency" field is very important indeed. The whole weekend effect does seem to disappear when the data is filtered for only Daily observations.

While, only 3029 of the 22622 observations ~13% of the observations are Daily, the weekend effect does have a nice control and should affect first and last sighting results much less. Just staring at the graphs might not be appealing to the more statistically minded, so doing a test like analysis of variance might be more productive.

 > data.frame(Day=format(as.Date(M$Date), format="%A"),Year=format(as.Date(M$Date), format="%Y"))->WD  
 > as.data.frame(table(WD))->WDD  
 > WDD[WDD$Year %in% c(2007:2013),]->WDDF  
 > aov(Freq~Day,WDDF)->WDDFA  
 > summary(WDDFA)  
       Df Sum Sq Mean Sq F value  Pr(>F)    
 Day     6 3875954 645992 5.0784 0.0005365 ***  
 Residuals  42 5342556 127204             
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1   
 > M[M$"Observation frequency" == "Daily",]->MD  
 >   
 > data.frame(Day=format(as.Date(MD$Date), format="%A"),Year=format(as.Date(MD$Date), format="%Y"))->DWD  
 > as.data.frame(table(DWD))->DWDD  
 > DWDD[DWDD$Year %in% c(2007:2013),]->DWDDF  
 > aov(Freq~Day,DWDDF)->DWDDFA  
 > summary(DWDDFA)  
       Df Sum Sq Mean Sq F value Pr(>F)  
 Day     6  466  77.75 0.0548 0.9993  
 Residuals  42 59625 1419.64

While the ANOVA shows a significant effect of Day for the whole dataset, it does not show one for data filtered for only daily observations. A much larger sample size would be needed to ensure that this field is being used in a proper way.

The other point about these observations being affected by Holidays also seems very valid.

 as.data.frame(table(data.frame(Day=format(as.Date(M$Date), format="%j"))))->DY  
 > DY[DY$Freq>200,]  
   Var1 Freq  
 20  020 246  
 27  027 224  
 337 337 261  
 358 358 213  
 365 365 204

By tabulating the observations by day of year, annual holidays like Christmas and New years should be captured. Lo and behold! Christmas (358th day of the year) and New years eve (365th day) are in the top five.

Why does 337th day (Lawyers day in India) have the most observations? May be most of the birders are lawyers!!

Wednesday, June 5, 2013

Bird populations increase during weekends

Yes, the populations of birds increases drastically during weekends. This apparent bias in recording bird sightings during weekends in citizen science databases has been quantified for Europe and North America by Courter et al. They see a decreasing effect of the weekend effect over time for the bird species that they analyzed.

MigrantWatch is a 5 year old citizen science initiative in India that has recorded as many as 22,622 reports as of today. Although the web interface started in 2007, the dataset available for download does have 63 observations that are older, going all the way back to 1910 (Whitehead, CHT. 1911. JBNHS 21).

To get started with analyzing this dataset, it might be interesting to see how the weekend bias has changed over the years. Since, many of these reports are filed by amateur birders it should definitely show a weekend bias. In-fact Dr Jayant Wadatkar fondly remembers a weekend trip to Malkhed Reservoir when he sighted a Common Shelduck for the first time in Maharashtra. While such anecdotal information and common sense would suggest that the weekend effect is real, quantifying its magnitude might be more useful.

So lets tabulate it using the following lines:

 read.csv("migrantwatch_reports.csv",row.names=NULL)->M  
 colnames(M)<-c("Species","Location name","City","State","Reporter","Date","Sighting type","Observation frequency","Start date","On behalf of","Latitude","Longitude")  
 data.frame(Day=format(as.Date(M$Date), format="%A"),Year=format(as.Date(M$Date), format="%Y"))->WD  
 table(WD)

Tabulation of all observations per day of week from 2007 onward

Day	2007	2008	2009	2010	2011	2012	2013	Total
Monday	48	186	217	226	249	662	277	1865
Tuesday	63	249	208	228	278	562	320	1908
Wednesday	74	228	187	236	363	807	291	2186
Thursday	64	228	252	285	270	722	313	2134
Friday	49	244	300	306	306	646	508	2359
Saturday	123	491	477	614	642	1615	587	4549
Sunday	216	848	835	895	954	2477	1333	7558
	637	2474	2476	2790	3062	7491	3629

The weekend (Sunday more than Saturday) effect does seem rather clear just by eye. Observations for each of the weekdays ranges from 8 to 10% while Saturday has 20% and Sunday has more than 30% of the observations.

While this will obviously affect estimates of first sighting and last sighting for a season in the short term, over time this effect should not affect the larger patterns. However, one has to definitely check the effect of the day of week on all results that are obtained.

Due to the nature of the data collection effort the weekend effect is something that one has to live with.

Sunday, May 26, 2013

Trees of whole genomes, but locally

With increasing use of Next-generation sequencing methods, whole genomes of multiple species as well as closely related strains or populations are being generated. Sophisticated models to interpret this data that use our current knowledge of population dynamics and molecular processes are being used extensively.

The general idea is to be able to distinguish the population histories of various parts of the genome. The phylogenies of different parts of the genome might differ based on various functional constraints. Hence, building a single consensus tree for the entire genome will average over these local effects. To be able to overcome this limitation, many methods are being proposed to identify local phylogenies across the genome.

The program Saguaro, from Science for Life Laboratory and Broad Institute, is capable of using SNP data in VCF format or whole genome assembly alignments in MAF format to partition the genome based on local phylogenetic information. The method and its use on few example datasets is described in Unsupervised genome-wide recognition of local relationship patterns.

Get gene start and end from GTF file

In a GTF file that does not have CDS records annotated, its tricky to get the start and end coordinates of a gene. This script does just that and prints a nice BED format file. It also report the genes that occur on multiple scaffolds or chromosomes as errors. These can be greped out using grep -v "Error".

 #!/usr/bin/perl  
 my $gtf1 = $ARGV[0];  
 my %scaf=();  
 my $start=();  
 my %end=();  
 my %multi=();  
 open(GTF1, $gtf1);  
 while($z=<GTF1>){   
 chomp $z;  
 my @parts=split(/\t/,$z);  
 my $start=$parts[3];  
 my $end=$parts[4];  
 my @parts2=split(/\s/,$parts[8]);  
 my $gene=$parts2[1];  
 $gene=~s/[\;\"]//g;  
 ##print "$gene\t$start\t$end\n";  
 if(!(exists $scaf{$gene})){$scaf{$gene}=$parts[0];}  
 elsif(!($scaf{$gene}=~m/$parts[0]/)){$multi{$gene}=1;}  
 if(!(exists $start{$gene})){$start{$gene}=$start;$end{$gene}=$end;}  
 else{$end{$gene}=$end;}  
 }  
 close(GTF1);  
 foreach my $keygene(sort keys %scaf){  
 if($start{$keygene}>$end{$keygene}){$temp=$start{$keygene};$start{$keygene}=$end{$keygene};$end{$keygene}=$temp;}  
 if(!(exists $multi{$keygene})){print "$scaf{$keygene}\t$start{$keygene}\t$end{$keygene}\t$keygene\n";}  
 else{print "Error:Gene $keygene on mutiple scaffolds\n";}  
 }

Saturday, May 25, 2013

Perl script to filter sites with more than 50 percent sites missing by population

In certain cases, its required to have sites from a VCF file filtered by fraction of genotypes missing by population. While entire sites can be filtered using programs like VCFtools, its not possible to filter them by population. Here its assumed that all individuals from $ARGV[1] to $ARGV[2] are in one population and those from $ARGV[3] to $ARGV[4] are in another.

 #!/usr/bin/perl  
 use List::Util qw[min max];  
 # Input parameters  
 my $vcf_file = $ARGV[0];  
 my $firstmissing=0;  
 my $firstall=$ARGV[2]-$ARGV[1];  
 my $secondmissing=0;  
 my $secondall=$ARGV[4]-$ARGV[3];  
 #Go through fasta file, extract sequences  
 open(IN, $vcf_file);  
 while($z=<IN>) {  
 $firstmissing=0;  
 $secondmissing=0;  
      if($z=~m/^#/){print $z;}  
      else{  
           chomp $z;  
           @values=split(/\t/,$z);  
      $totind=length @values;  
      for($i=$ARGV[1];$i<=$ARGV[2];$i++){if($values[$i]=~m/\.\/\./){$firstmissing++;}}  
      for($i=$ARGV[3];$i<=$ARGV[4];$i++){if($values[$i]=~m/\.\/\./){$secondmissing++;}}  
 if((($secondmissing/$secondall)<0.5)&&(($firstmissing/$firstall)<0.5)){print "$z\n";}  
      }#end of else #  
 }#end of file while  
 close IN;

Monday, May 20, 2013

Combining SNP data tables with perl

Continuing with our previous exploits Analyzing in vitro toxicity and expression data, its time to combine SNP data tables using perl.

#!/usr/bin/perl

open CANFAM2, $ARGV[0] or die $!;
open CANFAM3, $ARGV[1] or die $!;

%can2=();
%can3=();

while($line = <CANFAM2>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];

if(exists $can2{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can2{$key}=$value;}

}#end of file while loop

while($line = <CANFAM3>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];


if(exists $can3{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can3{$key}=$value;}

}#end of file while loop


print "SNP_ID\tbases\tchr_canfam2\tpos_canfam2\tchr_canfam3\tpos_canfam3\tdistance_between_versions\n";
$transitions=0;
$transversions=0;

foreach $mykey(sort keys %can2){
@can2tabs=split(/\_/,$can2{$mykey});

if($can2tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
elsif($can2tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}

    if(exists $can3{$mykey}){
        @can3tabs=split(/\_/,$can3{$mykey});
            if($can2tabs[0]=~m/$can3tabs[0]/){$distance=abs($can2tabs[1]-$can3tabs[1]);}else{$distance="NA";}
        print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\t$can3tabs[0]\t$can3tabs[1]\t$distance\n";
    }
    else {
    print "Error:SNP_ID $mykey missing in canfam3.\n";
    print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\tNA\tNA\tNA\n";
    }
}

foreach $mykey(sort keys %can3){
    if(!(exists $can2{$mykey})){
    print "Error:SNP_ID $mykey missing in canfam2.\n";
    @can3tabs=split(/\_/,$can3{$mykey});
    print "$mykey\t$can3tabs[2]\tNA\tNA\t$can3tabs[0]\t$can3tabs[1]\tNA\n";

    if($can3tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
    elsif($can3tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}
    }
}
$titv=$transitions/$transversions;
print "Info: $transitions Transistions\n";
print "Info: $transversions Transversions\n";
print "Info: $titv Ti/Tv ratio\n";

close CANFAM2;
close CANFAM3;

To do something bonus, lets calculate transitions and transversions to infer the Ti/Tv ratio.

Info: 1852527 Transistions

Info: 504318 Transversions

Info: 3.67333111251234 Ti/Tv ratio

Apart from this, 39,352 SNP's were found in only one of the two versions of the assembly. These could be attributed to gaps in the assembly that were filled at a later stage by Illumina sequencing.

Sunday, May 19, 2013

Toxicity and expression data project

One possible solution to Analyzing in vitro toxicity and expression data perl project. Example table that is used as input to the perl script is given in below table.

"source_name_sid"	"casrn"	"chemical_name"	"assay_component_name"	"conc"	"value"	"value_type"	"plate_id"	"rep"	"row"	"col"	"target"	"time_hr"	"reference"	"media_reference"
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.004	1.08	"fold_change"	113	10	"E"	5	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.004	0.96	"fold_change"	113	10	"F"	5	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.004	0.94	"fold_change"	113	10	"G"	5	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.004	0.93	"fold_change"	113	10	"H"	5	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.04	0.94	"fold_change"	113	8	"G"	4	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.04	1.13	"fold_change"	113	8	"E"	4	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.04	0.96	"fold_change"	113	8	"H"	4	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.04	1.07	"fold_change"	113	8	"F"	4	"ABCB1"	6	300.5345.75	778
"DSSTOX_40294"	"135158-54-2"	"Acibenzolar-S-Methyl"	"CLZD_ABCB1_6"	0.4	1.02	"fold_change"	113	6	"F"	3	"ABCB1"	6	300.5345.75	778

Actual perl code:

#!/usr/bin/perl

open TOXTABLE, $ARGV[0] or die $!;

%gcct=();

while($line = <TOXTABLE>){
chomp $line;
@tabs=split(/\t/,$line);

$key=$tabs[2] . "_" . $tabs[4] . "_" . $tabs[11] . "_" . $tabs[12];
$foldchange=$tabs[5];

if(exists $gcct{$key}){if($gcct{$key}<$foldchange){$gcct{$key}=$foldchange;}}
else{$gcct{$key}=$foldchange;}

}#end of file while loop

print "chemical_name\tconctarget\ttime_hr\tvalue\n";

foreach $mykey(sort keys %gcct){
@mytabs=split(/\_/,$mykey);
print "$mytabs[0]\t$mytabs[1]\t$mytabs[2]\t$mytabs[3]\t$gcct{$mykey}\n";
}

Just run the perl script with the filename of the table as first argument and it will print "which chemical compound and at what concentration causes the highest fold change of transcription in each gene at each timepoint".

The same can be done with one line in R using :

 aggregate(M$value,by=list(M$chemical_name,M$conc,M$target,M$time_hr),max)->N

Sunday, May 12, 2013

Defined by the unknown

How would one explain one's linear existence to a being that has no concept of time? Deep Space Nine has a situation in which this transpires. The very philosophical nature of the entire scenario makes it a rather interesting situation.

"It is the unknown that defines our existence. We are constantly searching, not just for answers to our questions, but for new questions. We are explorers. We explore our lives, day by day, and we explore the galaxy, trying to expand the boundaries of our knowledge."

While this might be true of the utopian society that the federation succeeds in building, its far from the truth today. The search is driven more by profit, that even the Firangi would be proud of us.

Friday, April 5, 2013

Ecological Speciation

Ecological Speciation by Patrick Nosil is a book with a red cover with a picture of a fish on its cover. An entire book dedicated to explaining "ecological speciation", its components, outstanding problems and current developments should definitely be comprehensive. The book was written during Nosil's yearlong stay at the institute of advanced study at Berlin, Germany. It has been praised for its clarity and utility to even graduate students who are just entering the field.

While the book was published in 2012, this 2005 paper in Ecology letters gives a good premise to the book. The idea behind ecological speciation is the action of divergent selection that is due to ecological reasons which leads to the evolution of barriers to gene flow leading to the formation of new species. This process is broken down into its 3 constituents, ecologically divergent selection, barrier to gene flow or reproductive isolation and a "genetic" mechanism which links these two.

Why should the mechanism linking divergent selection and reproductive isolation be genetic? Depending on the comprehensiveness of this "genetic" mechanism, could one think of links other than pleiotropy or linkage?. For example, would a scenario that involves speciation (formation of barriers to gene flow) due to some other mechanism expedited by divergent selection be part of ecological speciation? It might be worth thinking about it while reading the chapter thats devoted to the mechanisms linking selection and isolation.

Concepts such as immigrant inviability and isolation by adaptation are also covered. Adaptive speciation is considered a special case of ecological speciation which requires frequency-dependent disruptive selection in sympatry or parapatry. Disruptive selection involves selection for opposite phenotypes within a single population while divergent selection involves selection for opposite phenotypes in different populations.

The book is made up of 9 chapters divided into 3 parts covering 280 pages. So this should definitely be a interesting read.

Saturday, March 16, 2013

Genomes are dynamic entities

Sometime ago, i was wondering about genomes that are impossible to assemble. Differences in the DNA content of Germline and somatic cells is one more oddity that can make genomes much more complex than they might appear. Roundworm genomes of the somatic and germline show a clear change from germline to somatic cells. The loss of genes expressed in the Germline suggests a role for adaptation which might have been taken over by other mechanisms such as methylation, non-coding rna and other silencers of gene expression.

The recently published genome of the lamprey is another species known to have a similar mechanism of genome change. Further studies into the mechanism of such changes, their functional role and evolutionary significance will probably provide some insight into the evolution of tissue specific expression.

The loss of genes expressed in the germline followed by addition of telomeres and location of lost genes near ends of chromosomes also introduces the possibility that loss of germline DNA could be mediated by a telomerase inhibitor such as PINX1. The idea is that PINX1 is expressed in the germline leading to DNA being chewed from the ends of the chromosome. Once, the sequence containing PINX1 is lost, it will lead to expression of telomerase and stop further loss of DNA. To test this hypothesis PINX1 should be available in germline DNA but lost in the somatic genome.

Having obtained the somatic and germline genomes from the ascaris genome website, the PINX1 gene is found to be located on scaffold "AS02549" in the somatic genome and "AG00021" in the germline genome. Hence, it seems unlikely to be the germline DNA loss mechanism. However, its still possible that a trans regulatory element might be lost or some other telomerase inhibitor might be involved. The availability of gene expression data allows one to test this hypothesis. The fact that neither telomerase nor PINX1 is seen in the list of differentialy expressed genes suggests that a different mechanism might be in action.

Thursday, February 28, 2013

unix sort on multiple columns

For the nth time today, i had to think how to do unix sort on multiple columns either based on numbers or alphabets. Fortunately, i had a coffee before, so i figured out that for sorting a file that looks like this:

chr01 1
chr01 10
chr01 2
chr01 5
chr01 8
chr02 9
chr02 9
chr02 19
chr02 98
chr02 92

you can run the command "sort -k1,1 -k2n,2 sorttest" from GNU coreutilities to get something that looks like this:

chr01 1
chr01 2
chr01 5
chr01 8
chr01 10
chr02 9
chr02 9
chr02 19
chr02 92
chr02 98

and run the command "sort -k1,1 -k2,2 sorttest" to get something that looks like this:

chr01 1

chr01 10

chr01 2

chr01 5

chr01 8

chr02 19

chr02 9

chr02 92

chr02 98

All commands were run using "GNU coreutils 8.12.197-032bb" compiled on September 2011 as part of the Ubuntu OS.

Saturday, February 16, 2013

The process - inspired by Eugene Myers

Heard the Linné lecture by Eugene Myers about "Molecular Cell Biology via Bio-Image Informatics" and was so inspired by it, that had to blog about it. Apart being on the author list of the "blast" paper which has over 40,000 citations, he was part of the Human, mouse and fly genomes.

Well, it seems the amounts of data generated by these projects is not enough for him! So he has decided to work with bio-image informatics, which involves interpretation of images generated by staining different features of species. Location of proteins within cells, location of cells within organs and movement of mouse whiskers in response to stimuli are some of the things he talked about.

"Technology--->Experiment----> Analysis--->Model" is the process that he talked about. His main interest being in the experiment. Be it an experiment that uses Sanger sequencing technology to generate a human genome which is used build models of human genetics or an experiment that uses powerful microscopy technologies to analyse the relative location of cells in a organism to build a model of development and function. The clarity of his thought and expression was amazing, it is probably how a genius works.

Tuesday, February 12, 2013

Pleiotropy - mutations tend to affect more than 1 phenotypic character

As a continuation of the previous post on the Fundamental concepts in genetics series here is the next topic : "The pleiotropic structure of the genotype–phenotype map: the evolvability of complex organisms". The paper starts of with a description of the goals of genetics and how the Genotype-Phenotype map (GMP) is very useful in understanding genetics.

While pleiotropy has been sometimes defined as changes in one gene affecting traits that are seemingly unrelated, the part about "seemingly unrelated" is dropped probably as it is difficult to define what seems to be related or not.

The cost of complexity hypothesis postulates that complex organisms are fundamentally less evolvable compared to simpler organisms as complex organisms are more pleiotropic. Fisher's geometric model and the microscope analogy are described. While theoretical predictions are available, empirical data has only become available recently. Measurement of pleiotropy and the extent and patterns of pleiotropy are being studies with the datasets that have been generated.

Measurement of pleiotropy is complicated by at least 2 different cases:

Closely linked genes affecting 2 different traits tend to be co-inherited and can be appear to be due to pleiotropic effects of a single gene.
Shared cis-regulatory element of 2 genes will affect the traits controlled by both genes. This has been called artefactual pleiotropy as pleiotropy can be defined as a character of a genes rather than that of a mutation.

Apart from the technical difficulties involved is measuring pleiotropic effects, conceptual problems like the definition of a "phenotypic trait or character" make it difficult to have an objective measure of pleiotropy.

QTL data tend to overestimate pleiotropic effects due to biases caused due to linked genes
Gene knockout and knockdown experiments avoid the problem of closely linked genes but is only able it only measures mutations that lead to complete loss of gene function
Mesures of pleiotropy depend on the number and type of traits that are measured in an experiment
Traits that are beyond the detection limits of methods used to measure traits also tend to lead to a distorted picture of pleiotropy
Theoretical methods to estimate pleiotropy have used the relationship between the genetic load, effective population size and effective dimensionality of the phenotype (average pleiotropy)
Distribution of fitness effects have also been used to predict phenotypic complexity based on predictions of FGM

Correlation among the traits also introduces many issues in estimation of pleiotropy.However, while universal pleiotropy was considered to be the norm, its being increasingly argued that the extent of pleiotropy is very minimal(variational modularity or restricted pleiotropy). With the datasets from Yeast, C.Elegans etc.. it is being seen that the degree of pleiotropy even with the upward biased estimates is rather low.

While the molecular basis of Pleiotropy is not well studied, two types of pleiotropy were suggested by Gruneberg way back in 1938. Type I (initially called genuine) pleiotropy refers to multiple molecular functions of a single gene product. Type II (initially called spurious) pleiotropy refers to multiple morphological & physiological effects of a single molecular function. While, recent studies have shown type II pleiotropy to be most prevalent. Hence, the pleiotropy seen today could be a result of new biological functions being assigned to the same genes which still have the same molecular functions. However, the extent of pleiotropy is not something that is settled and will probably require a lot of work that would require solving the various issues involved in measuring and analyzing pleiotropy.

Unix for foreach loop with range and vector

To loop through values in unix or linux bash shell one can make use of the "for" loop.

Example: Specify range

for i in {1..24};
do
echo $i
done

One can also use C style for loops like

for ((i=1;i<=25;i++));
do
echo $i
done

It can be used like the perl's foreach loop by specifying an array. Note that range can also be descending.

for i in {1..24} 25 50 {3..4} 100 150 {5..2} qw er st {a..z};
do
echo $i
done

You can also have strings and string range not just numbers.

Sunday, February 3, 2013

Epistasis - interaction between genes

Patrick C. Phillips has written a review on Epistasis as part of the Fundamental concepts in genetics series titled "Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems".

Why are crows black? It seems the answer has two components, the first being the mechanistic and easier to answer while the second is evolutionary. While it might not be the ultimate answer (which is of course 42), it does go much further towards the "why" than just the "how".

Epistasis is simply defined as "interaction between genes". However, in his earlier paper "The Language of Gene Interaction" he explains the 2 slightly different ways the term has been used.

William Bateson is credited with "inventing" the term in 1908-09 to explain the disagreement between the segregation ratios expected based on the action of separate genes and the actual results of a dihybrid cross. The action of one locus(epistatic) masking the effects of alleles at another locus (hypostatic) gives the effect of the epistatic locus "standing upon" the hypostatic locus. However, the term has expanded to cover,

Functional relationship between genes--Functional epistasis--protein-protein interactions
Genetic ordering of pathways--Compositional epistasis--"measures the effects of allele substitution against a particular fixed genetic background"
The quantitative differences of allele-specific effects--Statistical epistasis--"measures the average effect of allele substitution against the population average genetic background"

Epistasis as a Tool- Flower color in sweet peas

Non-Mendelian segregation ratio of 9:7 in the cross of two white flowers to produce violet flowers has been attributed to mutations in 2 different genes in the anthocyanin pathway. This framework has been extended to elucidate the order of genes in pathways by using knock-out mutations.

High-throughput approaches that test the effects of all possible combinations of genes in an organism are being done using comprehensive deletion and knockdown libraries along with high-throughput maintenance and screening methods. Both qualitative and quantitative experiments have tested different number of gene knockouts with various genetic backgrounds to understand the interactions between genes. However, the actual number of possible interactions is still a limiting factor. Moreover, interactions need not be a simple presence-absence effect, different expression levels, mutations at various positions in a gene etc can produce other possible combinations of interactions. While knockout libraries have been widely used in Yeast, it is not as easily applicable to many other systems. RNAi knockdown libraries have been used to overcome these limitations.

Presence of epistatic interactions has been an obstacle to the mapping of many traits to their genes. Recently, few traits with epistatic effects have been identified in human disease genetics. Evolutionary basis of the advent of epistatic effects has been explained by atleast 3 different models. Further analysis of cases involving epistasis is required to understand the role played by evolution in generating or driving evolution.

Tuesday, January 8, 2013

Let It Be

While we have to "Imagine" , we also have to "Let It Be". Much more popular than other songs of the Beatles, this song says "Mother Mary comes to me Speaking words of wisdom". One could argue, that songs popularity comes from its religious connotation, but could also be due to its "wisdom" coming from a mother(Paul's). Another, interpretation i have heard is "Mary" being a herb!! This also makes sense, as "And in my hour of darkness she is standing right in front of me".

This song offers a solution, "And when the broken hearted people living in the world agree There will be an answer". So we can wait for the "broken hearted" to agree and "Let it be".

Sunday, January 6, 2013

Imagine there is no heaven

"Imagine" as its so often called refers to a Song that has been interpreted in various ways, criticized, revered and even worshipped. The deceptively simple to understand lyrics and soothing music have probably contributed to its popularity in the 20th Century.

While many people have equated it to a communist recruitment song, it may indeed be a much bigger idea. In fact, you can imagine,whatever you want the song to mean.

The very idea of a world with "Nothing to kill or die for" is so appealing to any human being irrespective of her religion, political ideology, country etc... its surprising that we still live in a world torn apart by war and strife.

This probably is due to the human need to have something to kill or die for, a purpose for life even. Unless we have a different more attractive reason to "live for", we may just have to be content with dreams.

Only part of the song that baffles me is a "A brotherhood of man", why not a sisterhood of women? or better still a "Fellowship of Humanity"(this is actually the name of a church). One can only hope that, Lenon used the word brotherhood to mean "a group of people or all people"

A world with "no possessions" can be "Imagine"d today only as part of communism. Human pursuit of materialism can just be that "a pursuit". The irony of it was "probably" not lost on Thomas Jefferson who put "Life, liberty and the pursuit of happiness" in the US declaration of Independence. As the human race moves ahead it may "someday" come to realise that as long as we have greed we will have hunger "And the world will live as one".