Nagarjun's blog: 2014

Tuesday, December 16, 2014

More than 10% of structural variants in human genome are in potentially problematic regions

535,461,456 bases which corresponds to 16.22% of the genome are considered "problematic regions in the genome assembly". Looking at the structural variants identified by previous studies available in the UCSC tables, we find 26,934 (13.3 % of reported structural variants) overlap these problematic regions.

Study Overlapping Total Percent_overlapping_problematic_regions

1 1000 Genomes Consortium Phase 1 2067 20554 10.0564367

2 1000 Genomes Consortium Pilot Project 322 7967 4.0416719

3 Ahn et al 2009 825 3704 22.2732181

4 Alkan et al 2009 139 177 78.5310734

5 Altshuler et al 2010 135 844 15.9952607

6 Arlt et al 2011 347 2522 13.7589215

7 Banerjee et al 2011 4 733 0.5457026

8 Bentley et al 2008 405 5646 7.1732200

9 Campbell et al 2011 281 1180 23.8135593

10 Conrad et al 2006 89 672 13.2440476

11 Conrad et al 2009 1144 8489 13.4762634

12 Cooper et al 2008 135 311 43.4083601

13 de Smith et al 2007 303 1492 20.3083110

14 Feuk et al 2005 0 3 0.0000000

15 Forsberg et al 2012 0 2 0.0000000

16 Giglio et al 2002 0 1 0.0000000

17 Gusev et al 2009 15 211 7.1090047

18 Hinds et al 2006 2 87 2.2988506

19 Iafrate et al 2004 56 187 29.9465241

20 Itsara et al 2009 495 4806 10.2996255

21 Jakobsson et al 2008 165 1424 11.5870787

22 Ju et al 2010 321 1289 24.9030256

23 Kidd et al 2008 934 7380 12.6558266

24 Kidd et al 2010 1 20 5.0000000

25 Kidd et al 2010b 179 739 24.2219215

26 Kim et al 2009 126 1300 9.6923077

27 Korbel et al 2007 411 974 42.1971253

28 Levy et al 2007 839 10146 8.2692687

29 Locke et al 2006 233 353 66.0056657

30 McCarroll et al 2006 100 531 18.8323917

31 McCarroll et al 2008 282 1313 21.4775324

32 McKernan et al 2009 581 6925 8.3898917

33 Mills et al 2006 332 5321 6.2394287

34 Pang et al 2010 652 6096 10.6955381

35 Park et al 2010 844 5747 14.6859231

36 Perry et al 2008 991 2889 34.3025268

37 Perry et al 2008b 175 341 51.3196481

38 Pinto et al 2007 304 1029 29.5432459

39 Redon et al 2006 1372 3308 41.4752116

40 Schuster et al 2010 90 186 48.3870968

41 Sebat et al 2004 36 77 46.7532468

42 Shaikh et al 2009 1133 12844 8.8212395

43 Sharp et al 2005 81 105 77.1428571

44 Simon-Sanchez et al 2007 64 232 27.5862069

45 Stefansson et al 2005 0 1 0.0000000

46 Teague et al 2010 732 4144 17.6640927

47 Tuzun et al 2005 132 286 46.1538462

48 Wang et al 2007 213 1291 16.4988381

49 Wang et al 2008 235 2644 8.8880484

50 Wheeler et al 2008 17 23 73.9130435

51 Wong et al 2007 643 5011 12.8317701

52 Wong et al 2012b 3347 33069 10.1212616

53 Xu et al 2011 4599 25794 17.8297278

54 Young et al 2008 3 7 42.8571429

55 Zhu et al 2011 3 3 100.0000000

Should these structural variants be removed from the UCSC table? Or atleast do they require further validation.

Friday, December 12, 2014

Sequence divergence between human and chimp

Using the data about ancestral state from UCSC we calculate the fraction of sites that are different between Human and Chimp. We can see in below table that on average 14% of the sites are different between human and chimp. ChrX shows a much higher divergence at 23%, which is expected given its lower Ne.

Chromosome	Number of sites that are different	Total number of sites	Percent difference
chr1	574730	4097311	14.03
chr2	571968	4337056	13.19
chr3	460826	3601819	12.79
chr4	483915	3522581	13.74
chr5	432399	3222413	13.42
chr6	430879	3147588	13.69
chr7	426619	2928175	14.57
chr8	380886	2786423	13.67
chr9	333171	2283198	14.59
chr10	351451	2471725	14.22
chr11	354062	2492733	14.20
chr12	326365	2396839	13.62
chr13	234420	1761333	13.31
chr14	228533	1645955	13.88
chr15	207061	1516600	13.65
chr16	244166	1680348	14.53
chr17	215074	1476349	14.57
chr18	190200	1397643	13.61
chr19	207813	1157902	17.95
chr20	164209	1207849	13.60
chr21	106186	719102	14.77
chr22	113969	721194	15.80
chrX	439833	1906519	23.07
			14.54

When we look divergence in windows across chromsome 2 and chromosome X we see higher divergence at the centromere and telomere. (see figures below and note the difference in scale of y-axis).

The fraction of sites that are different between human and chimp in a window is correlated with the DAF's we looked at earlier. The correlation coefficient for chr2 is 0.4 and for chrX it is 0.2.

Thursday, December 4, 2014

openSNP MDS plot from plink

After downloading all the genotype data from the openSNP website, the first thing to do is to see population structure ofcourse!

Using AIM's (Ancestry Informative Markers) is a rather quick method to determine ancestry using a minimal set of markers. Various groups have used slightly different methods to come up with such markers. Instead of looking at these markers, we look at Lactose tolerance phenotype and the population structure in the genotypes with this trait.

The MDS plot (generated using plink after converting using opengwas) has one weird outlier that does not cluster. It is curious, but i hope they make the data available in a easier to use/standardised format so that i can dig into this.

The number of males that decided to reveal their gender [290/1379(21.03%) ] is double that of females [149/1379(10.8%) ]. Data of birth was shared slightly more reluctantly with 403/1379 (29.22%) sharing their DOB compared to the 439/1379(31.83%) that shared their gender. Not surprisingly most people who shared their gender also shared their DOB (116/149 females and 253/290 males). See below barplot of the distribution of DOB's with mean age of ~40 (minimum is 19 and max is 114).

Monday, December 1, 2014

Human chr15 SLC24A5 pigmentation gene

Coming back to being human, we look at chr15 DAF's. The most prominent feature is the short_arm (information not available). Having come across the database of recent positive selection across human populations (dbPSHP), i was hoping to see a clear signal of DAF near the SLC24A5 gene. It does not seem to be one of the more prominent DAF peaks. Infact, adding the SV2B gene onto the plot does not bring it near any of the prominent DAF peaks. Will have to see how many of my DAF peaks actually match up those in the dbPSHP.

The correlation coefficient of 0.3277578 between the mean derived allele frequency and Number of CNV's is line with the results from other chromosomes.

The code is not different from before:

 chr <- "chr15";    
 jpeg(paste("DAF.",chr,".jpeg",sep=""),width=1420)    
 par(mfrow=c(2,1))    
 read.table(file=paste("h.",chr,".mean.bed",sep=""),header=F,stringsAsFactors=F)->M    
 plot(as.numeric(M$V2),as.numeric(M$V4),xlab="Position along chromosome",ylab="Mean derived allele Frequency",main=chr)    
 lines(c(10000,17000000),c(0.2,0.2),col="red",lwd=5)    
 text(10000, 0.25,labels="short_arm",col="red")    
 lines(c(101981189,101991189),c(0.3,0.3),col="blue",lwd=5)    
 text(101981189, 0.35,labels="Telomere",col="blue")    
 lines(c(0,10000),c(0.3,0.3),col="blue",lwd=5)    
 text(0, 0.35,labels="Telomere",col="blue")    
 lines(c(48413169,48434589),c(0.25,0.25),col="brown",lwd=5)    
 text(48413169, 0.3,labels="SLC24A5 gene",col="brown")    
 lines(c(91643182,91844539),c(0.25,0.25),col="brown",lwd=5)    
 text(91643182, 0.3,labels="SV2B gene",col="brown")    
 read.table(file=paste("h.",chr,".countdgv.bed",sep=""),header=FALSE)->C    
 plot(C$V2,C$V4,xlab="Position along chromosome",ylab="Count of known structural variants",main=chr)    
 cor.test(as.numeric(M$V4),C$V4,method="spearman")    
 dev.off()

Tuesday, November 25, 2014

Allele frequncies of mutations nearer to genes take time to increase in freequency

Whole genome sequencing of time course evolving populations is now possible. In their experiment, they monitor the allele frequency of all alleles that reach a frequency greater than 0.1. The fact that the allele frequencies are correlated between time points in both methods used to measure allele frequencies suggests that their method is rather robust. (see correlation plot of allele frequencies at different time points[in minutes]).

We next explore the change in allele frequency as function of time and distance from the nearest gene. The below plot shows that the allele frequency increases near to genes only in the later stages of the experiment. The first bar represents all the sites nearest to the genes and the bars next to it are those further away from it. Over the time course, the allele frequency of the mutants increase nearer to the gene.

Thursday, November 20, 2014

Novel avian leptin gene is present across the bird phylogeny

Due to the importance of the leptin gene in endocrinology, its existence and behavior in birds have been the focus of considerable debate. The possible inclusion of human contaminant sequences into early bird genome assemblies seems to have been a major complicating factor in understanding Leptin genes in birds. Only recently, with the availability of the Pigeon genome with a well annotated leptin gene has it become evident that birds have a novel gene that has sequence similarity to the mammalian leptin. However, debate regarding its function, tissue specificity and expression patterns are yet to be resolved conclusively.

With the availability of numerous bird genome assemblies, courtesy of the NGS revolution in sequencing it is now possible to visualize and may be even quantify the prevalence of the leptin gene across different bird groups. Without debating the naming of the leptin like gene annotated in the Pigeon, we take it as given for all our analysis. A simple blastX search provides numerous significant hits.

Two of the significant hits are from birds, specifically Taeniopygia guttata leptin-like protein precursor and Melopsittacus undulatus leptin precursor. We also get many hits from reptiles before moving to mammals.

Tuesday, November 18, 2014

Optical map dot plot

Using Optical Maps to scaffold genomes of bacteria seems to be a fairly common practice. While it is yet to find widespread use in Eukaryotic species, it has found some success in species such as the cow, rice and parrot.

Here we use the assembled Optical map provided as an example with the SOMA package to demonstrate Optical map dot plot creation.

First, we perform an all vs all search between the optical map and in-silico digest of the genome using below script. The script has the parameter called search radius which decides how much flexibility is to be allowed while searching.

 #!/usr/bin/perl   
 use warnings;   
 use Math::Round qw(nearest);  
 use Set::IntSpan;  
 #  
 #perl Map_Optical.pl scaffold.silico soma_format_maps/1.map  
 ## Input parameters  
 %cutlens=();  
 my $roundoffto=10;  
 my $searchradius=200;  
 open CUTSITES, $ARGV[0] or die $!;  
 while($line = <CUTSITES>){  
 chomp $line;  
 my @header = split / /, $line;  
 my $scaffold=$header[0];  
 my $scaflen=$header[1];  
 my $totalcutsites= $header[2];  
 $line = <CUTSITES>;chomp $line;  
 my @cutsites = split / /, $line;  
 my $iso="";  
 my $cutcount=0;  
 my $previous=0;  
   foreach $iso (@cutsites){  
    my $cutlength=$iso-$previous;  
    if($previous>0){  
    $cutcount++;  
    $key=$scaffold . "#" . $cutcount;  
    $cutlens{$cutlength}{$key}=$previous . "_" . $iso;  
    }  
    $previous=$iso;  
    }  
          }#end of while loop  
 close CUTSITES;  
 #print "Done reading Silico-digest provided in file: $ARGV[0]\n";  
 my $mapsite=0;  
 my %mapmaps=();  
 open SOMAP, $ARGV[1] or die $!;   
 while($line = <SOMAP>){   
 chomp $line;  
 my @header2 = split / /, $line;  
 $mapsite++;  
 my $testvalue=$header2[0];  
 my $teststart=$testvalue-$searchradius;  
 my $testend=$testvalue+$searchradius;  
 #print "Searching for $testvalue lengths with a search radius of $searchradius.\n";  
 my $i="";  
 for($i=$teststart;$i<$testend;$i++){  
    foreach my $key (keys %{$cutlens{$i}}) {  
    my @keysp = split /\#/, $key;  
      my @beds= split /\_/, $cutlens{$i}{$key};  
 # scaffold     start     end     map_position     search_radius_hit     map_frag_len     cut_frag_len  
 #   print "$keysp[0]\t$beds[0]\t$mapsite\t$i\t$testvalue\t$keysp[1]\n";  
 #   print "$keysp[0]\t$beds[1]\t$mapsite\t$i\t$testvalue\t$keysp[1]\n";  
 #   print "$keysp[0]\tNA\tNA\n"  
      print "$keysp[0]\t$beds[0]\t$beds[1]\t$mapsite\t$i\t$testvalue\t$keysp[1]\n";  
    }  
 }  
 }#end of while loop  
 close SOMAP;

After performing a search of the Optical map, the output bed file can be converted into a dotplot file using below script:

 #!/usr/bin/perl  
 use warnings;  
 ##  
 ##perl getfull.pl genome.fa.fai assembled.map.bed   
 ### Input parameters  
 #  
 my %cumlen=();  
 my $cumcount=0;  
 open CUTSITES, $ARGV[0] or die $!;  
 while($line = <CUTSITES>){  
 chomp $line;  
 my @header = split /\t/, $line;  
 $cumlen{$header[0]}=$cumcount;  
 $cumcount=$cumcount+$header[1];  
 }  
 close CUTSITES;  
 open MAPSITES, $ARGV[1] or die $!;  
 while($line = <MAPSITES>){  
 chomp $line;  
 my @header = split /\t/, $line;  
 my $start=$header[1]+$cumlen{$header[0]};  
 my $end=$header[2]+$cumlen{$header[0]};  
 print "Map\t$start\t$header[3]\n";  
 print "Map\t$end\t$header[3]\n";  
 print "Map\tNA\tNA\n";  
 }  
 close MAPSITES;

The figure below shows the dotplot of the Optical map created using above scripts with a search radius of 200 base pairs. Increasing the search radius saturates the plot while allowing for more distant hits.

Thursday, November 13, 2014

Human chr6 also has the MHC region which shows a DAF peak

Human chr6 has another prominent peak of DAF, corresponding to the MHC region. Flanked by the genes MOG and COL11A2, it spans from approximately 29624758 to 33145670 in the hg19 assembly. Below figure shows the MHC region in blue with the flanking genes in brown. The number of CNV's also jumps up in this region.

 chr <- "chr6";    
 jpeg(paste("DAF.",chr,".jpeg",sep=""),width=1420)    
 par(mfrow=c(2,1))    
 read.table(file=paste("h.",chr,".mean.bed",sep=""),header=F,stringsAsFactors=F)->M    
 plot(as.numeric(M$V2),as.numeric(M$V4),xlab="Position along chromosome",ylab="Mean derived allele Frequency",main=chr)    
 lines(c(58830166,61830166),c(0.2,0.2),col="red",lwd=5)    
 text(63830166, 0.25,labels="Centromere",col="red")    
 lines(c(171105067,171115067),c(0.3,0.3),col="blue",lwd=5)    
 text(171105067, 0.35,labels="Telomere",col="blue")    
 lines(c(0,10000),c(0.3,0.3),col="blue",lwd=5)    
 text(0, 0.35,labels="Telomere",col="blue")    
 lines(c(57182422,57513376),c(0.25,0.25),col="brown",lwd=5)    
 text(57182422, 0.3,labels="PRIM2 gene",col="brown")    
 M[M$V2>57182422 & M$V3<57513376,]->N   
 points(N$V2,N$V4,pch=13,col="blue")   
 lines(c(33130469,33145670),c(0.35,0.35),col="brown",lwd=5)    
 text(39945670, 0.35,labels="COL11A2 gene",col="brown")    
 lines(c(29624758,29638656),c(0.35,0.35),col="brown",lwd=5)    
 text(25000000, 0.35,labels="MOG gene",col="brown")    
 lines(c(29624758,33145670),c(0.38,0.38),col="blue",lwd=5)    
 text(29624758, 0.42,labels="MHC",col="blue")    
 read.table(file=paste("h.",chr,".countdgv.bed",sep=""),header=FALSE)->C    
 plot(C$V2,C$V4,xlab="Position along chromosome",ylab="Count of known structural variants",main=chr)    
 cor.test(as.numeric(M$V4),C$V4,method="spearman")    
 dev.off()

Wednesday, November 12, 2014

Human chr6 DAF peak contains PRIM2 gene

Continuing the discussion on DAF's, we look at chr6 and specifically the PRIM2 gene region thought to be under balancing selection.

 chr <- "chr6";   
 jpeg(paste("DAF.",chr,".jpeg",sep=""),width=1420)   
 par(mfrow=c(2,1))   
 read.table(file=paste("h.",chr,".mean.bed",sep=""),header=F,stringsAsFactors=F)->M   
 plot(as.numeric(M$V2),as.numeric(M$V4),xlab="Position along chromosome",ylab="Mean derived allele Frequency",main=chr)   
 lines(c(58830166,61830166),c(0.2,0.2),col="red",lwd=5)   
 text(63830166, 0.25,labels="Centromere",col="red")   
 lines(c(171105067,171115067),c(0.3,0.3),col="blue",lwd=5)   
 text(171105067, 0.35,labels="Telomere",col="blue")   
 lines(c(0,10000),c(0.3,0.3),col="blue",lwd=5)   
 text(0, 0.35,labels="Telomere",col="blue")   
 lines(c(57182422,57513376),c(0.25,0.25),col="brown",lwd=5)   
 text(57182422, 0.3,labels="PRIM2 gene",col="brown")   
 M[M$V2>57182422 & M$V3<57513376,]->N  
 points(N$V2,N$V4,pch=13,col="blue")  
 read.table(file=paste("h.",chr,".countdgv.bed",sep=""),header=FALSE)->C   
 plot(C$V2,C$V4,xlab="Position along chromosome",ylab="Count of known structural variants",main=chr)   
 cor.test(as.numeric(M$V4),C$V4,method="spearman")   
 dev.off()

The correlation coefficient of 0.2559672 between the mean derived allele frequency and Number of CNV's is line with the results from chr2. The PRIM2 gene that has been shown to have high values of diversity and Tajima's D is located near the centromere and has high mean DAF values. The windows within the PRIM2 gene are marked by blue crosses.

Monday, November 10, 2014

Human DAF - Derived Allele Frequency plots

A map of recent positive selection in Human genome calculated DAF's apart from various other selection scan statistics. Here we calculate mean DAF's in 50Kb windows for each chromosome and see how it correlates with CNV density.

Each step and the commands used are documented here:

Use mysql to directly query the UCSC genome browser tables to download the ancestral state of Human SNP's in Chimp, Orangutan and Macaque along with the allele frequencies.

 for i in chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr22 chrX  
 do  
 echo "SELECT t2.chrom,t2.chromStart,t2.chromEnd,t2.name,t2.humanObserved,t2.humanAllele,t2.chimpAllele,t2.orangAllele,t2.macaqueAllele,t1.alleles,t1.alleleFreqs FROM hg19.snp141 AS t1, hg19.snp141OrthoPt4Pa2Rm3 AS t2 WHERE t1.name = t2.name AND t2.chrom='$i'"|mysql --user=genome --host=genome-mysql.cse.ucsc.edu -B -A > human."$i".csv  
 done

Next, we decide on the ancestral state of each site using parsimony. Sites in which Chimp, Orangutan and Macaque agree are easy to resolve. Similarly, when either the first two (Chimp and Orangutan) or the last two (Orangutan and Macaque) agree, the ancestral state can be decided. In all other cases, we discard the site altogether. Other methods that use Maximum likelihood to infer ancestral state have also been used.

 #!/usr/bin/perl  
 ## Input parameters  
 my $ancfreq = $ARGV[0];  
 open(FILE1, $ancfreq);  
 $header=<FILE1>;#read header  
 while($header=<FILE1>) {  
      $header=~s/\,/\t/g;  
      my @parts=split(/\t/,$header);  
 my $chrom=$parts[0];  
 my $start=$parts[1];  
 my $end=$parts[2];  
 my $chimpA=$parts[6];  
 my $OrgangA=$parts[7];  
 my $MacA=$parts[8];  
 my $anctype= $chimpA. $OrgangA. $MacA;$anctype=~s/(.)(?=.*?\1)//g;  
  if((length $anctype)==1){$found3++;}  
         else{  
             $anctype= $chimpA. $OrgangA;$anctype=~s/(.)(?=.*?\1)//g;  
             if((length $anctype)==1){$foundfirst2++;}  
                 else{  
                 $anctype= $OrgangA. $MacA;$anctype=~s/(.)(?=.*?\1)//g;  
                 if((length $anctype)==1){$foundlast2++;}  
                 elsif((length $anctype)>2){$trial++;$anctype="N";}  
                 else{$anctype="N";$notfound++;}  
                 }  
         }#end of else  
 my $a1=$parts[9];  
 my $a2=$parts[10];  
 my $a1freq=$parts[12];  
 my $a2freq=$parts[13];  
 my $derivedfreq=0;  
 if($anctype=~m/$a2/){$derivedfreq=$a1freq;}  
 elsif($anctype=~m/$a1/){$derivedfreq=$a2freq;}  
 if($derivedfreq>0){print "$chrom\t$start\t$end\t$derivedfreq\t$chimpA\t$OrgangA\t$MacA\t$anctype\n";}  
      }  
 close FILE1;  
 #hg38.snp141OrthoPt4Pa2Rm3.chrom    hg38.snp141OrthoPt4Pa2Rm3.chromStart  hg38.snp141OrthoPt4Pa2Rm3.chromEnd   hg38.snp141OrthoPt4Pa2Rm3.name hg38.snp141OrthoPt4Pa2Rm3.humanObserved hg38.snp141OrthoPt4Pa2Rm3.humanAllele hg38.snp141OrthoPt4Pa2Rm3.chimpAllele  hg38.snp141OrthoPt4Pa2Rm3.orangAllele  hg38.snp141OrthoPt4Pa2Rm3.macaqueAllele hg38.snp141.alleles   hg38.snp141.alleleFreqs

After deciding on the ancestral state, the derived allele frequency is printed for each of the sites that is retained. The mean DAF can then be easily calculated using the map command in bedtools.

 for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX  
 do  
 bedtools makewindows -g Homo_sapiens.GRCh37.57.dna.concat.fa.fai -w 50000|awk -v c="$i" '$1==c{print $0}' > "$i".50k.bed  
 perl Afreqbed.pl human."$i".csv > h."$i".bed  
 bedtools map -a "$i".50k.bed -b h."$i".bed -c 4 -o mean > h."$i".mean.bed  
 done

The mysql interface to UCSC genome browser makes it easy to download all know structural variants from numerous publications.

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -B -A -e "SELECT * FROM hg19.dgvMerged;" > human_dgvMerged.csv

Just retaining the co-ordinates (along with length) of the CNV's we create a bed file with below command:

 grep -v "^bin" human_dgvMerged.csv|awk '{print $2,$3,$4,($4-$3)}'|sed 's/ /\t/g' >dgv.bed

Now that we have everything in the bed format, use the "map" command from bedtools to calculate count of CNV's and plot it for chromosome 2.

 for i in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX  
 do  
 bedtools map -a "$i".50k.bed -b dgv.bed -c 3,4 -o count,sum > h."$i".countdgv.bed  
 done

 chr <- "chr2";  
 jpeg(paste("DAF.",chr,".jpeg",sep=""),width=1420)  
 par(mfrow=c(2,1))  
 read.table(file=paste("h.",chr,".mean.bed",sep=""),header=F,stringsAsFactors=F)->M  
 plot(as.numeric(M$V2),as.numeric(M$V4),xlab="Position along chromosome",ylab="Mean derived allele Frequency",main=chr)  
 lines(c(92326171,95326171),c(0.6,0.6),col="red",lwd=5)  
 text(92326171, 0.65,labels="Centromere",col="red")  
 lines(c(243189373,243199373),c(0.5,0.5),col="blue",lwd=5)  
 text(243189373, 0.55,labels="Telomere",col="blue")  
 lines(c(0,10000),c(0.5,0.5),col="blue",lwd=5)  
 text(0, 0.55,labels="Telomere",col="blue")  
 lines(c(90545103,91545103),c(0.7,0.7),col="green",lwd=5)  
 text(90545103, 0.75,labels="heterochromatin",col="green")  
 lines(c(136545415,136594750),c(0.6,0.6),col="brown",lwd=5)  
 text(136545415, 0.65,labels="LCT gene",col="brown")  
 read.table(file=paste("h.",chr,".countdgv.bed",sep=""),header=FALSE)->C  
 plot(C$V2,C$V4,xlab="Position along chromosome",ylab="Count of known structural variants",main=chr)  
 cor.test(as.numeric(M$V4),C$V4,method="spearman")  
 dev.off()

We see a positive correlation between mean derived allele frequency and number of known CNV's per 50Kb window with a rho of 0.2496694.

Monday, November 3, 2014

Liftover - what gets lost in translation?

Many genomic analysis require the conversion of genomic coordinates of one genome into another. When a new version of the genome assembly is created, all the data (such as annotation, recombination map, variant calls, genomic features) associated with the older version of the genome need to be "lifted" over to the new genome.

Similarly, when performing comparative genomic analysis, the homologous regions of genomes need to be identified between genomes that are sometime rather distant. The initial draft annotations of genomes use the lifted over annotations from high quality genomes as the primary raw material. This is an extremely important source of information when RNAseq data is not available.

Given the increasing importance of such lifting over of genomic coordinates, new tools that can perform this rather simple sounding task are increasing in number. Apart from the established liftOver utility from UCSC, CrossMap, Kraken, NCBI genome remapping service also provide well established pipelines for converting the genomic coordinates for files in different formats such as bed, gff, SAM(& BAM), VCF, Wiggle, BigWig, etc. The UCSC liftOver tool has been implemented in the GALAXY suite, used in VTools, pyliftOver and rtracklayer (a Bioconductor package). All implementations of the UCSC liftover tool as well as CrossMap use the chain files that require many steps to create. If you are lucky enough to be working on one of the genome supported by UCSC, you can download the ready to use chain files directly from the UCSC website.

The original paper that introduced the liftOver utility has some useful technical information about how the the process works and also performance in different regions of the genome. Kraken, uses a different method that does not require all vs all alignments to perform the lifting over of coordinates. One the key features of the liftOver utility is its ability to "handle large gaps" in alignment. Due to its robustness and ease of use, the liftOver utility has been used in numerous studies in very different studies.

In a series of blog posts, we will analyze and quantify the performance of the UCSC liftOver utility.

Friday, October 31, 2014

Single nucleotide polymorphisms that are associated with traits based on genome wide studies in human populations are under-represented in known repeat regions

Thanks to tips for dealing with the GWAS catalog, it would be easy to see how many of the GWAS SNP's are in repeats. However, the GWAS catalog has its very own quirks such as naming "chrX" as "23". Hence, downloading a bed file of the GWAS catalog from the UCSC genome browser makes things much simpler.

Running below commands shows 21,824 GWAS SNP's of which 16,763 are unique. All coordinates are in GRCh38/hg38. The "rmsk" table was downloaded from the UCSC genome browser with 5,520,017 records spanning 1,588,381,100 base pairs. This table comes with annotation for repeats with details such as repeat name, repeat class and repeat family. The table had 15,474 unique repeat names.

 cut -f 1-2 UCSC_GWAS.bed |grep -v "^#"|wc -l  
 cut -f 1-2 UCSC_GWAS.bed |grep -v "^#"|sort -u|wc -l

sort -k1,1 -k2,2n UCSC_GWAS.bed > sorted.UCSC_GWAS.bed

sort -k1,1 -k2,2n UCSC_rmsk.bed > sorted.UCSC_GWAS.bed

Both of the UCSC table based bed files are sorted for ease of use. Out of the 21,824 GWAS SNPs 6,618 (5222 unique) overlapped annotated repeats.

 intersectBed -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed -wa -wb|cut -f 1,2,3,23,24,25,28,29,30 > repeats.gwas.hits

One can perform a fishers exact test (implemented in bedtools) to check if this overlap/non-overlap is significant, given a particular genome size. One would expect that SNP's associated with traits are less likely to be in repeat regions than expected by chance.

 wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz  
 gunzip GCA_000001405.15_GRCh38_full_analysis_set.fna.gz  
 samtools faidx GCA_000001405.15_GRCh38_full_analysis_set.fna  
 cut -f 1-2 GCA_000001405.15_GRCh38_full_analysis_set.fna.fai|sort -k1,1 -k2,2n > genome.file  
 bedtools fisher -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed -g genome.file

The left tail of the test is significant. This is in line with our expectation that the probability of a site located in a repeat to be not associated with a trait is higher than a site not located in a repeat region.

 # Contingency Table  
 #_________________________________________  
 #      | not in -b  | in -b    |  
 # not in -a | 1623119853  | 11545    |  
 #   in -a | 1586321308  | 5222     |  
 #_________________________________________  
 # p-values for fisher's exact test  
 left  right  two-tail    ratio  
 0.00000 1.00000 0.00000 0.463

Apart from the fisher's test, bedtools has a utility called "jaccard" that provides a single statistic measuring the similarity of the two sets. This statistic ranges from 0 to 1, with 1 being complete identity between the two sets. In the comparison of repeats and GWAS SNP's the score is very small, showing that the sets are not similar.

 bedtools jaccard -a sorted.UCSC_GWAS.bed -b sorted.UCSC_rmsk.bed   
 intersection     union-intersection     jaccard     n_intersections  
 5222     1586338075     3.29186e-06     5221

Comparing the prevalence of GWAS SNP's in repeats with different levels of divergence from the repeat consensus sequence can tell us some useful things. For example, one would expect the GWAS SNP's to be in regions with higher divergence from the consensus sequence, both due to the ability of these regions to have acquired a function as well as our ability to detect associations in these regions. However, it might be harder to assemble and annotate the younger repeat classes compared to older more diverged repeats. Hence, a comparison of the repeat classes might not be ideal.

 intersectBed -b sorted.UCSC_GWAS.bed -a repeat.div.bed -wa > repeat.div.intersect

We created a bed file with the sequence divergence from the repeat consensus using the data available on the repeatmasker website. This was intersected with the GWAS SNP's. We calculated the lengths of the genome that each level of sequence divergence occupies. We then compared the number of GWAS SNP's located in each of these levels as a fraction.

 read.table(file="repeat.div.bed",header=FALSE)->M  
 abs(M$V3-M$V2)->M$V5  
 round(M$V4)->M$V6  
 aggregate(M$V5,by=list(M$V6),sum)->C  
 read.table(file="repeat.div.intersect",header=FALSE)->N  
 abs(N$V3-N$V2)->N$V5  
 round(N$V4)->N$V6  
 aggregate(N$V5,by=list(N$V6),sum)->R  
 merge(C,R,by="Group.1",all=TRUE)->CR  
 CR$x.y/CR$x.x->CRR  
 barplot(t(data.matrix(CRR)),beside=TRUE,col=c("darkblue"))  
 quantile(CRR,na.rm=TRUE)  
      0%     25%     50%     75%     100%   
 2.221916e-05 1.114180e-03 2.097250e-03 2.967695e-03 4.996991e-03

From the above quantiles, it appears that the third and second quantile have most of the GWAS SNP's.

 > shapiro.test(CRR)  
      Shapiro-Wilk normality test  
 data: CRR  
 W = 0.971, p-value = 0.302

Doing the shapiro test, we are not able to reject the null hypothesis that the data are normally distributed.Based on this, one might say that irrespective of the "age" of the repeat, we are able to detect GWAS SNP's located in them.

The actual number of GWAS SNP's in each of the repeat classes is given below:

       DNA      DNA?      LINE Low_complexity      LTR   
       667       1      3200       13      1428   
      LTR?       RC      rRNA   Satellite     scRNA   
        8       2       1       7       2   
  Simple_repeat      SINE     snRNA    Unknown   
       100      1175       2       12

Number of GWAS SNP's by repeat family looks like below:

    5S-Deu-L2      Alu     centr      CR1      DNA   
        3      322       5      103       4   
      DNA?    Dong-R4      ERV1     ERV1?      ERVK   
        1       6      417       1       22   
      ERVL   ERVL-MaLR     ERVL?     Gypsy     Gypsy?   
       401      532       6       35       7   
       hAT     hAT-Ac hAT-Blackjack  hAT-Charlie   hAT-Tip100   
        2       2       17      318       67   
   hAT-Tip100?    Helitron       L1       L2 Low_complexity   
        1       2      1976      1066       13   
       LTR      LTR?      MIR    PiggyBac   PiggyBac?   
        7       8      841       1       2   
      rRNA    RTE-BovB     RTE-X   Satellite     scRNA   
        1       15       34       2       2   
  Simple_repeat     snRNA TcMar-Mariner   TcMar-Tc2  TcMar-Tigger   
       100       2       3       4      246   
      tRNA    tRNA-Deu    tRNA-RTE    Unknown   
        3       2       4       12

Wednesday, October 1, 2014

Drosophila transcriptome reveals interesting patterns in splicing diversity

The Drosophila transcriptome has been analysed in different conditions to understand its diversity. Here, we re-analyse the supplementary materials(only from supplementary table-3) and find "interesting" patterns in splicing diversity.

 read.table(file="supp3",header=T)->S  
 #calculate percentage of genes with more than 1 isoform  
 length(S$transcripts[S$transcripts>1])*100/ length(S$transcripts)  
 #calculate percentage of genes with more than 1 protein  
 length(S$proteins[S$proteins>1])*100/ length(S$proteins)  
 summary(S)

Over half of the 17,564 annotated genes (10,136; 57%) have more than one transcript isoform. However, only 37% (6,584) of the genes have more than one protein, suggesting that most of the transcript diversity is outside the protein coding regions. While the genes have on average 17.39 transcripts, only 2.59 proteins are found on average per gene.

This contrast between the number of transcripts and proteins is very pronounced in genes with very large number of transcripts. Infact, the gene "gish" which has 18,972 transcripts has only 142 proteins. See below plot that shows the number of records with the words "exon" and "CDS" from the "gish" gene in the new annotation.

The above figure clearly shows that most "exon" records occur at the beginning and end of the gene. The "CDS" records are restricted to the core of the gene. The same pattern is better captured by the authors of the study using the new metric `per cent spliced in’ (Ψ) index.

Thursday, September 25, 2014

Genome base pair correction tool

High rates of sequencing errors in Next-generation sequencing methods can percolate down into genome assemblies. Presence of incorrect bases in genome assemblies can have an impact on analysis that use the genome assembly for functional analysis.While programs like REAPR are able to correct genome assemblies based on mapped reads, it is focused on identifying large scale structural errors in genome assemblies.

Draft genome assemblies are being generated for a large number of species using Next-generation sequencing methods. Here we present BCD (Base Correction Don), a tool that conservatively identifies single base-pair level errors in genome assembly sequences based on paired-end reads mapped to the genome.

 #!/usr/bin/perl  
 #use strict;  
 use warnings;  
 use List::Util qw(min max);  
 #perl fixbases.pl example.fasta example.coverage > example.corrected.fasta  
 # Input parameters  
 my $genome = $ARGV[0];  
 my $coverage = $ARGV[1];  
 my $totalcount=0;  
 my $minreadcoverage=20;# minimum number of reads required at site to consider it  
 my $maxreadcoverage=200;  
 my $maxindelcount=5;  
 open FASTA1, $genome or die $!;  
 open LOGS, ">test.bed" or die $!;  
 my $seqst_temp="";  
 my %seqs = ();  
 while($line = <FASTA1>){  
     if($line=~ /^>/){  
      if($header){$seqs{$header}=$seqst_temp;}  
      chomp $line;$header="";$header=$line;$seqst_temp="";  
     }  
     else{$line =~ s/[\n\t\f\r_0-9\s]//g;$seqst_temp .= $line;}  
  }#end of while loop  
     if($header){$seqs{$header}=$seqst_temp;}  
 close FASTA1;  
 open FASTA2, $coverage or die $!;  
 while($line = <FASTA2>){  
 chomp $line;$line=$line . "\t}";  
 my @jelly=split(/\s+/,$line);  
 splice @jelly, 4, 0, 'SnowWhite', 'Humbert';  
 my $scaf=$jelly[0];  
 my $pos=$jelly[1];  
 my $oldbase=$jelly[2];  
 my $totalreads=$jelly[3];  
 my @jellyA=split(/\:/,$jelly[7]);#A  
 my @jellyC=split(/\:/,$jelly[8]);#C  
 my @jellyG=split(/\:/,$jelly[9]);#G  
 my @jellyT=split(/\:/,$jelly[10]);#T  
 my @jellyN=split(/\:/,$jelly[11]);#N  
 my $indelcount=0;  
 if(!($jelly[12]=~ /\}/)){  
 my @jellyindel=split(/\:/,$jelly[12]);#indel  
 $indelcount=$jellyindel[1];  
 }  
 my @counts;  
 push @counts,$jellyA[1];  
 push @counts,$jellyC[1];  
 push @counts,$jellyG[1];  
 push @counts,$jellyT[1];  
 push @counts,$jellyN[1];  
 push @counts,$indelcount;  
     if(($totalreads>$minreadcoverage)&&($indelcount<$maxindelcount)&&($totalreads<$maxreadcoverage)){$totalcount++;  
 my $maxbase= max @counts;  
 my $maxbasepercent=$maxbase*0.1;  
 my $maxbaseval=$oldbase;  
 if($jellyA[1]==$maxbase){$maxbaseval="A";}  
 elsif($jellyC[1]==$maxbase){$maxbaseval="C";}  
 elsif($jellyG[1]==$maxbase){$maxbaseval="G";}  
 elsif($jellyT[1]==$maxbase){$maxbaseval="T";}  
 elsif($jellyN[1]==$maxbase){$maxbaseval="N";}  
 else{$maxbaseval=$oldbase;}  
 my $maxbaseval2=lc $maxbaseval;  
 if((($oldbase=~ /[Aa]/)&&($jellyA[1]==0))||(($oldbase=~ /[Cc]/)&&($jellyC[1]==0))||(($oldbase=~ /[Gg]/)&&($jellyG[1]==0))||(($oldbase=~ /[Tt]/)&&($jellyT[1]==0))){  
 print LOGS "$scaf\t$pos\t$pos\t$maxbaseval\t$oldbase.\t$jellyA[1]\t$jellyC[1]\t$jellyG[1]\t$jellyT[1]\t$jellyN[1]\tzero\t$totalreads\n";  
 my $key=">" . $scaf;  
 my $prepos=$pos-1;#1 based positions  
 substr($seqs{$key},$prepos,1)= $maxbaseval;  
 }  
 elsif((($oldbase=~ /[Aa]/)&&($jellyA[1]<$maxbasepercent))||(($oldbase=~ /[Cc]/)&&($jellyC[1]<$maxbasepercent))||(($oldbase=~ /[Gg]/)&&($jellyG[1]<$maxbasepercent))||(($oldbase=~ /[Tt]/)&&($jellyT[1]<$maxbasepercent))){  
 print LOGS "$scaf\t$pos\t$pos\t$maxbaseval\t$oldbase.\t$jellyA[1]\t$jellyC[1]\t$jellyG[1]\t$jellyT[1]\t$jellyN[1]\tpercent\t$totalreads\n";  
 my $key=">" . $scaf;  
 my $prepos=$pos-1;#1 based positions  
 substr($seqs{$key},$prepos,1)= $maxbaseval;  
 }  
     }  
  }#end of while loop  
 close FASTA2;  
 my $iso;  
 foreach $iso (sort keys %seqs) {  
 print "$iso\n$seqs{$iso}\n";  
 }  
 #scaffold_0   50   c    45   GENOMEseq    {    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:1:29.00:44.00:29.00:0:1:0.06:0.00:0.00:0:0.00:100.00:0.03 G:43:28.49:64.21:27.42:20:23:0.59:0.04:234.35:20:0.62:97.95:0.44    T:1:29.00:68.00:29.00:1:0:0.92:0.16:659.00:1:0.53:98.00:0.53  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00   }  
 #scaffold_0   1    g    43   =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00   G:40:33.40:59.08:31.00:19:21:0.00:0.03:0.00:0:0.00:0.00:0.00  T:2:31.00:33.00:31.00:0:2:0.00:0.05:0.00:0:0.00:0.00:0.00    N:1:60.00:33.00:25.00:1:0:0.00:0.05:0.00:0:0.00:0.00:0.00  
 print LOGS "$totalcount\n";  
 close LOGS;

All paired end data used for performing the genome assembly were mapped back to the genome assembly using bwa read mapper with default settings. Number of reads supporting each of the four bases at individual positions was tabulated using the program bam-readcount. The above script was then used to correct the genome for single base pair errors.

Friday, August 22, 2014

Extract sequences from multi fasta file in a particular order

 #!/usr/bin/perl  
 use warnings;  
 #perl extractInOrder.pl test.fasta test.list  
 # Input parameters  
 open FASTA, $ARGV[0] or die $!;  
 my $seqst_temp="";  
 my $seqs = ();  
 while($line = <FASTA>){  
 if($line=~ /^>/){  
 if($header){  
 $header=~s/>//;  
 $seqs{$header}=$seqst_temp;  
 }  
 chomp $line;  
 $header="";  
 $header=$line;  
 $seqst_temp="";  
 }  
 else{  
 $line =~ s/[\n\t\f\r_0-9\s]//g;  
 $seqst_temp .= $line;  
 }  
 }#end of while loop  
 if($header){  
 $header=~s/>//;  
 $seqs{$header}=$seqst_temp;  
 }  
 close FASTA;  
 open FASTA, $ARGV[1] or die $!;  
 while($line = <FASTA>){  
 chomp $line;  
 $line=~s/>//;  
 if(exists $seqs{$line}){print ">$line\n$seqs{$line}\n";}  
 else {  
 print "Sequence header $line does not exist in fasta\n";  
 exit();  
 }  
 }  
 close FASTA;

Merge multiple kmer count hashes into one

A previous attempt at merging two kmer count hashes was neither memory efficient nor capable of merging multiple kmer count hashes. Here, we use the age old trick of sorting to write a more memory efficient script that can handle as N number of hashes.

 cat *_"$kmer"_counts.fa|sort > sorted_"$kmer"_all.fa

The above command will concatenate all the hashes and sort it. This sorted file can then be used by the below perl script to merge the hashes. Since, all kmers that need to be merged are in adjacent lines, the memory needed for merging is drastically reduced compared to the previous script.

 #!/usr/bin/perl  
 use warnings;  
 # Input parameters  
 open FASTA1, $ARGV[0] or die $!;  
 my $previous="Kmer";  
 my $previousCount="Kmercount";

 my @jelly;

  while($line = <FASTA1>){  
 chomp $line;  
 @jelly=split(/\s+/,$line);  
      if($previous=~/$jelly[0]/){  
      $previousCount=$previousCount+$jelly[1];  
      }  
      else{  
      print "$previous\t$previousCount\n";  
      $previous=$jelly[0];$previousCount=$jelly[1];  
      }  
 }  
      #printing last line if it needed merging  
      if($previous=~/$jelly[0]/){  
      print "$previous\t$previousCount\n";  
      }  
 close FASTA1;

Tuesday, August 12, 2014

Merge two kmer count hashes into one

 #!/usr/bin/perl  
 use warnings;  
 # Input parameters  
 open FASTA1, $ARGV[0] or die $!;  
 open FASTA2, $ARGV[1] or die $!;  
 my %seqs = ();  
 while($line = <FASTA1>){  
 chomp $line;  
 my @jelly=split(/\s+/,$line);  
 $seqs{$jelly[0]}=$jelly[1];  
 }  
 close FASTA1;  
 while($line = <FASTA2>){  
 chomp $line;  
 my @jelly=split(/\s+/,$line);  
 if(exists $seqs{$jelly[0]}){$seqs{$jelly[0]}=$seqs{$jelly[0]}+$jelly[1];}  
 else{$seqs{$jelly[0]}=$jelly[1];}  
 }  
 close FASTA2;  
 foreach $iso (sort keys %seqs) {  
 print "$iso\t$seqs{$iso}\n";  
 }

Will take two lists of kmer counts and merge them into one. A kmer count list consists of two columns. The first column being the kmer itself and the second being its count.

This might be useful in merging the output of a program like jellyfish after running it on each chromosome separately. While jellyfish has a merge function, it requires the hashes to be of equal size.

May be this can be re-written as a perl one liner...

Ka ka ka ka ka ka ka ka

Previous crow stuff has been covered by Jerry Coyne himself. This in turn stimulates lot of debate on what constitutes a species. Fortunately, most people agree that understanding the genetics is more important and relevant than fighting over the definition of species.

Like the title of this post, the call of a crow can be written as "ka ka ka ka ka" or "kaw kaw kaw kaw" or "Caw caw caw caw". They are all trying to capture the same thing. As long as the underlying genetics is well understood, it should be ok to not worry about the definition for now. Hopefully, over time our understanding of genetics will be so good that we may even come up with a single definition that most of us can agree on.

Monday, August 11, 2014

Get overlapping sequences from multifasta file

 #!/usr/bin/perl  
 #use strict;  
 use warnings;  
 # Input parameters  
 open FASTA, $ARGV[0] or die $!;  
 my $seqst_temp="";  
 my %seqs = ();  
 my $iso="";  
 my $maxlen=0;  
 my $maxval="";  
 while($line = <FASTA>){  
 if($line=~ /^>/){  
 if($header){  
 $seqs{$header}=$seqst_temp;  
 }  
 chomp $line;  
 $header="";  
 $header=$line;  
 $seqst_temp="";  
 }  
 else{  
 $line =~ s/[\n\t\f\r_0-9\s]//g;  
 $seqst_temp .= $line;  
 }  
 }#end of while loop  
 if($header){  
 $seqs{$header}=$seqst_temp;  
 }  
 close FASTA;  
 $maxlen=0;  
 foreach $iso (sort keys %seqs) {  
 my $line1=$iso;  
 my $line2=$seqs{$iso};  
 my $flag=0,$overlap=500,$length=1000;  
 $seqlen=length $line2;  
 while(($seqlen-$flag)>$length){  
 if(($seqlen-($flag+$overlap))<$length){  
 $length=$seqlen-$flag;  
 }  
 $nextseq=substr $line2,$flag,$length;  
 print $line1.":".$flag."-".($flag+$length)."\n";  
 print $nextseq."\n";  
 $flag=$flag+$overlap;  
 }#end of seqlen while loop  
 }

Thursday, July 17, 2014

As the crow forks

We had our work on crows (The genomic landscape underlying phenotypic integrity in the face of gene flow in crows) published in the journal Science. Very interesting interpretations of the work ranging from it being proof for evolution to reasoning for human assortative mating in a political context have been presented. It has also been covered by the Swedish Radio, Der Spiegel (a popular German weekly) and few other offline sources.

A well written description of the science that puts the work in a broader context is also available.[It should also be noted that a description of another paper about the human polymorphic inversion 17q21 from the same author was a useful resource while looking at the ancestral state of our fixed differences.]

The whole genome was scanned for differences in DNA sequence content between carrion and hooded crows using 60 crows. While most of the genome is the same, small regions of the genome do show differences in DNA sequence and gene expression. These differences in DNA sequence are localized to specific regions of the genome. Infact all 81 of 82 fixed differences that we find are within a 2 megabase region.

Snapshots of a region of the genome with fixed differences seen between hooded and carrion crow sequencing reads provides a very visual idea of how these fixed differences "look". Each of the boxes in the figure is a sequencing read (from the bam file) mapped to the reference genome. Positions which differ from the reference genome have the actual base that differs from the reference in each read.

Below figure shows the sequencing reads from one crow from Poland. Note that all the positions are similar to the reference genome. All 30 hooded crow individuals (from Sweden and Poland) look exactly like this at this base.

The next figure shows the sequencing reads from one crow from Germany at the same position as the hooded crow above. However, the blue box has "C" written in all its reads. At this position the Carrion crow has a base that is different from the Hooded crow reference genome. All the 30 carrion crows (from Spain and Germany) have this "C" base at this position.

Similar to the above position, we identify 81 other sites that show such a "fixed difference". The rest of the genome seems to be extremely similar between the carrion and hooded crows.

Friday, June 27, 2014

The Elephant Catchers

It has been 6 years since i read and blogged about Subroto Bagchi's book Go kiss the world. As with the previous book, finished reading the book "The Elephant Catchers" in one sitting. The writing style is very captivating and keeps you reading.

The Elephant Catchers is all about growing a company that is hunting rabbits (small game) into one that now catches elephants. While many people start a startup, very few can make it into a self-sustaining and growing company. The book deals with the mistakes that happen and how to go about "fixing" those mistakes.

Many anecdotes from different points along the way makes for an interesting read. The use of personification at different points to explain very different stories is very appealing. For example, at one point the company wins a big client who could be persuaded to hire a dog that is loyal instead of cat that is pleasing.

Dividing the book into 6 parts with sub-chapters keeps it well organized. All of them have one thing in common, it is about scaling things. Be it scaling ideas or infrastructure the wisdom of having been there and having done that shines through brilliantly.

Monday, June 16, 2014

Largest known protein - Titin

With a length of more than 30Kb Titin is the largest known protein in the Human genome. Due to its repeated use of the same domains, it is thought have undergone very unpredictable exon losses.

Apart from the biological reality of exon loss, the incredible length of the gene makes it prone to annotation errors. Prevalence of such large scale annotation errors makes it impossible to study the intricacies of the biology of such a gene. This is an attempt to identify such potential errors in annotation. The hope is that it will contribute to improving the annotation.

Mutations in the Titin gene have been implicated in many diseases. Being the longest gene also makes it interesting from an evolutionary point of view.

Chicken:

The human version of the gene (located on chr-2) has 363 annotated exons as per Human release 75 of Ensemble. The chicken version of the gene (located on chr-7) has only 47 annotated exons as per Chicken release 75 of Ensemble. Flanking genes are PLEKHA3 and CCDC141. However, a new gene, "ENSGALG00000026366" has been annotated in Chicken between PLEKHA3 and TITIN. While it has a name like "gga-mir-7474" and a link to the mirbase, it has gene type as protein coding. If that was not confusing enough, this gene has an aminoacid length of ~30Kb (has 269 exons). Given its location and length, it appears that the Titin gene has been incorrectly split into two genes (Titin itself and gga-mir-7474). As expected, the two genes are connected by a chicken cDNA EST (see figure below).

This "gga-mir-7474" gene gets top blast hit from Titin.

So based on EST data and blast data these two genes can be merged as part of the Titin gene. We are still short by 47 more exons.

Flycatcher:

The release 75 of Ensemble has a ~61Kb long TTN gene with 106 exons annotated in Flycatcher. Two very short genes (with 1 and 2 exons) downstream from this TTN gene, ENSFALT00000015943 and ENSFALT00000003626 also give top blast hit to the TTN gene.

Even with availability of EST evidence, some very good objective predictions, the annotation seems rather sketchy.