Wednesday, January 22, 2014

Nth order Markov chain in perl given stationary distribution and transition probability matrix

A Markhov chain can be simulated given a transition probability matrix.

 use warnings;  
 use strict;  
 open MODELS, $ARGV[0] or die $!;  
 my $cyclestorun=20;  
 my %model=();  
 my $line = <MODELS>;chomp $line;  
 my @headers=split(/\t/,$line);  
 my @values="";  
 my @stationVals;  
 my @stationDist;  
 my $check=0;  
 my $rnd=0;  
 my $test=0;  
 my $index=0;  
 my $startval="";  
 my $preval="";  
 my $nextval="";  
 my $i=0;  
 my $j=0;  
 my $stat=0;  
 my $presub="";  
 while($line = <MODELS>){  
      chomp $line;@values=split(/\t/,$line);$check=0;  
      for($i=2;$i<scalar @headers;$i++){  
      if($check>1){print "Error in Transition probability matrix in line: $line\n";exit;}  
 #     print "$values[0]\t$headers[$i]\t$model{$values[0]}{$headers[$i]}\n";  
 }#end of while loop  
 close MODELS;  
      print "\t\t\tInitialising stationary distribution\n\n\n";  
 foreach $stat(@stationDist){  
      if($check>1){print "Error in Stationary distribution.Probability greater than 1!\n";exit;}  
 foreach $stat(@stationDist){  
      #exit the foreach loop  
      print "\t\tInitial step\t$rnd\t$startval\n\n";  
 $preval=$startval; $test=0;  
      for($i=2;$i<scalar @headers;$i++){  
           #exit the foreach loop  
      if((length $preval)>1){$presub=substr $preval,1,1;}  
      $preval=$presub . $nextval;  
      print "\t\t$j\t$rnd\t$preval\n";  
 }#end of cyclestorun loop  

Example cases:

First order Markhov chain:
The "stat" column corresponds to the stationary matrix.

      stat     A     G     T  
 A     0.1     0.1     0.5     0.4  
 G     0.5     0.9     0.1     0  
 T     0.4     0.3     0.3     0.4  


 perl Markhov1.model  
                Initialising stationary distribution  
           Initial step     0.566160679402859     G  
           1     0.726538222140761     A  
           2     0.307280103481514     G  
           3     0.892182052225458     A  
           4     0.336793610721617     G  
           5     0.205083725028075     A  
           6     0.523360180106774     G  
           7     0.376126474509043     A  
           8     0.385466656638901     G  
           9     0.357110639627308     A  
           10     0.940985077542603     T  
           11     0.567528252945955     G  
           12     0.215130112396754     A  
           13     0.737550170708133     T  
           14     0.372513088005231     G  
           15     0.87686981502349     A  
           16     0.534856061888437     G  
           17     0.94194553431501     G  
           18     0.800215938891096     A  
           19     0.990931721477871     T  
           20     0.371922711443307     G  

Second order Markhov chain:

      stat     A     G     T  
 AA     0.1     0.1     0.5     0.4  
 AG     0.05     0.9     0.1     0  
 AT     0.04     0.3     0.3     0.4  
 GA     0.2     0.1     0.5     0.4  
 GG     0.1     0.9     0.1     0  
 GT     0.1     0.3     0.3     0.4  
 TA     0.1     0.1     0.5     0.4  
 TG     0.1     0.9     0.1     0  
 TT     0.2     0.3     0.3     0.4  


 perl Markhov2.model  
                Initialising stationary distribution  
           Initial step     0.625368699737923     TA  
           1     0.590942442973272     AG  
           2     0.0483144526753101     GA  
           3     0.0103876021193585     AA  
           4     0.367650618439331     AG  
           5     0.973362473523579     GG  
           6     0.310508404063821     GA  
           7     0.890260746277896     AT  
           8     0.4738624084188     TG  
           9     0.158245684870735     GA  
           10     0.295448727223441     AG  
           11     0.33900585792933     GA  
           12     0.988197918079994     AT  
           13     0.366480732651112     TG  
           14     0.229562890090616     GA  
           15     0.543711924314373     AG  
           16     0.114082525688733     GA  
           17     0.850092575768059     AT  
           18     0.546652122930436     TG  
           19     0.715004225102355     GA  
           20     0.20491958253287     AG  

Any number of states can be added, as well as any order can be simulated.


      stat     A     G     T  
 A     0.33     0.33     0.33     0.33  
 G     0.33     0.33     0.33     0.33  
 T     0.33     0.33     0.33     0.33  

 perl Markhov0.model  
                Initialising stationary distribution  
           Initial step     0.804352114516409     T  
           1     0.967022104046436     T  
           2     0.146048431090218     A  
           3     0.168446080849158     A  
           4     0.806730437919288     T  
           5     0.0541887571889532     A  
           6     0.90109841189982     T  
           7     0.815255933249396     T  
           8     0.647698135106484     G  
           9     0.93008693266961     T  
           10     0.721468943653047     T  
           11     0.311067960681136     A  
           12     0.231977046730197     A  
           13     0.251340034965111     A  
           14     0.140686836632536     A  
           15     0.0760768098699529     A  
           16     0.182739903656344     A  
           17     0.496555550139448     G  
           18     0.219248000363631     A  
           19     0.163342643602853     A  
           20     0.994710285749438     A  
And now to use it in a real world application....any suggestions?

Monday, January 20, 2014

Perl script to print codon usage per sequence in multi-fasta file

 open GENES, $ARGV[0] or die $!;  
 my @bases = ("A", "T", "C", "G");  
 my %ctable=();  
 my %genes=();  
 while($line = <GENES>){  
 if($line=~ /^>/){  
      chomp $line;  
      $line =~ s/\>//g;  
      $line =~ s/[\n\t\f\r_0-9\s]//g;  
      $seqst_temp .= $line;  
 }#end of while loop  
      foreach $base1 (@bases){  
      foreach $base2 (@bases){  
      foreach $base3 (@bases){  
      #print "$base1\t$base2\t$base3\n";  
      my $codon="$base1" . "$base2" . "$base3";  
      #populate all possible codons  
 print "Genename\t";foreach $key (sort keys %ctable) {print "$key\t";}print "\n";  
 foreach $genekey (sort keys %genes) {  
 print "$genekey\t";  
 @gcodons = ( $genes{$genekey} =~ m/.../g );  
      foreach $base1 (@bases){  
      foreach $base2 (@bases){  
      foreach $base3 (@bases){  
      #print "$base1\t$base2\t$base3\n";  
      my $codon="$base1" . "$base2" . "$base3";  
      #populate all possible codons  
      foreach $gcodon (@gcodons) {  
 #print "$gcodon\n";  
      if(exists $ctable{$gcodon}){$ctable{$gcodon}++;}  
      foreach $key (sort keys %ctable) {  
         print "$ctable{$key}\t";  
      print "\n";  
 }#end genes foreach  
 close GENES;  

Tuesday, January 14, 2014

Telo-seq: Estimating telomere lengths with Next Generation Sequencing

Telomere lengths have been associated with changes in age and disease. The need for telomere length measurements has lead to development of many methods. However, each of these methods has its own set of advantages and drawbacks. Being able to make use of the latest advances in sequencing technology to measure telomere length is a very appealing idea. The Perl script Telo-seq is designed to demonstrate the possibility of utilizing whole genome sequencing data to estimate relative telomere lengths. Application of Telo-seq to human samples from different age groups shows the age-related reduction in telomere length that has been demonstrated in humans with other methods. The relative telomere length reduces from 0.0003 in newborns to 0.0001 in centenarians.

Availability and implementation: The perl script Telo-seq is freely available at


The ends of chromosomes contain repetitive regions known as telomeres that protect chromosomes from degradation or fusion with other chromosomes. During the normal life of a cell, the telomeres undergo reduction in size at each cell division. Hence, over time cells will loose their telomeres. This is followed by loss of DNA located at the ends of chromosomes. So eventually cells die due to the loss of their telomeres. This phenomenon is characterized by the Hayflick limit [1], which sets a limit to the number of times a cell with linear chromosomes can undergo cell division. Most unicellular organisms have circular chromosomes making it possible for them to replicate indefinitely. Changes in the length of telomeres have been studied and are known to undergo changes with factors like age, disease and stress.

Telomere lengths have been measured using various methods such as Terminal Restriction Fragment (TRF) southern blot, FISH, Q-FISH and real time PCR. New methods of measurement that are robust, easy to use and sensitive are always being developed. However, methods for telomere length measurement can have specific requirements such as cell type, amount of DNA and cell growth condition. Each method has its own set of drawbacks and advantages which has been reviewed earlier [2].

Next-Generation sequencing data has been used to estimate relative telomere length in male vs females pools by counting reads [3]. However, the effect of PCR duplicates, dynamic range of the estimates and variance between sequencing runs have not been evaluated. Here we present a easy to use stand-alone tool, Telo-Seq that can estimate relative telomere length and demonstrate that Next-Generation sequencing data can be used to measure age-related changes in relative telomere length.

Whole genome sequencing data from the cord blood of a new born, 26 year old female and a 103 year old male [4] are available in the SRA (Short Read Archive). Raw fastq files for each of the datasets was downloaded and analyzed. Reads were classified as telomeric reads (table 1) if at least 5 occurrences of the 6 base vertebrate specific telomeric repeat "TTAGGG" occurred sequentially in the 90 bp reads. Varying the required number of occurrences of the repeat does not affect the analysis as all reads analyzed are of 90 bp length. While comparing samples with different read lengths, this parameter has to be tuned accordingly based on the frequency of the telomeric repeat k-mer.

The estimation of relative telomere lengths using next generation sequencing has several advantages over the methods that are currently used. However, this method is not without limitations.


  1. Samples from most tissues or cell types can be used to obtain DNA for sequencing. With the availability of single cell sequencing methods, it will also be possible to measure telomere length differences between individual cells.
  2. Species specific requirements such as specific antibodies and special cell preparation procedures are not required.
  3. Prior knowledge of karyotype or genome assembly are not required as the measurements are relative to the total amount of DNA sequenced.


  1. The dynamic range of these measurements is very narrow, with just 100 to 300 reads per million reads sequenced. Pooling of samples is not possible due to small fraction of reads that are telomeric.
  2. Removal of PCR duplicates is almost impossible due to the repetitive nature of telomeric reads without introducing biases in the estimates.
  3. Large variance (see figure 1) between sequencing runs reduces the sensitivity of the method.

Further improvements in read length and removal of artefact's like PCR duplicates from these technologies will make this method more robust. While the cost of sequencing has come down drastically, being able to sequence a sample just to measure telomere lengths is still very expensive. However, this can be an attractive option to measure telomere length in samples that have been sequenced for other purposes.

Future development:

Coverage at different K-mer frequencies can be used to estimate the actual telomere lengths and has special utility for newly sequenced species. Large insert size mate-pair libraries can also be used along with the K-mer frequency estimates to obtain telomere lengths that are specific to chromosomes. Being able to correctly map one of the mates of a mate-pair also has utility in anchoring Next-generation genome assemblies.


  1. Shay JW & Wright WE, Nat Rev Mol Cell Biol. 2000. 1(1):72-6 [PMID:11413492].
  2. Vera E & Blasco M A, Aging (Albany NY). 2012. 4(6): 379–392 [PMCID: PMC3409675].
  3. Table 1: Details of the cell type and number of telomere reads in each replicate for all the samples analyzed.

  4. Castle JC et al, BMC Genomics. 2010. 11:244 [PMID:20398377]
  5. Heyn H et al, Proc Natl Acad Sci USA. 2012. 109(26):10522-7 [PMID:22689993]


    Table 1: Details of the cell type and number of telomere reads in each replicate for all the samples analyzed.

    SRR330574103 years417362717971600.0001535557CD4+T CellsmaleCaucasianPeripheral blood
    SRR330575103 years265722125536920.0001250131CD4+T CellsmaleCaucasianPeripheral blood
    SRR330576103 years396403533876720.0001121714CD4+T CellsmaleCaucasianPeripheral blood
    SRR330577103 years319683041479360.0001051067CD4+T CellsmaleCaucasianPeripheral blood
    SRR38924926 years1041805775173200.0001803929mononuclear cellsfemaleCaucasianPeripheral blood
    SRR38924826 years1748125685246840.0003074836mononuclear cellsfemaleCaucasianPeripheral blood
    SRR330578newborn879403139075760.0002801462CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR330579newborn456801654092080.0002761636CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394135newborn20900853568040.0002448545CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394136newborn19340854936760.0002262156CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394137newborn22332855263560.0002611125CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394138newborn21432855242360.0002505956CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394139newborn24724799290720.0003093242CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394140newborn25080799295240.0003137764CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394141newborn26636798912280.0003334033CD4+T CellsmaleCaucasianUmbilical cord blood
    SRR394142newborn27300798854120.0003417395CD4+T CellsmaleCaucasianUmbilical cord blood
Figures: Figure 1: Relative telomere length measurements for Next-generation sequencing samples from different age groups.

Saturday, January 11, 2014

VCF to calls format

Perl script that reads through a VCF file and prints it out in the "calls" format.

Although i could not find any proper documentation for the calls format, it can be described as diploid genotype calls represented using ambiguous DNA bases for each individual with the first line being the individual id's or names and the following lines being the genotypes at different positions.

 use List::Util qw[min max];  
 # Input parameters  
 my $vcf_file = $ARGV[0];  
 #Go through fasta file, extract sequences  
 open(IN, $vcf_file);  
 while($z=<IN>) {  
           print "scaffold\tposition";  
           chomp $z;  
           $totind=scalar @values;  
           print "\t$values[$i]";  
      print "\n";  
           chomp $z;  
                if(length $altbase < 2){#skip tri-allelic sites  
                     print "$values[0]\t$values[1]";  
                     $totind=scalar @values;  
                $parts[0] =~ s/\///g;$parts[0] =~ s/0/$refbase/g;$parts[0] =~ s/1/$altbase/g;  
                $parts[0] =~ s/[A][A]/A/ig;$parts[0] =~ s/[T][T]/T/ig;$parts[0] =~ s/[C][C]/C/ig;$parts[0] =~ s/[G][G]/G/ig;$parts[0] =~ s/[.][.]/N/g;  
                $parts[0] =~ s/[AT][AT]/W/ig;$parts[0] =~ s/[GT][GT]/K/ig;$parts[0] =~ s/[AC][AC]/M/ig;$parts[0] =~ s/[AG][AG]/R/ig;$parts[0] =~ s/[CT][CT]/Y/ig;$parts[0] =~ s/[CG][CG]/S/ig;  
                     print "\t$parts[0]";  
           print "\n";  
 }#end of file while  
 close IN;  
 #perl genotyped.vcf > genotyped.calls  

Saturday, January 4, 2014

When The machine that won the war

The machine that won the war is a short story written by Issac Asminov in the 1960's. The story revolves around the end of a long war that has decimated an alien race leading to the survival of the human race.  One huge computer system is initially credited with making the decisions that lead to the victory. 

Data about results of various battle results, availability of resources and their locations are all collected and fed into this system. These data are processed by the super computer to provide accurate guidance to the commanders as to what decisions to make. However, over the course of the story we learn that the data being fed to it was too unreliable as most people involved in supplying the data could not be trusted. Moreover, even the data that was collected from these untrustworthy people was being manipulated so that it was "correct" before being sent to the computer. To make matters more interesting the computer system itself was not in a working conditions and could not trusted to interpret the data reliably. So the output was again being manipulated to take into account Murphy's law that says that anything that has to go wrong will go wrong. 

Finally it is revealed that although the data was being fixed at multiple stages, the final person responsible for using the decisions was using a toss of the coin to make the calls. 

Does this point to the fact that very complex systems are very hard to model, interpret and predict? So much so that a probabilistic approach performed as well as a very complicated model. May be the war could have been won much earlier if all the data was perfectly accurate and computer was in perfect working order and its instructions were followed to the letter. On the other hand having such perfect data without reliability issues might be hard to find in many systems that are very complex. Hence, the need to include a factor that takes the unreliability in the data into account.