Nagarjun's blog: January 2014

Wednesday, January 22, 2014

Nth order Markov chain in perl given stationary distribution and transition probability matrix

A Markhov chain can be simulated given a transition probability matrix.

 #!/usr/bin/perl  
 use warnings;  
 use strict;  
 open MODELS, $ARGV[0] or die $!;  
 my $cyclestorun=20;  
 my %model=();  
 my $line = <MODELS>;chomp $line;  
 my @headers=split(/\t/,$line);  
 my @values="";  
 my @stationVals;  
 my @stationDist;  
 my $check=0;  
 my $rnd=0;  
 my $test=0;  
 my $index=0;  
 my $startval="";  
 my $preval="";  
 my $nextval="";  
 my $i=0;  
 my $j=0;  
 my $stat=0;  
 my $presub="";  
 while($line = <MODELS>){  
      chomp $line;@values=split(/\t/,$line);$check=0;  
      push(@stationVals,$values[0]);  
      push(@stationDist,$values[1]);  
      for($i=2;$i<scalar @headers;$i++){  
      $check+=$values[$i];  
      if($check>1){print "Error in Transition probability matrix in line: $line\n";exit;}  
      $model{$values[0]}{$headers[$i]}=$values[$i];  
 #     print "$values[0]\t$headers[$i]\t$model{$values[0]}{$headers[$i]}\n";  
      }  
 }#end of while loop  
 close MODELS;  
      print "\t\t\tInitialising stationary distribution\n\n\n";  
 $check=0;  
 foreach $stat(@stationDist){  
      $check+=$stat;  
      if($check>1){print "Error in Stationary distribution.Probability greater than 1!\n";exit;}  
 }  
 $rnd=rand(1);  
 $test=0;  
 foreach $stat(@stationDist){  
      $test+=$stat;  
      if($rnd<$test){  
      $startval=$stationVals[$index];  
      #exit the foreach loop  
      $test=-100;  
      }  
 $index++;  
 }  
      print "\t\tInitial step\t$rnd\t$startval\n\n";  
 $preval=$startval; $test=0;  
 for($j=1;$j<=$cyclestorun;$j++){  
      $test=0;$rnd=rand(1);  
      for($i=2;$i<scalar @headers;$i++){  
      $test+=$model{$preval}{$headers[$i]};  
           if($rnd<$test){  
           $nextval=$headers[$i];  
           #exit the foreach loop  
           $test=-100;  
           }  
      }  
      if((length $preval)>1){$presub=substr $preval,1,1;}  
      $preval=$presub . $nextval;  
      print "\t\t$j\t$rnd\t$preval\n";  
 }#end of cyclestorun loop

Example cases:

First order Markhov chain:
The "stat" column corresponds to the stationary matrix.

Model:

      stat     A     G     T  
 A     0.1     0.1     0.5     0.4  
 G     0.5     0.9     0.1     0  
 T     0.4     0.3     0.3     0.4

Output:

 perl markhov.pl Markhov1.model  
                Initialising stationary distribution  
           Initial step     0.566160679402859     G  
           1     0.726538222140761     A  
           2     0.307280103481514     G  
           3     0.892182052225458     A  
           4     0.336793610721617     G  
           5     0.205083725028075     A  
           6     0.523360180106774     G  
           7     0.376126474509043     A  
           8     0.385466656638901     G  
           9     0.357110639627308     A  
           10     0.940985077542603     T  
           11     0.567528252945955     G  
           12     0.215130112396754     A  
           13     0.737550170708133     T  
           14     0.372513088005231     G  
           15     0.87686981502349     A  
           16     0.534856061888437     G  
           17     0.94194553431501     G  
           18     0.800215938891096     A  
           19     0.990931721477871     T  
           20     0.371922711443307     G

Second order Markhov chain:

Model:

      stat     A     G     T  
 AA     0.1     0.1     0.5     0.4  
 AG     0.05     0.9     0.1     0  
 AT     0.04     0.3     0.3     0.4  
 GA     0.2     0.1     0.5     0.4  
 GG     0.1     0.9     0.1     0  
 GT     0.1     0.3     0.3     0.4  
 TA     0.1     0.1     0.5     0.4  
 TG     0.1     0.9     0.1     0  
 TT     0.2     0.3     0.3     0.4

Output:

 perl markhov.pl Markhov2.model  
                Initialising stationary distribution  
           Initial step     0.625368699737923     TA  
           1     0.590942442973272     AG  
           2     0.0483144526753101     GA  
           3     0.0103876021193585     AA  
           4     0.367650618439331     AG  
           5     0.973362473523579     GG  
           6     0.310508404063821     GA  
           7     0.890260746277896     AT  
           8     0.4738624084188     TG  
           9     0.158245684870735     GA  
           10     0.295448727223441     AG  
           11     0.33900585792933     GA  
           12     0.988197918079994     AT  
           13     0.366480732651112     TG  
           14     0.229562890090616     GA  
           15     0.543711924314373     AG  
           16     0.114082525688733     GA  
           17     0.850092575768059     AT  
           18     0.546652122930436     TG  
           19     0.715004225102355     GA  
           20     0.20491958253287     AG

Any number of states can be added, as well as any order can be simulated.

Zero-order:

Model:

      stat     A     G     T  
 A     0.33     0.33     0.33     0.33  
 G     0.33     0.33     0.33     0.33  
 T     0.33     0.33     0.33     0.33

Output:

 perl markhov.pl Markhov0.model  
                Initialising stationary distribution  
           Initial step     0.804352114516409     T  
           1     0.967022104046436     T  
           2     0.146048431090218     A  
           3     0.168446080849158     A  
           4     0.806730437919288     T  
           5     0.0541887571889532     A  
           6     0.90109841189982     T  
           7     0.815255933249396     T  
           8     0.647698135106484     G  
           9     0.93008693266961     T  
           10     0.721468943653047     T  
           11     0.311067960681136     A  
           12     0.231977046730197     A  
           13     0.251340034965111     A  
           14     0.140686836632536     A  
           15     0.0760768098699529     A  
           16     0.182739903656344     A  
           17     0.496555550139448     G  
           18     0.219248000363631     A  
           19     0.163342643602853     A  
           20     0.994710285749438     A

And now to use it in a real world application....any suggestions?

Monday, January 20, 2014

Perl script to print codon usage per sequence in multi-fasta file

 #!/usr/bin/perl  
 open GENES, $ARGV[0] or die $!;  
 my @bases = ("A", "T", "C", "G");  
 my %ctable=();  
 my %genes=();  
 while($line = <GENES>){  
 if($line=~ /^>/){  
           if($header){  
           $genes{$header}=$seqst_temp;  
           }  
      chomp $line;  
      $line =~ s/\>//g;  
      $header="";  
      $header=$line;  
      $seqst_temp="";  
      }  
      else{  
      $line =~ s/[\n\t\f\r_0-9\s]//g;  
      $seqst_temp .= $line;  
      }  
 }#end of while loop  
 if($header){  
 $genes{$header}=$seqst_temp;  
 }  
      foreach $base1 (@bases){  
      foreach $base2 (@bases){  
      foreach $base3 (@bases){  
      #print "$base1\t$base2\t$base3\n";  
      my $codon="$base1" . "$base2" . "$base3";  
      #populate all possible codons  
      $ctable{$codon}=0;  
      }}}  
 #header  
 print "Genename\t";foreach $key (sort keys %ctable) {print "$key\t";}print "\n";  
 foreach $genekey (sort keys %genes) {  
 print "$genekey\t";  
 @gcodons = ( $genes{$genekey} =~ m/.../g );  
      foreach $base1 (@bases){  
      foreach $base2 (@bases){  
      foreach $base3 (@bases){  
      #print "$base1\t$base2\t$base3\n";  
      my $codon="$base1" . "$base2" . "$base3";  
      #populate all possible codons  
      $ctable{$codon}=0;  
      }}}  
      foreach $gcodon (@gcodons) {  
 #print "$gcodon\n";  
      if(exists $ctable{$gcodon}){$ctable{$gcodon}++;}  
      }  
      foreach $key (sort keys %ctable) {  
         print "$ctable{$key}\t";  
      }  
      print "\n";  
 }#end genes foreach  
 close GENES;

Tuesday, January 14, 2014

Telo-seq: Estimating telomere lengths with Next Generation Sequencing

Telomere lengths have been associated with changes in age and disease. The need for telomere length measurements has lead to development of many methods. However, each of these methods has its own set of advantages and drawbacks. Being able to make use of the latest advances in sequencing technology to measure telomere length is a very appealing idea. The Perl script Telo-seq is designed to demonstrate the possibility of utilizing whole genome sequencing data to estimate relative telomere lengths. Application of Telo-seq to human samples from different age groups shows the age-related reduction in telomere length that has been demonstrated in humans with other methods. The relative telomere length reduces from 0.0003 in newborns to 0.0001 in centenarians.

Availability and implementation: The perl script Telo-seq is freely available at http://code.google.com/p/telo-seq/

Background:

The ends of chromosomes contain repetitive regions known as telomeres that protect chromosomes from degradation or fusion with other chromosomes. During the normal life of a cell, the telomeres undergo reduction in size at each cell division. Hence, over time cells will loose their telomeres. This is followed by loss of DNA located at the ends of chromosomes. So eventually cells die due to the loss of their telomeres. This phenomenon is characterized by the Hayflick limit [1], which sets a limit to the number of times a cell with linear chromosomes can undergo cell division. Most unicellular organisms have circular chromosomes making it possible for them to replicate indefinitely. Changes in the length of telomeres have been studied and are known to undergo changes with factors like age, disease and stress.

Telomere lengths have been measured using various methods such as Terminal Restriction Fragment (TRF) southern blot, FISH, Q-FISH and real time PCR. New methods of measurement that are robust, easy to use and sensitive are always being developed. However, methods for telomere length measurement can have specific requirements such as cell type, amount of DNA and cell growth condition. Each method has its own set of drawbacks and advantages which has been reviewed earlier [2].

Next-Generation sequencing data has been used to estimate relative telomere length in male vs females pools by counting reads [3]. However, the effect of PCR duplicates, dynamic range of the estimates and variance between sequencing runs have not been evaluated. Here we present a easy to use stand-alone tool, Telo-Seq that can estimate relative telomere length and demonstrate that Next-Generation sequencing data can be used to measure age-related changes in relative telomere length.

Whole genome sequencing data from the cord blood of a new born, 26 year old female and a 103 year old male [4] are available in the SRA (Short Read Archive). Raw fastq files for each of the datasets was downloaded and analyzed. Reads were classified as telomeric reads (table 1) if at least 5 occurrences of the 6 base vertebrate specific telomeric repeat "TTAGGG" occurred sequentially in the 90 bp reads. Varying the required number of occurrences of the repeat does not affect the analysis as all reads analyzed are of 90 bp length. While comparing samples with different read lengths, this parameter has to be tuned accordingly based on the frequency of the telomeric repeat k-mer.

The estimation of relative telomere lengths using next generation sequencing has several advantages over the methods that are currently used. However, this method is not without limitations.

Advantages:

Samples from most tissues or cell types can be used to obtain DNA for sequencing. With the availability of single cell sequencing methods, it will also be possible to measure telomere length differences between individual cells.
Species specific requirements such as specific antibodies and special cell preparation procedures are not required.
Prior knowledge of karyotype or genome assembly are not required as the measurements are relative to the total amount of DNA sequenced.

Limitations:

The dynamic range of these measurements is very narrow, with just 100 to 300 reads per million reads sequenced. Pooling of samples is not possible due to small fraction of reads that are telomeric.
Removal of PCR duplicates is almost impossible due to the repetitive nature of telomeric reads without introducing biases in the estimates.
Large variance (see figure 1) between sequencing runs reduces the sensitivity of the method.

Further improvements in read length and removal of artefact's like PCR duplicates from these technologies will make this method more robust. While the cost of sequencing has come down drastically, being able to sequence a sample just to measure telomere lengths is still very expensive. However, this can be an attractive option to measure telomere length in samples that have been sequenced for other purposes.

Future development:

Coverage at different K-mer frequencies can be used to estimate the actual telomere lengths and has special utility for newly sequenced species. Large insert size mate-pair libraries can also be used along with the K-mer frequency estimates to obtain telomere lengths that are specific to chromosomes. Being able to correctly map one of the mates of a mate-pair also has utility in anchoring Next-generation genome assemblies.

References:

Shay JW & Wright WE, Nat Rev Mol Cell Biol. 2000. 1(1):72-6 [PMID:11413492].
Vera E & Blasco M A, Aging (Albany NY). 2012. 4(6): 379–392 [PMCID: PMC3409675].

Table 1: Details of the cell type and number of telomere reads in each replicate for all the samples analyzed.

Castle JC et al, BMC Genomics. 2010. 11:244 [PMID:20398377]

Heyn H et al, Proc Natl Acad Sci USA. 2012. 109(26):10522-7 [PMID:22689993]

Tables:

Table 1: Details of the cell type and number of telomere reads in each replicate for all the samples analyzed.

Sample	Age	Telo_read	Total_reads	Ratio	cell_type	gender	racial_classification	source_tissue
SRR330574	103 years	41736	271797160	0.0001535557	CD4+T Cells	male	Caucasian	Peripheral blood
SRR330575	103 years	26572	212553692	0.0001250131	CD4+T Cells	male	Caucasian	Peripheral blood
SRR330576	103 years	39640	353387672	0.0001121714	CD4+T Cells	male	Caucasian	Peripheral blood
SRR330577	103 years	31968	304147936	0.0001051067	CD4+T Cells	male	Caucasian	Peripheral blood
SRR389249	26 years	104180	577517320	0.0001803929	mononuclear cells	female	Caucasian	Peripheral blood
SRR389248	26 years	174812	568524684	0.0003074836	mononuclear cells	female	Caucasian	Peripheral blood
SRR330578	newborn	87940	313907576	0.0002801462	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR330579	newborn	45680	165409208	0.0002761636	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394135	newborn	20900	85356804	0.0002448545	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394136	newborn	19340	85493676	0.0002262156	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394137	newborn	22332	85526356	0.0002611125	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394138	newborn	21432	85524236	0.0002505956	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394139	newborn	24724	79929072	0.0003093242	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394140	newborn	25080	79929524	0.0003137764	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394141	newborn	26636	79891228	0.0003334033	CD4+T Cells	male	Caucasian	Umbilical cord blood
SRR394142	newborn	27300	79885412	0.0003417395	CD4+T Cells	male	Caucasian	Umbilical cord blood

Figures: Figure 1: Relative telomere length measurements for Next-generation sequencing samples from different age groups.

Saturday, January 11, 2014

VCF to calls format

Perl script that reads through a VCF file and prints it out in the "calls" format.

Although i could not find any proper documentation for the calls format, it can be described as diploid genotype calls represented using ambiguous DNA bases for each individual with the first line being the individual id's or names and the following lines being the genotypes at different positions.

 #!/usr/bin/perl  
 use List::Util qw[min max];  
 # Input parameters  
 my $vcf_file = $ARGV[0];  
 #Go through fasta file, extract sequences  
 open(IN, $vcf_file);  
 while($z=<IN>) {  
      if($z=~m/^##/){}  
      elsif($z=~m/^#/){  
           print "scaffold\tposition";  
           chomp $z;  
           @values=split(/\t/,$z);  
           $totind=scalar @values;  
           for($i=9;$i<$totind;$i++){  
           print "\t$values[$i]";  
           }  
      print "\n";  
      }  
      else{  
           chomp $z;  
           @values=split(/\t/,$z);  
           $refbase=$values[3];  
           $altbase=$values[4];  
                if(length $altbase < 2){#skip tri-allelic sites  
                     print "$values[0]\t$values[1]";  
                     $totind=scalar @values;  
                     for($i=9;$i<$totind;$i++){  
                     @parts=split(/\:/,$values[$i]);  
                $parts[0] =~ s/\///g;$parts[0] =~ s/0/$refbase/g;$parts[0] =~ s/1/$altbase/g;  
                $parts[0] =~ s/[A][A]/A/ig;$parts[0] =~ s/[T][T]/T/ig;$parts[0] =~ s/[C][C]/C/ig;$parts[0] =~ s/[G][G]/G/ig;$parts[0] =~ s/[.][.]/N/g;  
                $parts[0] =~ s/[AT][AT]/W/ig;$parts[0] =~ s/[GT][GT]/K/ig;$parts[0] =~ s/[AC][AC]/M/ig;$parts[0] =~ s/[AG][AG]/R/ig;$parts[0] =~ s/[CT][CT]/Y/ig;$parts[0] =~ s/[CG][CG]/S/ig;  
                     print "\t$parts[0]";  
                     }  
           print "\n";  
           }  
      }  
 }#end of file while  
 close IN;  
 #perl vcftocalls.pl genotyped.vcf > genotyped.calls

Saturday, January 4, 2014

When The machine that won the war

The machine that won the war is a short story written by Issac Asminov in the 1960's. The story revolves around the end of a long war that has decimated an alien race leading to the survival of the human race. One huge computer system is initially credited with making the decisions that lead to the victory.

Data about results of various battle results, availability of resources and their locations are all collected and fed into this system. These data are processed by the super computer to provide accurate guidance to the commanders as to what decisions to make. However, over the course of the story we learn that the data being fed to it was too unreliable as most people involved in supplying the data could not be trusted. Moreover, even the data that was collected from these untrustworthy people was being manipulated so that it was "correct" before being sent to the computer. To make matters more interesting the computer system itself was not in a working conditions and could not trusted to interpret the data reliably. So the output was again being manipulated to take into account Murphy's law that says that anything that has to go wrong will go wrong.

Finally it is revealed that although the data was being fixed at multiple stages, the final person responsible for using the decisions was using a toss of the coin to make the calls.

Does this point to the fact that very complex systems are very hard to model, interpret and predict? So much so that a probabilistic approach performed as well as a very complicated model. May be the war could have been won much earlier if all the data was perfectly accurate and computer was in perfect working order and its instructions were followed to the letter. On the other hand having such perfect data without reliability issues might be hard to find in many systems that are very complex. Hence, the need to include a factor that takes the unreliability in the data into account.