Sunday, May 26, 2013

Trees of whole genomes, but locally

With increasing use of Next-generation sequencing methods, whole genomes of multiple species as well as closely related strains or populations are being generated. Sophisticated models to interpret this data that use our current knowledge of population dynamics and molecular processes are being used extensively.

The general idea is to be able to distinguish the population histories of various parts of the genome. The phylogenies of different parts of the genome might differ based on various functional constraints. Hence, building a single consensus tree for the entire genome will average over these local effects. To be able to overcome this limitation, many methods are being proposed to identify local phylogenies across the genome.

The program Saguaro, from Science for Life Laboratory and Broad Institute, is capable of using SNP data in VCF format or whole genome assembly alignments in MAF format to partition the genome based on local phylogenetic information. The method and its use on few example datasets is described in Unsupervised genome-wide recognition of local relationship patterns

Get gene start and end from GTF file

In a GTF file that does not have CDS records annotated, its tricky to get the start and end coordinates of a gene. This script does just that and prints a nice BED format file. It also report the genes that occur on multiple scaffolds or chromosomes as errors. These can be greped out using grep -v "Error".
 #!/usr/bin/perl  
 my $gtf1 = $ARGV[0];  
 my %scaf=();  
 my $start=();  
 my %end=();  
 my %multi=();  
 open(GTF1, $gtf1);  
 while($z=<GTF1>){   
 chomp $z;  
 my @parts=split(/\t/,$z);  
 my $start=$parts[3];  
 my $end=$parts[4];  
 my @parts2=split(/\s/,$parts[8]);  
 my $gene=$parts2[1];  
 $gene=~s/[\;\"]//g;  
 ##print "$gene\t$start\t$end\n";  
 if(!(exists $scaf{$gene})){$scaf{$gene}=$parts[0];}  
 elsif(!($scaf{$gene}=~m/$parts[0]/)){$multi{$gene}=1;}  
 if(!(exists $start{$gene})){$start{$gene}=$start;$end{$gene}=$end;}  
 else{$end{$gene}=$end;}  
 }  
 close(GTF1);  
 foreach my $keygene(sort keys %scaf){  
 if($start{$keygene}>$end{$keygene}){$temp=$start{$keygene};$start{$keygene}=$end{$keygene};$end{$keygene}=$temp;}  
 if(!(exists $multi{$keygene})){print "$scaf{$keygene}\t$start{$keygene}\t$end{$keygene}\t$keygene\n";}  
 else{print "Error:Gene $keygene on mutiple scaffolds\n";}  
 }  

Saturday, May 25, 2013

Perl script to filter sites with more than 50 percent sites missing by population

In certain cases, its required to have sites from a VCF file filtered by fraction of genotypes missing by population. While entire sites can be filtered using programs like VCFtools, its not possible to filter them by population. Here its assumed that all individuals from $ARGV[1] to $ARGV[2] are in one population and those from $ARGV[3] to $ARGV[4] are in another.

 #!/usr/bin/perl  
 use List::Util qw[min max];  
 # Input parameters  
 my $vcf_file = $ARGV[0];  
 my $firstmissing=0;  
 my $firstall=$ARGV[2]-$ARGV[1];  
 my $secondmissing=0;  
 my $secondall=$ARGV[4]-$ARGV[3];  
 #Go through fasta file, extract sequences  
 open(IN, $vcf_file);  
 while($z=<IN>) {  
 $firstmissing=0;  
 $secondmissing=0;  
      if($z=~m/^#/){print $z;}  
      else{  
           chomp $z;  
           @values=split(/\t/,$z);  
      $totind=length @values;  
      for($i=$ARGV[1];$i<=$ARGV[2];$i++){if($values[$i]=~m/\.\/\./){$firstmissing++;}}  
      for($i=$ARGV[3];$i<=$ARGV[4];$i++){if($values[$i]=~m/\.\/\./){$secondmissing++;}}  
 if((($secondmissing/$secondall)<0.5)&&(($firstmissing/$firstall)<0.5)){print "$z\n";}  
      }#end of else #  
 }#end of file while  
 close IN;  

Monday, May 20, 2013

Combining SNP data tables with perl

Continuing with our previous exploits Analyzing in vitro toxicity and expression data, its time to combine SNP data tables using perl.

#!/usr/bin/perl

open CANFAM2, $ARGV[0] or die $!;
open CANFAM3, $ARGV[1] or die $!;

%can2=();
%can3=();

while($line = <CANFAM2>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];

if(exists $can2{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can2{$key}=$value;}

}#end of file while loop

while($line = <CANFAM3>){
chomp $line;
@tabs=split(/\s+/,$line);$tabs[9]=~s/[\;\"]//g;$tabs[11]=~s/[\;\"]//g;
$key=$tabs[9];
$value=$tabs[0] . "_" . $tabs[3] . "_" . $tabs[11];


if(exists $can3{$key}){print "Error:Duplicate record for SNP_ID $tabs[9]. Skipping second occurence.\n";}
else{$can3{$key}=$value;}

}#end of file while loop


print "SNP_ID\tbases\tchr_canfam2\tpos_canfam2\tchr_canfam3\tpos_canfam3\tdistance_between_versions\n";
$transitions=0;
$transversions=0;

foreach $mykey(sort keys %can2){
@can2tabs=split(/\_/,$can2{$mykey});

if($can2tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
elsif($can2tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}

    if(exists $can3{$mykey}){
        @can3tabs=split(/\_/,$can3{$mykey});
            if($can2tabs[0]=~m/$can3tabs[0]/){$distance=abs($can2tabs[1]-$can3tabs[1]);}else{$distance="NA";}
        print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\t$can3tabs[0]\t$can3tabs[1]\t$distance\n";
    }
    else {
    print "Error:SNP_ID $mykey missing in canfam3.\n";
    print "$mykey\t$can2tabs[2]\t$can2tabs[0]\t$can2tabs[1]\tNA\tNA\tNA\n";
    }
}

foreach $mykey(sort keys %can3){
    if(!(exists $can2{$mykey})){
    print "Error:SNP_ID $mykey missing in canfam2.\n";
    @can3tabs=split(/\_/,$can3{$mykey});
    print "$mykey\t$can3tabs[2]\tNA\tNA\t$can3tabs[0]\t$can3tabs[1]\tNA\n";

    if($can3tabs[2]=~m/(AG|GA|CT|TC)/){$transitions++;}
    elsif($can3tabs[2]=~m/(AC|CA|GT|TG)/){$transversions++;}
    }
}
$titv=$transitions/$transversions;
print "Info: $transitions Transistions\n";
print "Info: $transversions Transversions\n";
print "Info: $titv Ti/Tv ratio\n";

close CANFAM2;
close CANFAM3;


To do something bonus, lets calculate transitions and transversions to infer the Ti/Tv ratio.


Info: 1852527 Transistions
Info: 504318 Transversions
Info: 3.67333111251234 Ti/Tv ratio

Apart from this, 39,352 SNP's were found in only one of the two versions of the assembly. These could be attributed to gaps in the assembly that were filled at a later stage by Illumina sequencing. 

Sunday, May 19, 2013

Toxicity and expression data project

One possible solution to Analyzing in vitro toxicity and expression data perl project. Example table that is used as input to the perl script is given in below table.


"source_name_sid""casrn""chemical_name""assay_component_name""conc""value""value_type""plate_id""rep""row""col""target""time_hr""reference""media_reference""donor"
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0041.08"fold_change"11310"E"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0040.96"fold_change"11310"F"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0040.94"fold_change"11310"G"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.0040.93"fold_change"11310"H"5"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.040.94"fold_change"1138"G"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.041.13"fold_change"1138"E"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.040.96"fold_change"1138"H"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.041.07"fold_change"1138"F"4"ABCB1"6300.5345.75778
"DSSTOX_40294""135158-54-2""Acibenzolar-S-Methyl""CLZD_ABCB1_6"0.41.02"fold_change"1136"F"3"ABCB1"6300.5345.75778

Actual perl code:
#!/usr/bin/perl

open TOXTABLE, $ARGV[0] or die $!;

%gcct=();

while($line = <TOXTABLE>){
chomp $line;
@tabs=split(/\t/,$line);

$key=$tabs[2] . "_" . $tabs[4] . "_" . $tabs[11] . "_" . $tabs[12];
$foldchange=$tabs[5];

if(exists $gcct{$key}){if($gcct{$key}<$foldchange){$gcct{$key}=$foldchange;}}
else{$gcct{$key}=$foldchange;}

}#end of file while loop

print "chemical_name\tconctarget\ttime_hr\tvalue\n";

foreach $mykey(sort keys %gcct){
@mytabs=split(/\_/,$mykey);
print "$mytabs[0]\t$mytabs[1]\t$mytabs[2]\t$mytabs[3]\t$gcct{$mykey}\n";
}


Just run the perl script with the filename of the table as first argument and it will print "which chemical compound and at what concentration causes the highest fold change of transcription in each gene at each timepoint".

The same can be done with one line in R using :



 aggregate(M$value,by=list(M$chemical_name,M$conc,M$target,M$time_hr),max)->N  

Sunday, May 12, 2013

Defined by the unknown

How would one explain one's linear existence to a being that has no concept of time? Deep Space Nine  has a situation in which this transpires. The very philosophical nature of the entire scenario makes it a rather interesting situation.

"It is the unknown that defines our existence. We are constantly searching, not just for answers to our questions, but for new questions. We are explorers. We explore our lives, day by day, and we explore the galaxy, trying to expand the boundaries of our knowledge."

While this might be true of the utopian society that the federation succeeds in building, its far from the truth today. The search is driven more by profit, that even the Firangi would be proud of us.