Tuesday, March 17, 2015

HGT candidate genes - GERP score distribution

HGT candidate genes have been identified in vertebrate species based on gene absence/presence in metazoans. The actual workflow can be found in figure S4.
Interestingly, they identify 145 genes that are candidates for HGT into all primates from various taxon and another 51 from viruses. They provide Ensemble gene id's for these genes in the Human genome.

Of the 145 genes, only 135 could be found in the Ensemble biomart today (Ensemble 79). The following ten genes have been "retired" and are not used in below analysis. These genes seem to be randomly drawn from the different "confidence classes" used in the paper.

ENSG00000107618    class A HGT    Validated
ENSG00000157358    class A HGT    Validated
ENSG00000196333    class C HGT    Not-validated
ENSG00000204486    class C HGT    Validated
ENSG00000204502    class C HGT    Validated
ENSG00000204513    class C HGT    Validated
ENSG00000212857    class C HGT    Not-validated
ENSG00000256062    class A HGT    Validated
ENSG00000260383    class B HGT    Not-validated
ENSG00000263074    class B HGT    Validated

ENSG00000107618 has been merged with ENSG00000265203 which is also a HGT candidate. ENSG00000256062 has been merged into ENSG00000175164, this is the much talked about ABO blood group gene. ENSG00000260383 has been merged into ENSG00000179832 which also happens to be a HGT candidate. ENSG00000263074 is merged into ENSG00000141337 (arylsulfatase G).

A total of 4018 exons could be found for these 135 genes. GERP scores could not obtained for 8 of these genes [ENSG00000066813,ENSG00000183248,ENSG00000183549,ENSG00000183747,
ENSG00000204510,ENSG00000229571,ENSG00000232423,ENSG00000243073]. Here, we obtain the GERP scores for each base of all these exons. Example code from the web was updated to use a bed file as input and print the GERP score for each base within the bed intervals.

 #!/usr/bin/env perl  
 use strict;  
 use warnings;  
 use lib '~/biomart-perl/lib';  
 use Bio::EnsEMBL::Registry;  
 use Bio::EnsEMBL::Utils::Exception qw(throw);  
 my $reg = "Bio::EnsEMBL::Registry";  
 my $species = "Homo sapiens";  
 my $line = "";  
 open FILE1, $ARGV[0] or die $!;  
 $reg->load_registry_from_db(  
    -host => "ensembldb.ensembl.org",  
    -user => "anonymous",  
 );  
 #get method_link_species_set adaptor  
 my $mlss_adaptor = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");  
 my $mlss = $mlss_adaptor->fetch_by_method_link_type_species_set_name("GERP_CONSERVATION_SCORE", "mammals");  
 throw("Unable to find method_link_species_set") if (!defined($mlss));  
 my $slice_adaptor = $reg->get_adaptor($species, 'core', 'Slice');  
 throw("Registry configuration file has no data for connecting to <$species>") if (!$slice_adaptor);  
 my $cs_adaptor = $reg->get_adaptor("Multi", 'compara', 'ConservationScore');                 
 while($line=<FILE1>){  
 #$line=$ARGV[0];  
 chomp $line;  
 my @parts=split('\t',$line);  
 my $seq_region = $parts[0];  
 my $seq_region_start = $parts[1];  
 my $seq_region_end =  $parts[2];  
 my $slice = $slice_adaptor->fetch_by_region('toplevel', $seq_region, $seq_region_start, $seq_region_end);  
 throw("No Slice can be created with coordinates $seq_region:$seq_region_start-$seq_region_end") if (!$slice);  
 my $display_size = $slice->end - $slice->start + 1;   
 my $scores = $cs_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice, $display_size);  
 #print "number of scores " . @$scores . "\n";  
 foreach my $score (@$scores) {  
   if (defined $score->diff_score) {  
 #     printf("position %d observed %.4f expected %.4f difference %.4f\n", $score->position, $score->observed_score, $score->expected_score, $score->diff_score);  
      printf("$parts[3]\t$parts[4]\t%.4f\n",$score->diff_score);  
   }  
 }  
 }  
 close FILE1;  

Negative GERP scores suggest that the sites are probably evolving neutrally.

 read.table(file="GERP.scores",header=FALSE,stringsAsFactors=FALSE)->M  
 as.data.frame(aggregate(M,list(M$V2),mean))->N  
 as.data.frame(aggregate(M,list(M$V2),max))->P  
 N$Group.1[N$V3<0]->negative.mean.GERP  
 jpeg("GERP_HGT_distribution.jpeg")  
 boxplot(as.numeric(M$V3)~M$V2,outline=FALSE,col=ifelse(unique(M$V2) %in% negative.mean.GERP, "red", "blue"),xaxt="n",ylab="GERP score",main="GERP score by gene")  
 legend("bottomright", c("Negative mean GERP", "Positive mean GERP"), fill = c("red", "blue"))  
 dev.off()  

Despite 41 genes having a mean GERP score less than 0, each of these genes have atleast a few sites that are having positive GERP scores.



Based on above arguments, it could be suggested that 10 of the 145 genes are probably annotation artifacts. The 127 genes for which GERP scores could be obtained from whole genome alignments show moderate levels of conservation, suggesting these genes are probably not artifacts.

No comments: