HGT candidate genes have been identified in vertebrate species based on gene absence/presence in metazoans. The actual workflow can be found in figure S4.
Interestingly, they identify 145 genes that are candidates for HGT into all primates from various taxon and another 51 from viruses. They provide Ensemble gene id's for these genes in the Human genome.
Of the 145 genes, only 135 could be found in the Ensemble biomart today (Ensemble 79). The following ten genes have been "retired" and are not used in below analysis. These genes seem to be randomly drawn from the different "confidence classes" used in the paper.
ENSG00000107618 class A HGT Validated
ENSG00000157358 class A HGT Validated
ENSG00000196333 class C HGT Not-validated
ENSG00000204486 class C HGT Validated
ENSG00000204502 class C HGT Validated
ENSG00000204513 class C HGT Validated
ENSG00000212857 class C HGT Not-validated
ENSG00000256062 class A HGT Validated
ENSG00000260383 class B HGT Not-validated
ENSG00000263074 class B HGT Validated
ENSG00000107618 has been merged with ENSG00000265203 which is also a HGT candidate. ENSG00000256062 has been merged into ENSG00000175164, this is the much talked about ABO blood group gene. ENSG00000260383 has been merged into ENSG00000179832 which also happens to be a HGT candidate. ENSG00000263074 is merged into ENSG00000141337 (arylsulfatase G).
A total of 4018 exons could be found for these 135 genes. GERP scores could not obtained for 8 of these genes [ENSG00000066813,ENSG00000183248,ENSG00000183549,ENSG00000183747,
ENSG00000204510,ENSG00000229571,ENSG00000232423,ENSG00000243073]. Here, we obtain the GERP scores for each base of all these exons. Example code from the web was updated to use a bed file as input and print the GERP score for each base within the bed intervals.
Negative GERP scores suggest that the sites are probably evolving neutrally.
ENSG00000204510,ENSG00000229571,ENSG00000232423,ENSG00000243073]. Here, we obtain the GERP scores for each base of all these exons. Example code from the web was updated to use a bed file as input and print the GERP score for each base within the bed intervals.
#!/usr/bin/env perl
use strict;
use warnings;
use lib '~/biomart-perl/lib';
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Exception qw(throw);
my $reg = "Bio::EnsEMBL::Registry";
my $species = "Homo sapiens";
my $line = "";
open FILE1, $ARGV[0] or die $!;
$reg->load_registry_from_db(
-host => "ensembldb.ensembl.org",
-user => "anonymous",
);
#get method_link_species_set adaptor
my $mlss_adaptor = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");
my $mlss = $mlss_adaptor->fetch_by_method_link_type_species_set_name("GERP_CONSERVATION_SCORE", "mammals");
throw("Unable to find method_link_species_set") if (!defined($mlss));
my $slice_adaptor = $reg->get_adaptor($species, 'core', 'Slice');
throw("Registry configuration file has no data for connecting to <$species>") if (!$slice_adaptor);
my $cs_adaptor = $reg->get_adaptor("Multi", 'compara', 'ConservationScore');
while($line=<FILE1>){
#$line=$ARGV[0];
chomp $line;
my @parts=split('\t',$line);
my $seq_region = $parts[0];
my $seq_region_start = $parts[1];
my $seq_region_end = $parts[2];
my $slice = $slice_adaptor->fetch_by_region('toplevel', $seq_region, $seq_region_start, $seq_region_end);
throw("No Slice can be created with coordinates $seq_region:$seq_region_start-$seq_region_end") if (!$slice);
my $display_size = $slice->end - $slice->start + 1;
my $scores = $cs_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice, $display_size);
#print "number of scores " . @$scores . "\n";
foreach my $score (@$scores) {
if (defined $score->diff_score) {
# printf("position %d observed %.4f expected %.4f difference %.4f\n", $score->position, $score->observed_score, $score->expected_score, $score->diff_score);
printf("$parts[3]\t$parts[4]\t%.4f\n",$score->diff_score);
}
}
}
close FILE1;
Negative GERP scores suggest that the sites are probably evolving neutrally.
read.table(file="GERP.scores",header=FALSE,stringsAsFactors=FALSE)->M
as.data.frame(aggregate(M,list(M$V2),mean))->N
as.data.frame(aggregate(M,list(M$V2),max))->P
N$Group.1[N$V3<0]->negative.mean.GERP
jpeg("GERP_HGT_distribution.jpeg")
boxplot(as.numeric(M$V3)~M$V2,outline=FALSE,col=ifelse(unique(M$V2) %in% negative.mean.GERP, "red", "blue"),xaxt="n",ylab="GERP score",main="GERP score by gene")
legend("bottomright", c("Negative mean GERP", "Positive mean GERP"), fill = c("red", "blue"))
dev.off()
Despite 41 genes having a mean GERP score less than 0, each of these genes have atleast a few sites that are having positive GERP scores.
Based on above arguments, it could be suggested that 10 of the 145 genes are probably annotation artifacts. The 127 genes for which GERP scores could be obtained from whole genome alignments show moderate levels of conservation, suggesting these genes are probably not artifacts.
No comments:
Post a Comment