In certain cases, its required to have sites from a VCF file filtered by fraction of genotypes missing by population. While entire sites can be filtered using programs like VCFtools, its not possible to filter them by population. Here its assumed that all individuals from $ARGV[1] to $ARGV[2] are in one population and those from $ARGV[3] to $ARGV[4] are in another.
#!/usr/bin/perl
use List::Util qw[min max];
# Input parameters
my $vcf_file = $ARGV[0];
my $firstmissing=0;
my $firstall=$ARGV[2]-$ARGV[1];
my $secondmissing=0;
my $secondall=$ARGV[4]-$ARGV[3];
#Go through fasta file, extract sequences
open(IN, $vcf_file);
while($z=<IN>) {
$firstmissing=0;
$secondmissing=0;
if($z=~m/^#/){print $z;}
else{
chomp $z;
@values=split(/\t/,$z);
$totind=length @values;
for($i=$ARGV[1];$i<=$ARGV[2];$i++){if($values[$i]=~m/\.\/\./){$firstmissing++;}}
for($i=$ARGV[3];$i<=$ARGV[4];$i++){if($values[$i]=~m/\.\/\./){$secondmissing++;}}
if((($secondmissing/$secondall)<0.5)&&(($firstmissing/$firstall)<0.5)){print "$z\n";}
}#end of else #
}#end of file while
close IN;
No comments:
Post a Comment