bcftools usage (Updating)
Select genotypes from multi samples
We got a vcf file named as trio.vcf containing 3 samples for a trio family for genetic analysis:
\# ...
\# ...
\# ...
\#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
1 12783 . G A 607.89 PASS AC=3;AF=0.750;AN=4;BaseQRankSum=0.00;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=3;MLEAF=0.750;MQ=26.35;MQRankSum=0.621;QD=22.51;ReadPosRankSum=1.73;SOR=2.753 GT:AD:AF:DP:GQ:JL:JP:PL:PP 0/1:5,17:0.773:22:88:-1:-1:463,0,88:478,0,88 1/1:0,5:1.00:5:13:-1:-1:157,15,0:157,13,0 ./.:0,0:.:0:.:.:.:0,0,0
1 13178 . G A 56.48 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=1.48;DP=254;ExcessHet=3.0103;FS=9.066;MLEAC=1;MLEAF=0.167;MQ=25.58;MQRankSum=-9.600e-02;QD=2.09;ReadPosRankSum=-6.200e-02;SOR=3.287 GT:AD:AF:DP:GQ:JL:JP:PL:PP 0/0:218,0:.:218:99:3:3:0,120,1800:0,120,1803 0/1:22,5:0.185:27:64:3:3:64,0,543:64,0,603 0/0:9,0:.:9:0:3:3:0,0,194:0,0,254
...
...
The header for #… means headers of vcf, just for simplicity. And how to select all genotypes identical for all 3 samples?
Before going on next, how to select specified genotype for ONE sample? bcftools view -i 'GT[0]~"^1.1"' trio.vcf
, GT[0]~”^1.1” means first sample’s GT is match pattern ‘^1.1’, which can be 1/1, 1|1.
So, we can use bcftools view -i 'GT[0]==GT[1]' trio.vcf
to select subvcf that genotype of sample1 and sample2 is identical. But! Bcftools will interrupt with “Comparison between samples is not supported, sorry!” Unfortunately, we can’t do comparions between samples using bcftools.
awk for substitution
awk '{if($0 !~ /^#/){split($(NF-2),arr1,":");split($(NF-1),arr2,":");split($NF,arr3,":");if (arr1[1]==arr2[1] && arr2[1]==arr3[1]){print $0}}else if($0 ~ /^#/)print $0 }' trio.vcf
This awk select identical genotypes in all trio samples. Take note that keep origin headers of vcf for downstream usage with bcftools.