How do I subset VCF based on genotype?
1
0
Entering edit mode
3.6 years ago
Tine ▴ 10

Hi, I would like to subset my VCF from LongShot based on the genotype. I would like to get all the heterozygous SNVs. The end goal is to make a set (in a python script) of the SNV sites to build a data frame. I got a script that works on Illumina data, but the VCF from LongShot is different and has no allele frequency (AF) I can filter on.

I would like to do something like (.sh file):

        bcftools filter -i "INFO/DP>=${DP} && SAMPLE/GT == 0|1 OR 1|0 OR 0/1 OR 1/0 && STRLEN(REF) == 1 && STRLEN(ALT) == 1 && N_ALT==1" longshot/longshot_vcf_subset/${SAMPLE}_${gene}_longshot_window${Window}_DP${DP}.vcf  > longshot/longshot_vcf_subset_filtered/${SAMPLE}_${gene}_longshot_Filtered_SNVs_window${Window}_DP${DP}.vcf

I know that this outputs the genotype, maybe It can help with a solution:

bcftools annotate -x '^FORMAT/GT' longshot/longshot_vcf_subset/GM07541_HTT_longshot_window40000_DP10.vcf | grep -v "^##" | cut -f 10-

The output of the "bcftools annotate":

 f 10-
SAMPLE
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
0|1

Here is a example of a LongShot VCF.

      #fileformat=VCFv4.2
    ##FILTER=<ID=PASS,Description="All filters passed">
    ##source=Longshot v0.4.0
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
    ##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
    ##INFO=<ID=AM,Number=1,Type=Integer,Description="Number of Ambiguous Allele Observations">
    ##INFO=<ID=MC,Number=1,Type=Integer,Description="Minimum Error Correction (MEC) for this single variant">
    ##INFO=<ID=MF,Number=1,Type=Float,Description="Minimum Error Correction (MEC) Fraction for this variant.">
    ##INFO=<ID=MB,Number=1,Type=Float,Description="Minimum Error Correction (MEC) Fraction for this variant's haplotype block.">
    ##INFO=<ID=AQ,Number=1,Type=Float,Description="Mean Allele Quality value (PHRED-scaled).">
    ##INFO=<ID=GM,Number=1,Type=Integer,Description="Phased genotype matches unphased genotype (boolean).">
    ##INFO=<ID=DA,Number=1,Type=Integer,Description="Total Depth of reads at any MAPQ (but passing samtools filter 0xF00).">
    ##INFO=<ID=MQ10,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=10.">
    ##INFO=<ID=MQ20,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=20.">
    ##INFO=<ID=MQ30,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=30.">
    ##INFO=<ID=MQ40,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=40.">
    ##INFO=<ID=MQ50,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=50.">
    ##INFO=<ID=PH,Number=G,Type=Integer,Description="PHRED-scaled Probabilities of Phased Genotypes">
    ##INFO=<ID=SC,Number=1,Type=String,Description="Reference Sequence in 21-bp window around variant.">
    ##FILTER=<ID=dn,Description="In a dense cluster of variants">
    ##FILTER=<ID=dp,Description="Exceeds maximum depth">
    ##FILTER=<ID=sb,Description="Allelic strand bias">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
    ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
    ##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
    ##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
    ##contig=<ID=4>
    ##contig=<ID=X>
    ##bcftools_filterVersion=1.9+htslib-1.9
    ##bcftools_filterCommand=filter -r 4:3036603-3116774 longshot/merged_vcf/GM07541_longshot_hp_merged.vcf.gz; Date=Sun Sep 27 16:37:34 2020
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
    4   3036630 .   T   C   130.36  PASS    DP=22;AC=9,10;AM=3;MC=2;MF=0.105;MB=0.059;AQ=10.89;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=130,36,301,96;SC=CTTTGTGAAGTGTCATATAAT  GT:GQ:PS:UG:UQ  1|0:128.75:1479218:0/1:92.32
    4   3038112 .   T   G   125.11  PASS    DP=22;AC=6,9;AM=7;MC=0;MF=0;MB=0.059;AQ=9.39;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=125,11,269,37;SC=ACCTTGAGGGTATGAATGAGT    GT:GQ:PS:UG:UQ  1|0:106.42:1479218:0/1:65.36
    4   3038415 .   A   G   213.25  PASS    DP=22;AC=6,15;AM=1;MC=1;MF=0.048;MB=0.059;AQ=15.81;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=213,25,339,17;SC=GGAGCTGCAGACCTGGGTTCC  GT:GQ:PS:UG:UQ  1|0:88.15:1479218:0/1:48.53
    4   3038551 .   T   C   131.62  PASS    DP=21;AC=8,11;AM=2;MC=1;MF=0.053;MB=0.059;AQ=12.26;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=131,62,297,39;SC=GTTCCTCAGGTTTTCTCCGGG  GT:GQ:PS:UG:UQ  1|0:126.43:1479218:0/1:92.17
    4   3038639 .   T   A   176.79  PASS    DP=21;AC=7,12;AM=2;MC=0;MF=0;MB=0.059;AQ=12.07;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=176,79,346,91;SC=TGGTGGCCTCTGTCCCTGTTC  GT:GQ:PS:UG:UQ  1|0:132.35:1479218:0/1:79.61
    4   3038713 .   A   G   189.89  PASS    DP=21;AC=7,14;AM=0;MC=2;MF=0.095;MB=0.059;AQ=15.31;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=189,89,322,93;SC=TGCATCCCTGACCTCTGTTTG  GT:GQ:PS:UG:UQ  1|0:95.27:1479218:0/1:55.73
    4   3039150 .   T   C   170.22  PASS    DP=22;AC=8,12;AM=2;MC=1;MF=0.05;MB=0.059;AQ=12.06;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=170,22,341,65;SC=CAGTTCTCGGTGGTGAAAGGG   GT:GQ:PS:UG:UQ  1|0:133.66:1479218:0/1:85.57
    4   3039311 .   C   T   148.91  PASS    DP=22;AC=5,14;AM=3;MC=1;MF=0.053;MB=0.059;AQ=11.25;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=148,91,260,41;SC=TGTGTGTGTCCGTGTGTGTGT  GT:GQ:PS:UG:UQ  1|0:73.73:1479218:0/1:41.45
    4   3039470 .   T   C   53  PASS    DP=21;AC=7,10;AM=4;MC=5;MF=0.294;MB=0.059;AQ=11.4;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=53,0,128,20;SC=AGTGAAACCCTGTCTCTACTA GT:GQ:PS:UG:UQ  1|0:37.31:1479218:0/1:80.94
    4   3039819 .   C   T   62.21   PASS    DP=20;AC=6,10;AM=4;MC=5;MF=0.312;MB=0.059;AQ=10.76;GM=1;DA=20;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=62,21,112,62;SC=CCTGCCACCCCGAGCCCCACT   GT:GQ:PS:UG:UQ  1|0:12.86:1479218:0/1:47.4
    4   3040573 .   A   C   7.29    PASS    DP=21;AC=6,4;AM=11;MC=0;MF=0;MB=0.059;AQ=6.04;GM=0;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=7,29,157,94;SC=TGGCACTCACAGTGCATCTCT GT:GQ:PS:UG:UQ  1|0:7.29:1479218:0/0:18.49
    4   3040587 .   T   C   91.14   PASS    DP=21;AC=6,9;AM=6;MC=0;MF=0;MB=0.059;AQ=10.22;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=91,14,224,40;SC=CATCTCTCCCTGCTCTGGCTC    GT:GQ:PS:UG:UQ  1|0:89.78:1479218:0/1:50.63
    4   3041890 .   G   A   20.94   PASS    DP=23;AC=11,8;AM=4;MC=4;MF=0.211;MB=0.059;AQ=9.53;GM=1;DA=23;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=20,94,0,4;SC=CCCCCTGGCCGGGCACGCTGG   GT:GQ:PS:UG:UQ  0|1:20.94:1479218:0/1:21.35
    4   3093312 .   G   A   324.82  PASS    DP=21;AC=0,19;AM=2;MC=0;MF=0;MB=0;AQ=13.64;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=324,82,152,74;SC=TATAAATATCGTTAAGAAAAA  GT:GQ:PS:UG:UQ  1/1:152.72:.:1/1:53.17
snp • 1.0k views
ADD COMMENT
2
Entering edit mode
3.6 years ago

Hi,

There are indeed different ways to do this, including via standard shell commands. However, bcftools view should be able to do it for you:

bcftools view -g het NEO29T1.vcf ;

If you have multi-allelic sites, you may first want to split these into separate records:

bcftools norm -m-any NEO29T1.vcf | bcftools view -g het ;

Kevin

ADD COMMENT
1
Entering edit mode

Thank you so much! What a nice and simple solution. I need to remember to check the bcftools documentation next time something like this is bugging me!

ADD REPLY

Login before adding your answer.

Traffic: 2584 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6