Hi, I would like to subset my VCF from LongShot based on the genotype. I would like to get all the heterozygous SNVs. The end goal is to make a set (in a python script) of the SNV sites to build a data frame. I got a script that works on Illumina data, but the VCF from LongShot is different and has no allele frequency (AF) I can filter on.
I would like to do something like (.sh file):
bcftools filter -i "INFO/DP>=${DP} && SAMPLE/GT == 0|1 OR 1|0 OR 0/1 OR 1/0 && STRLEN(REF) == 1 && STRLEN(ALT) == 1 && N_ALT==1" longshot/longshot_vcf_subset/${SAMPLE}_${gene}_longshot_window${Window}_DP${DP}.vcf > longshot/longshot_vcf_subset_filtered/${SAMPLE}_${gene}_longshot_Filtered_SNVs_window${Window}_DP${DP}.vcf
I know that this outputs the genotype, maybe It can help with a solution:
bcftools annotate -x '^FORMAT/GT' longshot/longshot_vcf_subset/GM07541_HTT_longshot_window40000_DP10.vcf | grep -v "^##" | cut -f 10-
The output of the "bcftools annotate":
f 10-
SAMPLE
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
1|0
0|1
Here is a example of a LongShot VCF.
#fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=Longshot v0.4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##INFO=<ID=AM,Number=1,Type=Integer,Description="Number of Ambiguous Allele Observations">
##INFO=<ID=MC,Number=1,Type=Integer,Description="Minimum Error Correction (MEC) for this single variant">
##INFO=<ID=MF,Number=1,Type=Float,Description="Minimum Error Correction (MEC) Fraction for this variant.">
##INFO=<ID=MB,Number=1,Type=Float,Description="Minimum Error Correction (MEC) Fraction for this variant's haplotype block.">
##INFO=<ID=AQ,Number=1,Type=Float,Description="Mean Allele Quality value (PHRED-scaled).">
##INFO=<ID=GM,Number=1,Type=Integer,Description="Phased genotype matches unphased genotype (boolean).">
##INFO=<ID=DA,Number=1,Type=Integer,Description="Total Depth of reads at any MAPQ (but passing samtools filter 0xF00).">
##INFO=<ID=MQ10,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=10.">
##INFO=<ID=MQ20,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=20.">
##INFO=<ID=MQ30,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=30.">
##INFO=<ID=MQ40,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=40.">
##INFO=<ID=MQ50,Number=1,Type=Float,Description="Fraction of reads (passing 0xF00) with MAPQ>=50.">
##INFO=<ID=PH,Number=G,Type=Integer,Description="PHRED-scaled Probabilities of Phased Genotypes">
##INFO=<ID=SC,Number=1,Type=String,Description="Reference Sequence in 21-bp window around variant.">
##FILTER=<ID=dn,Description="In a dense cluster of variants">
##FILTER=<ID=dp,Description="Exceeds maximum depth">
##FILTER=<ID=sb,Description="Allelic strand bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
##contig=<ID=4>
##contig=<ID=X>
##bcftools_filterVersion=1.9+htslib-1.9
##bcftools_filterCommand=filter -r 4:3036603-3116774 longshot/merged_vcf/GM07541_longshot_hp_merged.vcf.gz; Date=Sun Sep 27 16:37:34 2020
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
4 3036630 . T C 130.36 PASS DP=22;AC=9,10;AM=3;MC=2;MF=0.105;MB=0.059;AQ=10.89;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=130,36,301,96;SC=CTTTGTGAAGTGTCATATAAT GT:GQ:PS:UG:UQ 1|0:128.75:1479218:0/1:92.32
4 3038112 . T G 125.11 PASS DP=22;AC=6,9;AM=7;MC=0;MF=0;MB=0.059;AQ=9.39;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=125,11,269,37;SC=ACCTTGAGGGTATGAATGAGT GT:GQ:PS:UG:UQ 1|0:106.42:1479218:0/1:65.36
4 3038415 . A G 213.25 PASS DP=22;AC=6,15;AM=1;MC=1;MF=0.048;MB=0.059;AQ=15.81;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=213,25,339,17;SC=GGAGCTGCAGACCTGGGTTCC GT:GQ:PS:UG:UQ 1|0:88.15:1479218:0/1:48.53
4 3038551 . T C 131.62 PASS DP=21;AC=8,11;AM=2;MC=1;MF=0.053;MB=0.059;AQ=12.26;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=131,62,297,39;SC=GTTCCTCAGGTTTTCTCCGGG GT:GQ:PS:UG:UQ 1|0:126.43:1479218:0/1:92.17
4 3038639 . T A 176.79 PASS DP=21;AC=7,12;AM=2;MC=0;MF=0;MB=0.059;AQ=12.07;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=176,79,346,91;SC=TGGTGGCCTCTGTCCCTGTTC GT:GQ:PS:UG:UQ 1|0:132.35:1479218:0/1:79.61
4 3038713 . A G 189.89 PASS DP=21;AC=7,14;AM=0;MC=2;MF=0.095;MB=0.059;AQ=15.31;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=189,89,322,93;SC=TGCATCCCTGACCTCTGTTTG GT:GQ:PS:UG:UQ 1|0:95.27:1479218:0/1:55.73
4 3039150 . T C 170.22 PASS DP=22;AC=8,12;AM=2;MC=1;MF=0.05;MB=0.059;AQ=12.06;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=170,22,341,65;SC=CAGTTCTCGGTGGTGAAAGGG GT:GQ:PS:UG:UQ 1|0:133.66:1479218:0/1:85.57
4 3039311 . C T 148.91 PASS DP=22;AC=5,14;AM=3;MC=1;MF=0.053;MB=0.059;AQ=11.25;GM=1;DA=22;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=148,91,260,41;SC=TGTGTGTGTCCGTGTGTGTGT GT:GQ:PS:UG:UQ 1|0:73.73:1479218:0/1:41.45
4 3039470 . T C 53 PASS DP=21;AC=7,10;AM=4;MC=5;MF=0.294;MB=0.059;AQ=11.4;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=53,0,128,20;SC=AGTGAAACCCTGTCTCTACTA GT:GQ:PS:UG:UQ 1|0:37.31:1479218:0/1:80.94
4 3039819 . C T 62.21 PASS DP=20;AC=6,10;AM=4;MC=5;MF=0.312;MB=0.059;AQ=10.76;GM=1;DA=20;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=62,21,112,62;SC=CCTGCCACCCCGAGCCCCACT GT:GQ:PS:UG:UQ 1|0:12.86:1479218:0/1:47.4
4 3040573 . A C 7.29 PASS DP=21;AC=6,4;AM=11;MC=0;MF=0;MB=0.059;AQ=6.04;GM=0;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=7,29,157,94;SC=TGGCACTCACAGTGCATCTCT GT:GQ:PS:UG:UQ 1|0:7.29:1479218:0/0:18.49
4 3040587 . T C 91.14 PASS DP=21;AC=6,9;AM=6;MC=0;MF=0;MB=0.059;AQ=10.22;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=91,14,224,40;SC=CATCTCTCCCTGCTCTGGCTC GT:GQ:PS:UG:UQ 1|0:89.78:1479218:0/1:50.63
4 3041890 . G A 20.94 PASS DP=23;AC=11,8;AM=4;MC=4;MF=0.211;MB=0.059;AQ=9.53;GM=1;DA=23;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=20,94,0,4;SC=CCCCCTGGCCGGGCACGCTGG GT:GQ:PS:UG:UQ 0|1:20.94:1479218:0/1:21.35
4 3093312 . G A 324.82 PASS DP=21;AC=0,19;AM=2;MC=0;MF=0;MB=0;AQ=13.64;GM=1;DA=21;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;PH=324,82,152,74;SC=TATAAATATCGTTAAGAAAAA GT:GQ:PS:UG:UQ 1/1:152.72:.:1/1:53.17
Thank you so much! What a nice and simple solution. I need to remember to check the bcftools documentation next time something like this is bugging me!