Question: VariantToTable not calling HET HOM etc
0
gravatar for geneart$$
10 weeks ago by
geneart$$40
United States
geneart$$40 wrote:

Hi all,

I was trying to get stats on # HOM VAR #HET etc from my vcf file using VariantsToTable. As below:

gatk VariantsToTable
 -R /hg38/Homo_sapiens_assembly38.fasta \
-V all_jointcalls_annRegion.vcf \
-F CHROM -F ID -F POS -F REF -F ALT -F QUAL -F DP \
-GF GT -GF GQ  -GF HET -GF HOM-REF \
-O all_jointcalls_trial.table

But no HET or HOM etc is called across board !

My output:

CHROM   ID      POS     REF     ALT     QUAL    DP      sample1.GT   sample1.GQ     sample1.HET    sample1.HOM-REF 
chr2    rs2303425    47403074        T       C       11431.45        4041    T/T     99    NA      NA

However when I used Dave Tang's way: https://github.com/davetang/learning_vcf_file#extracting-info-fields

bcftools stats -s - all_jointcalls_annRegion.vcf | grep -A 169 > "Per-sample counts"

I get counts:

# PSC, Per-sample counts. Note that the ref/het/hom counts include only SNPs, for indels see PSI. The rest include both SNPs and indels.
# PSC   [2]id   [3]sample       [4]**nRefHom**      [5]**nNonRefHom**   [6]**nHets**      [7]nTransitions [8]nTransversions       [9]nIndels      [10]average depth       [11]nSingletons [12]nHapRef
     [13]nHapAlt     [14]nMissing
PSC     0       sample1        **658**    **38**     **65**     63      40      0       3.6     8       0       0       303

This is true across all variants across all samples that I have in my multi sample vcf file. bcftools gives me numbers whereas GATK VariantsToTable is a NA.

What is not right here with VariantsToTable`? I read somewhere on GATK posts on their site that they do not recommend other ways of parsing vcf generated by GATK, so a little hesitant using bcftools. Anything wrong in parsing it with bcftools?

variantstotable gatk • 180 views
ADD COMMENTlink modified 10 weeks ago by _r_am31k • written 10 weeks ago by geneart$$40
1
gravatar for Kevin Blighe
9 weeks ago by
Kevin Blighe68k
Republic of Ireland
Kevin Blighe68k wrote:

Can you confirm that you have first posted this on GATK Support Forum? That would be the better place to post.

Similar / related answers:

Kevin

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Kevin Blighe68k
1

Thankyou Kevin. Maybe GATK variantsToTable works well with single sample vcf rather than multi sample vcf. I did go through your code from the links posted and that was great ! Thankyou so much!

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by geneart$$40

Hi Kevin, I tried your code from the link above A: How to get sample names and genotype for SNP in multi-sample VCF file and it works well and does what I need. However I noticed a small discrepancy. I have total of 2931 variants across all my samples in my cohort across 8 genes I am interested in. I broke it down to the genes I am interested in and I get totals per gene as follows:

gene 1: 1065
gene 2: 190
gene 3: 107
gene 4: 305
gene 5: 84
gene 6: 946
gene 7: 125
gene 8: 109
Total add to 2931 and I am good !

Problem : I looked at distribution of homRef, homALT and het across a subset of 10 individuals for gene 1. Per sample their homRef, homALt, Het, nocalls must add to 1065. However in couple samples I see they add to 1063 and in 1 sample it comes to 1062. Why is it missing the counts in those cases? Code was run as is without change.I normalised my vcf as you had mentioned in your previous link.

Could it be that in those samples with lower counts, that there are multiallelics that are just counted once? hence the discrepancy?whereas in the samples that counted to exact 1065, they all were biallelic? I am not quite understanding this part. Any insight would help me better understand. Thankyou in advance !

ADD REPLYlink modified 9 weeks ago by Kevin Blighe68k • written 9 weeks ago by geneart$$40
1

Hi and thanks for testing that code. With no way for me to reproduce this observed behaviour, I am not to know what could be the issue. Multi-allelic sites are common causes of inconsistencies, and also sites that are missing, like ./.. In this case, it is likely missing sites because I do not seem to account for those in my code.

If you isolate the 'problematic' samples, can you observe any missing genotypes?

ADD REPLYlink written 9 weeks ago by Kevin Blighe68k
1

Kevin, my apologies. Your code works fine. Too many things happening simultaneously for me ! going crazy ! The output I was mentioning was from using bcftools stats and that is where I am getting variations.

I cant understand why bcftools would miss a couple counts. I dont have much experience with bcftools, I implementing it and learning about what all it can do simultaneously. From your experience ( or anyone else reading the forum) any suggestions?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by geneart$$40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1878 users visited in the last hour