Hi,
I have 384 samples I want to genotype with the GBS technology. This is the command I use :
sbatch -p gdec -c 128 --mem=128G --wrap='bcftools mpileup \
--threads 128 \
-q 10 -Q 20 \
--skip-indels \
--max-depth 500 \
-f reference.fasta \
*.bam \
--annotate FORMAT/AD,FORMAT/DP,INFO/AD | \
bcftools call \
--threads 128 \
-mv \
-Oz -o calls.denovo.vcf.gz'
I had a look on some genotypes, that I verified in the corresponding BAM files. Here are the alignments for a single sample at a specific position, here are four SNPs. I sorted the alignment by MAPQ in IGV (in red is the MAPQ threshold at 10) , and the depth at this locus is around 580X (I can't show all the screen)
It seems logical, as I applied the -q 10
filter, only the first and last variants are detected. The first variant is heterozygous in reality (but considered as homozygous with the >MAPQ10 reads). The two middle variants are only supported by <MAPQ10 reads, and then, it appears as reference here (although some reads carry the snp).
I did a new calling only with -q 1
and --max-depth 1000
to consider all the reads, and then, to detect the four variants. Here is my problem : the two middle snps are still not detected, the last one is fine (1/1), but the first one is :
h2tg000001l;385471;A;G;1/1:79,6,0:577:277,300
The AD is "277,300" which is heterozygous. Why do I get a 1/1 genotype? Also, why the two middle variants are not detected whereas they are 0/1 if I consider all the reads (-q 1)?
Do I miss something? Do I do something wrong? I get a bit lost..
Best
Why you are lowering the quality score to force the result you expect? Never eye-ball alignments, the variant callers use a bunch of different statistics and heuristics to call variants.
Are you expecting four variants because its an experimentally defined mixture? I think if you are trying to fudge the data to get what you expect, you should change these parameters in the
mpileup
command:set the max depth to 10,000 and
-q
and-Q
to zero.I don't suggest doing that, because it is the equivalent of P-hacking your variants.
I usually use those parameters -q 10 -Q 20 when I run
bcftools mpileup
, to get significant variants. I just try to trick the parameters to understand the situation.I did what you (don't) suggest but the results are the same, but I got exactly the same results.The first variant appears as 1/1 wheareas I should get a 0/1 genotype (and I still have the two not detected variants). One of the main problem is that the AD for the 1/1 genotype is
282,306
which is heterozygous, right? I should get a 0/1 genotypeAny suggestion?
I have no idea why you can not force the output. But at the same time, its (IMO) a fools errand (no offense) and a waste of time.
Maybe try not using
-v, --variants-only
flag?I do not have idea too actually. I just ask because there's a difference between what I get from bcftools and the IGV output. Which means
bcftools
gives wrong output? I already tried without the-v
flag. The main question is : whyGT:PL:DP:AD 1/1:79,6,0:577:277,300
outputs as 1/1 ? And not as 0/1 ?