Question

Diploid vs Haploid calls

0

Entering edit mode

6.8 years ago

Ghani • 0

Hi everyone,

My question is a basic one, but very important to understanding how sequence processing actually works.

What determines (how does a variant calling program know) whether a specific locus in a sequenced dna fragment should be diploid (e.g A/T or C/G) or haploid (e.g C/C, T/T).

Is it based on what the reads are for that locus, for example, if at least 25% of the reads for that locus are C, and the remainder T, then the position will be called C/T, whereas if less than 25% of the reads are C, and > 75% of the reads are T, then the position will be called T/T (the 1 or 2 Cs will be discarded as sequencing errors). My guess is if about 50% of the reads are C and the rest T, then the call may be C/T for that position, but I could be wrong, and I am not sure how the genotype for the reference at that position factors in.

or is it based on whether the human reference is diploid or haploid for that locus or a combination. Please give specific examples to illustrate how reads look for a diploid vs haploid call.

Thanks

sequencing alignment Assembly • 3.9k views

ADD COMMENT • link updated 6.8 years ago by Brian Bushnell 20k • written 6.8 years ago by Ghani • 0

score 1 · Answer 1 · 2017-07-07

The origin of the DNA that was sequenced determines the ploidy of it. How many sets of chromosomes were present to begin with? The variant caller will attempt to match that assumption as much as possible.

From your question though it appears that you are really asking about homozygous or heterozygous variants. Which is a related yet a different concept.

When the ploidy of the sample is known the variant caller might use different error models to decide the likelihood of these two variant types. It may still produce heterozygous calls if the data supports those, even when the ploidy is set to 1 (haploid).

score 1 · Answer 2 · 2017-07-07

1

Entering edit mode

6.8 years ago

Brian Bushnell 20k

Assuming mapped reads are equally likely to originate from either chromosome copy in a diploid, you can use binomial distribution math to calculate the probability that the allelic balance indicates a heterozygous site. For example, it's extremely unlikely under that assumption for a heterozygous site to have 95 reads indicating the reference and 5 indicating the variant (this would more likely be caused by other factors like errors, mismapping, mozaicism, etc), but it would not be unlikely to see 5 reads indicating the ref and 2 indicating the variant. Quality scores and other factors often used for calculating variant quality may also contribute to the ploidy call.

ADD COMMENT • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

I think what you are thinking is consistent with what I said. Best to illustrate with an example:

Let's assume that the sequenced locus is actually C/T. What we likely hopefully see assuming a read depth of 20 at that position, is approximatley 10 C reads and 10 T reads (correct me if I'm wrong). If the human reference is TT at that locus, then if everything goes will the caller should call CT, right?

In the 2nd example, let's assume that a single strand library is prepared (which is often the case with say ancient DNA that is degraded and has less than 5% endogenous material). In this case things get a little tricky I would think, because let's assume the actual SNP sequenced is a C and the read depth is 3, and we get 2 C reads and one T read. If the variant caller comes back with a CT because the reference is CT at that locus, we may have a problem. Is there a way to tell the variant caller that the actual sequenced material was single strand, and for it to return only 1 allele?

ADD REPLY • link 6.8 years ago by Ghani • 0

1

Entering edit mode

The first case seems correct.

For the second case, with 1 read indicating a variant, many callers will call ref/ref by default simply because calling variants indicated by a single read will result in too much noise, even though the allele frequency is 33% of the reads in that case.

But I'm not sure what you mean by "single strand library" or why that matters. Are you talking about the rates of some bases changing into other bases when the DNA is very old, but the complementary base being unaffected?

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks Brian, by single strand library, I mean those ancient DNA cases where the 2 strands are seperated and only 1 strand is sequenced..

In this situation would'nt it be better to tell the caller to make haploid calls, and if so what commands would you use with GATK Haplotypecaller. I have used BCF tools and Freebayes but for me Haplotypecaller has been the best.

ADD REPLY • link 6.8 years ago by Ghani • 0

0

Entering edit mode

I don't see that this matters. As long as there is sufficient depth, DNA will be recovered from both chromosomal copies and both strands at any given location.

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

score 0 · Answer 3 · 2017-07-07

0

Entering edit mode

6.8 years ago

Pierre Lindenbaum 161k

see

Mathematical Notes on SAMtools Algorithms

https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf

... good luck...

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 161k