phasing variants to find de novos
5.0 years ago
lait ▴ 170

Hi,

I am trying to build a pipeline for detecting de novo mutations in trio WES data. After obtaining the VCF file and performing the recalibration and refinement steps, I have reached the step where I have to phase my variants.

As I have understood, please correct me if the following is wrong:

1- Phasing means relating the genotypes to their paternal and maternal origin.

2- De novo mutations will not be phased, because they violate the Mendelian laws of inheritance.

3- PhaseByTransmission can do this job perfectly, i.e. phase variants and find de novos.

phasing de novo NGS whole exome sequencing • 2.5k views

I'm not sure phasing is the best way to find de novo variants. Why not just look in the VCF for variants that are present in the child but not in the parents?
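To make that suggestion concrete, here is a minimal sketch of the per-site check. The function name and the tuple encoding of genotypes are my own assumptions for illustration, not part of any tool mentioned in this thread, and it ignores genotype quality and read depth entirely.

```python
def is_denovo_candidate(child, mother, father):
    """Return True if the child carries an allele seen in neither parent.

    Genotypes are tuples of allele indices as in a VCF GT field
    (0 = reference, 1+ = alternate); None marks a missing call.
    This encoding is an assumption made for the example.
    """
    calls = (child, mother, father)
    if any(gt is None or None in gt for gt in calls):
        return False  # cannot judge a site with a missing genotype
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


print(is_denovo_candidate((0, 1), (0, 0), (0, 0)))  # True
print(is_denovo_candidate((0, 1), (0, 1), (0, 0)))  # False
```

A check like this flags every site where a parental genotype was miscalled, so the candidates would still need filtering.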

Because this will produce a large number of de novos.

And that's not what you want? A variant that is found in a child but in neither parent is by definition a de novo variant, no?

Yes, I think the OP is perhaps assuming that phasing is what allows you to find de novos (by looking for variants which are not phased). But this is certainly not reliable, e.g.:

• phasing by transmission may fail to phase germline variants if it cannot be determined which parent transmitted which allele (e.g. a site where all members of a trio are heterozygous)

• read backed phasing may fail to phase germline variants if there are no nearby heterozygotes to phase with respect to.

• de novo variants may be phased physically with respect to nearby pedigree phased variants, and this would let you identify which haplotype is affected (e.g. to identify compound heterozygosity)

So, phasing is useful for its own reasons, but finding de novos is not one of them.
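The first bullet can be illustrated with a small sketch of phasing by transmission at a single site. This is only an illustration under my own assumptions (the function name and genotype encoding are hypothetical), not how PhaseByTransmission or any other tool implements it.

```python
from itertools import permutations


def phase_by_transmission(child, mother, father):
    """Return (maternal_allele, paternal_allele) for the child, or None.

    Genotypes are tuples of two allele indices. None is returned when the
    trio is inconsistent with Mendelian inheritance, or when more than one
    assignment of the child's alleles to the parents is possible.
    """
    consistent = {
        (m, p)
        for m, p in permutations(child, 2)
        if m in mother and p in father
    }
    return consistent.pop() if len(consistent) == 1 else None


# Father is the only possible source of allele 1, so the site can be phased.
print(phase_by_transmission((0, 1), (0, 0), (0, 1)))  # (0, 1)
# All three members heterozygous: a germline site that cannot be phased.
print(phase_by_transmission((0, 1), (0, 1), (0, 1)))  # None
# De novo allele 1: also None, indistinguishable from the case above.
print(phase_by_transmission((0, 1), (0, 0), (0, 0)))  # None
```

The last two calls return the same result, which is why "unphased" cannot be read as "de novo".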

Thanks a lot. Now it is all starting to make sense. As you said, I thought phasing was a way to find de novos. This leads me to ask: what benefit would someone get from phasing de novos (or variants in general)? Is it just to know which haplotype carries the de novo, and thus whether the father or the mother is (in a way) the source of it? Or are there other benefits?

5.0 years ago
Len Trigg ★ 1.6k

1) There are two common types of phasing: phasing by pedigree (by relating the alleles to the paternal or maternal origin via allele transmission), or local phasing (by determining which variants are locally in phase with each other, either by something like read backed phasing, or variant callers that directly call local haplotypes).

2) De novo mutations cannot be directly phased by pedigree, but they can be locally phased with respect to nearby variants (which may themselves be phased by pedigree).

3) You could use PhaseByTransmission, but you will obtain a better overall result if you jointly call the family using a pedigree-aware variant caller, such as rtg family or (for larger pedigrees) rtg population from RTG Core. This is because the pedigree-aware joint calling allows the evidence for each of the samples to influence the calls in other members of the pedigree (in rtg population, you can even use this to impute genotypes for missing family members during calling, which gets better the more family members you have). These callers automatically phase the output variants according to the pedigree, and directly output VCF annotations indicating putative de novo variants (including a de novo specific score).

(RTG Core also includes the rtg mendelian command, which is useful for annotating VCFs for Mendelian inheritance errors etc. This command is also present in the smaller utility package RTG Tools.) Disclaimer: I work for RTG :-).

So, does using rtg family have any advantage over using the Genotype Refinement workflow of GATK?

In this workflow, they annotate possible de novos taking the pedigree information into consideration. I should mention that in prior steps (before doing the genotype refinement), when variants are called, HaplotypeCaller is used and ReadBackedPhasing is performed, but without taking the pedigree information into consideration.

So GATK's workflow is something like:

• 1- Preprocessing (alignment, marking duplicates, recalibration, etc.)
• 2- Variant calling (HaplotypeCaller, where ReadBackedPhasing is used without pedigree information)
• 3- Joint genotyping
• 4- Variant recalibration
• 5- Genotype Refinement workflow, where pedigree information is used and possible de novos are annotated.

Do you think this would be a good approach to find denovos, or should I consider rtg family?

AFAIK, HaplotypeCaller is not actually using the ReadBackedPhasing algorithm, but does do local physical phasing based on the reads crossing the haplotypes it is calling within its active region. The differences will depend on things like the size of the HC active region etc. RTG currently does not do local assembly, so RTG haplotype lengths are limited to less than the read length. HC does do local assembly, so will call longer active regions than RTG.

For steps 2, 3 and 5, I would definitely favour the RTG style approach where these are all done simultaneously, for the following reason: Calling the samples separately can introduce arbitrary but equivalent differences in variant representation that make it look like samples do not share variants when in fact they do. This can confound subsequent steps 3 and 5 which assume identical representations. For more information on this representation issue, see this brief example from hap.py or read about RTG vcfeval on bioRxiv.
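As a toy illustration of the representation issue (a sketch with a made-up six-base reference and a hypothetical helper, not code from hap.py or vcfeval), two different VCF-style records can describe exactly the same haplotype:

```python
def apply_variant(reference, pos, ref, alt):
    """Apply one VCF-style variant (1-based pos) and return the haplotype."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference")
    return reference[:start] + alt + reference[start + len(ref):]


reference = "TCACAG"  # made-up sequence containing a CA repeat

# The same 2 bp deletion written at two different positions in the repeat.
left_aligned = apply_variant(reference, 1, "TCA", "T")
right_shifted = apply_variant(reference, 2, "CAC", "C")

print(left_aligned, right_shifted)    # TCAG TCAG
print(left_aligned == right_shifted)  # True
```

If the child is written one way and a parent the other, a position-by-position comparison would treat a shared deletion as two unrelated variants.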

The split approach is more scalable if you are calling 1000s of samples, but for trios simultaneous calling is better.

For step 4 RTG automatically performs an equivalent step via pre-built AVR models (we've also had success using the AVR model pre-built for somatic mutation ranking in cancer to rank de novo and mosaic variants).

If you can't get GATK to simultaneously call the whole trio, you could use rtg family for the simultaneous calling and then run ReadBackedPhasing to attempt physical phasing of the de novos with respect to the pedigree-phased calls.

5.0 years ago

As far as I'm aware:

1: Phasing tries to relate the reads/genotypes back to the paternal or maternal origin of the genomic DNA.

2: True de novo mutations are Mendelian violations, and phasing wouldn't work, as by nature they're different from the paternal and maternal DNA. You shouldn't really need phasing to identify de novo mutations.

3: From GATK, PhaseByTransmission and ReadBackedPhasing are able to assist in phasing, but if you run the gVCF and joint genotyping workflow, ReadBackedPhasing should be performed automatically.

Thanks Andrew for your answer. Could you please clarify why it is not possible to phase de novo mutations by pedigree? E.g. if the mother has Aa, the father has aa, and the child has Ab, wouldn't it be possible to deduce that A (in the child) was transmitted from the mother, and therefore that the de novo mutation 'b' is on the haplotype from the father?

Let's say you are looking at small variants, e.g. SNPs as your inheritance unit. SNPs occur roughly one per kilobase (in humans). So when a de novo variant occurs, in 999 cases out of 1000 you will see mother aa, father aa, child ab. That's why you need the physical phasing to link the de novo with other variants that /can/ be phased by pedigree.
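A rough sketch of that physical linking step, assuming reads have already been reduced to a mapping of position to observed base (the function name, the read encoding, and the positions are all hypothetical; real read-backed phasing also weighs base and mapping qualities):

```python
from collections import Counter


def link_by_reads(reads, denovo_pos, denovo_alt, het_pos):
    """Return the allele at het_pos seen most often on reads that carry
    the de novo allele, or None if no read spans both sites or the
    evidence is tied."""
    counts = Counter(
        read[het_pos]
        for read in reads
        if read.get(denovo_pos) == denovo_alt and het_pos in read
    )
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # equal support for two alleles, so no call
    return ranked[0][0]


# Each read maps position -> base (toy data, not from a real sample).
reads = [
    {100: "T", 130: "G"},
    {100: "T", 130: "G"},
    {100: "C", 130: "A"},
    {100: "T"},  # does not reach the heterozygous site
]
print(link_by_reads(reads, 100, "T", 130))  # G
```

If pedigree phasing had already assigned the G at position 130 to, say, the paternal haplotype, the de novo at position 100 would be placed on that haplotype too.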