Question: Identifying De Novo Variants In Trio Data
8.7 years ago by
Vivek2.5k wrote:

I have trio datasets that I have phased using GATK's PhaseByTransmission and ReadbackedPhasing walkers.

My target is to identify de novo mutations from this data.

I'm creating a candidate set of de novo mutations by selecting variants that are present in the offspring but in neither parent, and by looking for variant sites that show Mendelian violations.

I'd like to know how to proceed in filtering through this dataset to confidently ascertain variants that are de novo from the rest.

I'd appreciate any inputs/ideas on creating a methodology to go about this analysis.
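The candidate selection described above could be sketched as follows. This is only a minimal illustration, not a GATK feature: it assumes autosomal, diploid genotypes that have already been parsed into tuples of allele indices (0 = reference, None = missing call), and the function names are hypothetical.

```python
def is_hom_ref(gt):
    """True if every allele in the genotype tuple is the reference allele (0)."""
    return all(a == 0 for a in gt)


def is_mendelian_violation(child, mother, father):
    """True if the child's two alleles cannot be explained by one allele from each parent.

    Assumes all three genotypes are fully called diploid tuples such as (0, 1).
    """
    a, b = child
    return not ((a in mother and b in father) or (a in father and b in mother))


def is_candidate_de_novo(child, mother, father):
    """True if the child carries a non-reference allele and both parents are homozygous reference."""
    if None in child or None in mother or None in father:
        return False  # skip sites where any trio member has a missing call
    return any(a > 0 for a in child) and is_hom_ref(mother) and is_hom_ref(father)
```

Under these assumptions every candidate de novo site is also a Mendelian violation, but not the other way round: a homozygous-alternate child with one homozygous-reference parent violates Mendelian inheritance without fitting the "absent in both parents" pattern.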

modified 4.4 years ago by daniel30 • written 8.7 years ago by Vivek2.5k


Have you managed to identify de novo variants in your trio data? I am now working on the same problem. Could you please share your workflow or some scripts with me? Thanks in advance!

written 6.5 years ago by 89759864490
8.7 years ago by
Rochester, NY USA
Alex Paciorkowski3.4k wrote:

Your workflow might look something like this:

Generate VCF files of SNPs and indels for your trios with GATK, and then annotate them with ANNOVAR or SeattleSeq. Also use GATK to calculate the depth of coverage for your target.

Start by filtering out SNPs that are in dbSNP -- these are not likely to be pathogenic variants (but could be rare disease alleles, so be careful, you may need to go back and reanalyze...)

If you have scripting skills in something nice like Perl or Python, write a couple of scripts to pull out nonsynonymous (nonsense, missense) variants that obey your hypotheses (you mention de novo/sporadic). This gives you a shortened list of potential disease-causing variants.

Your depth-of-coverage data should let you weed out further variants in areas of low coverage, where the calls may be unreliable. Then again, be careful, they might not be, and you may need to go back and reanalyze...

Annotate your shorter list of variants through the Exome Variant Server, to kick out the variants seen there that are likely to be not-so-rare alleles that do not cause disease.

Mix well, and repeat steps as needed. Remember, you may need to alter key parameters at each step and reanalyze... If you are unlucky, you may need to pull in gene ontology data or data about gene function in other organisms to help you rank variants...

Finally, any variants you identify need to be validated with Sanger... and then the fun begins. You need to validate further by sequencing in larger cohorts or do some functional wet-lab experiments to generate biologically relevant data.

Good luck!

written 8.7 years ago by Alex Paciorkowski3.4k

Thanks for the input. I already annotate my VCFs with data from dbSNP, 1000 genomes and ESP variants so I can remove the variants with relatively high allele frequencies in these databases.

By doing a quick parsing with perl I'm still ending up with quite a high number, so I will likely need to look for further filtering criteria.

The read coverage for my data is around 90x, which is sufficiently good to expect quality variant calls.
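The allele-frequency filtering mentioned above might look roughly like this. It is a sketch under stated assumptions: the per-database frequencies are assumed to have been extracted from the VCF annotations into a dictionary already, the database keys are placeholders, and the 1% cutoff is an illustrative value rather than a recommendation.

```python
def passes_frequency_filter(freqs, max_af=0.01):
    """Keep a variant only if no database reports an allele frequency above max_af.

    freqs maps a database name to an allele frequency, or to None when the
    variant is absent from that database.
    """
    return all(af is None or af <= max_af for af in freqs.values())


# Hypothetical annotations for two variants
novel = {"dbSNP": None, "1000G": None, "ESP": None}
common = {"dbSNP": 0.12, "1000G": 0.10, "ESP": None}
kept = [name for name, f in [("novel", novel), ("common", common)]
        if passes_frequency_filter(f)]
```

For a true de novo search a much stricter cutoff, or outright absence from all databases, may be more appropriate; the threshold is left as a parameter for that reason.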

modified 8.7 years ago • written 8.7 years ago by Vivek2.5k

Yes, that can be the way it goes with sporadics. There can be more than enough sporadic variants. Do your trios have the same phenotype? If so, look for de novo nonsynonymous variants in genes shared among your probands. The same phenotype can also be caused by mutations in different genes in the same pathway, so some pathway analysis may help you. Are there known genes causing a phenotype similar to the one you are studying? Look for variants in genes in the same pathways (assuming any of these data are known... often they are not...)
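Looking for genes hit in more than one proband can be sketched as below. This assumes the de novo nonsynonymous calls have already been reduced to (proband, gene) pairs; the function name, identifiers, and the default of two probands are illustrative choices.

```python
from collections import defaultdict


def recurrent_genes(calls, min_probands=2):
    """Return genes carrying de novo variants in at least min_probands distinct probands.

    calls is an iterable of (proband_id, gene) pairs. The result maps each
    qualifying gene to the sorted list of probands in which it was hit.
    """
    by_gene = defaultdict(set)
    for proband, gene in calls:
        by_gene[gene].add(proband)  # a set, so repeated hits in one proband count once
    return {g: sorted(p) for g, p in by_gene.items() if len(p) >= min_probands}
```

The same grouping works for pathways instead of genes if each call is first mapped to a pathway identifier.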

modified 8.7 years ago • written 8.7 years ago by Alex Paciorkowski3.4k

I need to estimate the de novo mutation rate as well, so I don't think I can confine myself to nonsynonymous mutations. However, filtering on read depth at the candidate positions seems to be a promising option.

I could remove sites that have a low number of reads supporting the variant call in any of the trio samples.
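One way to read the read-support filter described above is sketched here. It assumes per-sample (reference reads, alternate reads) counts are available, for example from the AD field of a VCF, and all thresholds are illustrative defaults, not validated values.

```python
def passes_read_support(child, mother, father,
                        min_depth=20, min_child_alt=5, max_parent_alt=0):
    """Check read support for a candidate de novo site across the trio.

    Each sample is a (ref_reads, alt_reads) tuple. The site passes if every
    sample has enough total depth, the child has enough reads supporting the
    variant, and neither parent shows more than max_parent_alt variant reads.
    """
    for ref, alt in (child, mother, father):
        if ref + alt < min_depth:
            return False  # genotype call in this sample is poorly supported
    if child[1] < min_child_alt:
        return False  # too few reads carry the variant in the child
    return mother[1] <= max_parent_alt and father[1] <= max_parent_alt
```

Requiring depth in the parents matters as much as in the child: a parent with low coverage at the site may simply be an undetected heterozygote.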

written 8.7 years ago by Vivek2.5k
8.7 years ago by
JC12k wrote:

You will definitely need strong coverage and call-quality statistics at each candidate position, because a large portion of the candidates will be artefacts from the sequencer (each platform has its own biases). I also double-check with other SNP callers (samtools, VarScan, ...).

written 8.7 years ago by JC12k

The variants themselves are an intersection from GATK and Samtools callers but the phasing was done using GATK walkers. I'm looking at relevant publications to check for any existing methods as well.
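Intersecting two call sets, as described above, can be sketched like this. It assumes the calls from each tool have already been parsed into (chrom, pos, ref, alt) tuples with a consistent representation (for example, indels normalized the same way); the function names are hypothetical.

```python
def variant_keys(records):
    """Build a set of (chrom, pos, ref, alt) keys from an iterable of call records."""
    return {(chrom, pos, ref, alt) for chrom, pos, ref, alt in records}


def concordant_calls(calls_a, calls_b):
    """Return the variants reported by both callers."""
    return variant_keys(calls_a) & variant_keys(calls_b)


# Hypothetical calls from two callers
gatk_calls = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
samtools_calls = [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")]
shared = concordant_calls(gatk_calls, samtools_calls)
```

Matching on ref and alt as well as position avoids treating two different alleles at the same site as concordant.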

modified 8.7 years ago • written 8.7 years ago by Vivek2.5k

This is especially true if coverage is low. If you have good-quality coverage (~90-100x), however, my experience is that the proportion of sequencing artifacts remaining after running through BWA and samtools/Picard is rather low. By the time you are done with GATK and have generated VCF files, you should be dealing with mostly good-quality calls.

written 8.7 years ago by Alex Paciorkowski3.4k
4.4 years ago by
United Kingdom
daniel30 wrote:

Just thought I'd shamelessly give our new haplotype-based variant caller, octopus, a mention here. It has a built-in trio model that is able to classify called variants as de novo. There is no need for read pre-processing or messy post-hoc VCF intersections. Calls are phased by default.

We are in an early alpha release right now but are eager to get feedback, especially on the de novo calling (octopus also has standard germline calling, and a somatic caller built in).

written 4.4 years ago by daniel30

I'm interested in using your variant caller, but the link you provided isn't working. Could you please provide updated information?

written 4.0 years ago by alpha.biostat0

Octopus is now back online - the link should now work.

written 3.3 years ago by daniel30