Question: Correct way of merging samples for father, mother, child trio variant calling
eurioste wrote:

I am new to NGS data analysis and I'm working on a multiple-sample variant calling workflow. I have Illumina MiSeq fastq files (paired end, raw reads) for a father, mother and child trio, one pair for each individual, totalling 6 files. I could trim, align, pre-process and call variants for each individual pair separately (I'm skipping indel realignment and quality recalibration for the sake of simplicity, as this workflow is intended for learning only), but I wish to merge the samples into a single file. I would like the alignment step (with BWA-MEM), the pre-processing steps (with Picard) and the variant calling step (with FreeBayes) to be done at once for all samples, if that is possible and correct, while taking into consideration the correct paired-end mates and the respective read groups (when applicable).

My final goal is to obtain a single vcf file from which I'll compute the total number of different kinds of variants.

At which step, in which file format and with which Galaxy tools can I merge the samples so that I get correct, meaningful results at the variant calling step?

Tags: bam, variant calling, vcf
modified 20 months ago by Pierre Lindenbaum • written 20 months ago by eurioste

I would suggest following the GATK best practices.

written 20 months ago by WouterDeCoster

Hello eurioste!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?p=208960

This is typically not recommended as it runs the risk of annoying people in both communities.

written 20 months ago by WouterDeCoster

Sorry, I wasn't aware this was a bad practice, but, why could it annoy someone?

modified 20 months ago • written 20 months ago by eurioste

Because people here will spend time answering you when you no longer care, since the question has already been answered on another site.

written 20 months ago by Pierre Lindenbaum

Because we are a finite pool of volunteers who give up our time to help people, and it is inefficient if someone on SEQanswers and someone here both invest time in answering the same question.

modified 20 months ago • written 20 months ago by WouterDeCoster

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Avoid merging the VCFs; call all the samples at the same time:

samtools mpileup [options...] father.bam mother.bam child.bam | bcftools call  [options...] > result.vcf

or something like:

java -jar GATK.jar -T HaplotypeCaller [options...] -o result.vcf -I father.bam -I mother.bam -I child.bam

or you can use GATK + gVCF calling : http://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf

or you can merge the 3 VCF later with https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5-0/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php , but you'll get some missing genotypes (./.) where you can't tell if the sample was HOM_REF or if there was not enough information to do the calling.
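To make the ./. ambiguity concrete, here is a minimal Python sketch. It assumes a merged VCF with the three samples in the columns after FORMAT; the records are made up for illustration and `count_missing` is a hypothetical helper, not part of GATK or any other tool.

```python
def count_missing(vcf_lines):
    """Count missing genotypes (./. or .|.) per sample in VCF text lines."""
    samples, missing = [], {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            missing = {s: 0 for s in samples}
            continue
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            if all(a == "." for a in alleles):
                missing[sample] += 1
    return missing


# Made-up records: each site was called in only some of the single-sample
# VCFs, so the merged record has no genotype for the other samples.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tchild",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t./.\t./.\t0/1",
    "chr1\t2000\t.\tC\tT\t60\tPASS\t.\tGT\t0/1\t./.\t0/1",
]
print(count_missing(example))
```

Every ./. counted here could be either a true HOM_REF site or a site without enough reads, which is exactly the information that joint calling preserves and post-hoc merging loses.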

written 20 months ago by Pierre Lindenbaum

Calling the samples jointly not only has the benefit that you don't need to merge singleton VCFs, it also means that the calling of the samples will generally be better and more consistent:

  • you are less likely to have the variant representation issues that can arise in independent calling and make it look as if the samples' variants are different when they are actually the same. (Alternatively, this can be alleviated by post-calling normalization and/or comparison using haplotype-aware tools like hap.py or vcfeval.)
  • calling jointly lets the caller share information about alleles and their frequencies across the samples, giving more accurate calls.
  • joint pedigree-aware calling can improve the calls of the family members even further, giving the same call accuracy as if you had sequenced the samples to a higher level of coverage (although if your per-sample coverage is getting >100x these benefits may be minor), and allowing automatic identification of putative de novo variants. If you want to include the pedigree information directly during the calling, I recommend using the pedigree-aware callers from RTG Core (these can support families with more than three members, and multi-generation pedigrees). (disclaimer, I work for RTG)
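
The "putative de novo" idea in the last bullet can be illustrated with a deliberately naive Python sketch: flag a site when the child carries an allele that neither parent has. The helpers `parse_gt` and `is_putative_de_novo` are hypothetical, and this is not how RTG Core or any pedigree-aware caller works; those model genotype likelihoods rather than comparing hard genotype calls.

```python
def parse_gt(gt):
    """Return the set of allele indices in a GT string, or None if any is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return {int(a) for a in alleles}


def is_putative_de_novo(father_gt, mother_gt, child_gt):
    """True if the child has an allele seen in neither parent (hard calls only)."""
    father, mother, child = (parse_gt(g) for g in (father_gt, mother_gt, child_gt))
    if father is None or mother is None or child is None:
        return False  # cannot decide without all three genotypes
    return any(a not in father and a not in mother for a in child)


print(is_putative_de_novo("0/0", "0/0", "0/1"))  # child allele 1 is new
print(is_putative_de_novo("0/1", "0/0", "0/1"))  # allele 1 inherited from father
```

Note that the check needs a genotype for all three samples at the site, which a jointly called VCF provides and a merge of single-sample VCFs often does not. It also covers only the simplest case; other Mendelian inconsistencies are not detected.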
written 20 months ago by Len Trigg

Just to make sure I got it right: should I merge the BAM files after the alignment-to-reference step using samtools mpileup?

modified 20 months ago • written 20 months ago by eurioste

No, samtools mpileup can call the variants directly from a set of BAM files. The output is one VCF file with multiple samples.

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]
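
Once that single multi-sample VCF exists, the tally the question asks for can be sketched in Python. This is an illustration under stated assumptions: `classify` and `count_variant_kinds` are hypothetical helpers, the classification is a simple length-based rule (single-base substitutions are SNPs, equal-length longer alleles are MNPs, everything else is an insertion or deletion), and symbolic or missing ALT alleles are lumped together as "other". The records are made up.

```python
from collections import Counter


def classify(ref, alt):
    """Classify one REF/ALT pair with a simple length-based rule."""
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"  # missing, spanning-deletion, symbolic or breakend allele
    if len(ref) == len(alt):
        return "snp" if len(ref) == 1 else "mnp"
    return "insertion" if len(alt) > len(ref) else "deletion"


def count_variant_kinds(vcf_lines):
    """Tally variant kinds over all ALT alleles of all records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            counts[classify(ref, alt)] += 1
    return counts


# Made-up records; sample columns are omitted because only REF/ALT are used.
records = [
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\tCTT,T\t50\tPASS\t.",
]
print(count_variant_kinds(records))
```

Multi-allelic records are counted once per ALT allele here; whether that is the right convention depends on what "total number of variants" should mean for the analysis.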
written 20 months ago by Pierre Lindenbaum