Question

Variant calling in large sample populations

0

5.2 years ago

robjohn70000 ▴ 150

Hi,

I would like to carry out variant calling from FASTQ files (whole genomes from several hundred samples) for genetic association studies. I have come across some pipelines, but I am not sure which one is best for what I want to do.

Can anyone with experience in batch variant calling suggest fast, reliable pipelines for this kind of work? Another question: since I only want to generate genotypes against the human reference genome for association studies, and assuming I use GATK, should I use HaplotypeCaller or MuTect for variant calling?

Any advice on batch runs for variant calling would also be welcome. Thanks.

sequencing genome sequence gatk • 2.2k views

ADD COMMENT • link updated 5.2 years ago by Pierre Lindenbaum 161k • written 5.2 years ago by robjohn70000 ▴ 150

0

Are you aware that hundreds of WGS samples will consume several tens of terabytes for raw data alone? Do you have the computational resources to handle these amounts of data and the respective CPU/memory to align and process them?
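As a rough back-of-the-envelope check on that estimate, the sketch below multiplies a sample count by per-sample file sizes. The per-sample sizes (60 GB of compressed FASTQ, 50 GB of BAM, 5 GB of GVCF) and the sample count of 300 are placeholder assumptions, not measurements; replace them with numbers from your own data.

```python
def estimate_storage_tb(n_samples, fastq_gb_per_sample, bam_gb_per_sample,
                        gvcf_gb_per_sample=0.0):
    """Return the total storage in terabytes (1 TB = 1000 GB)."""
    per_sample_gb = fastq_gb_per_sample + bam_gb_per_sample + gvcf_gb_per_sample
    return n_samples * per_sample_gb / 1000.0


# Placeholder assumptions: 300 samples, sizes in GB per sample.
N_SAMPLES = 300
raw_tb = estimate_storage_tb(N_SAMPLES, fastq_gb_per_sample=60, bam_gb_per_sample=0)
total_tb = estimate_storage_tb(N_SAMPLES, fastq_gb_per_sample=60,
                               bam_gb_per_sample=50, gvcf_gb_per_sample=5)

print(f"raw FASTQ only: {raw_tb:.1f} TB")
print(f"FASTQ + BAM + GVCF: {total_tb:.1f} TB")
```

With these placeholder numbers the raw data alone comes to 18 TB, which is why the available disk space needs checking before starting.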

ADD REPLY • link 5.2 years ago by ATpoint 82k

0

Thanks for raising these two potential problems, @ATpoint. We have a machine with 250 GB of RAM and a 5 TB hard drive. However, I wonder whether the work is still feasible with these resources.

ADD REPLY • link 5.1 years ago by robjohn70000 ▴ 150

0

5.2 years ago

agata88 ▴ 870

You can try to follow this workflow:

Quality trimming - with Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
Mapping reads to the human genome - with BWA (http://bio-bwa.sourceforge.net/)
Variant calling - with SAMtools mpileup (http://samtools.sourceforge.net/) or VarScan (http://varscan.sourceforge.net/)
Annotation of detected variants - with SnpEff (http://snpeff.sourceforge.net/)

Try to optimize the program parameters on one or two samples, and then run the workflow on the remaining samples.
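One way to keep parameters consistent across samples is to generate the commands from a single function. The sketch below only builds shell command strings and does not run anything. The file naming scheme (`<sample>_R1.fastq.gz` and so on), the `trimmomatic` wrapper name, the reference path `ref.fa`, and the trimming settings are all assumptions for illustration. It uses `bcftools mpileup` for the calling step, and the annotation step is left out because the SnpEff database name depends on the local installation.

```python
import shlex


def build_commands(sample, ref="ref.fa", threads=4, min_len=36):
    """Build trimming, mapping, indexing and calling commands for one sample."""
    q = shlex.quote
    # Hypothetical file naming scheme; adapt to your own layout.
    r1, r2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    p1, p2 = f"{sample}_R1.paired.fastq.gz", f"{sample}_R2.paired.fastq.gz"
    u1, u2 = f"{sample}_R1.unpaired.fastq.gz", f"{sample}_R2.unpaired.fastq.gz"
    bam, vcf = f"{sample}.bam", f"{sample}.vcf.gz"

    # Trimmomatic paired-end mode; example trimming settings only.
    trim = (f"trimmomatic PE -threads {threads} {q(r1)} {q(r2)} "
            f"{q(p1)} {q(u1)} {q(p2)} {q(u2)} SLIDINGWINDOW:4:20 MINLEN:{min_len}")
    # Read group with a literal backslash-t, which bwa expands to a tab.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    align = (f"bwa mem -t {threads} -R {q(read_group)} {q(ref)} {q(p1)} {q(p2)} "
             f"| samtools sort -@ {threads} -o {q(bam)} -")
    index = f"samtools index {q(bam)}"
    call = (f"bcftools mpileup -f {q(ref)} {q(bam)} "
            f"| bcftools call -mv -Oz -o {q(vcf)}")
    return [trim, align, index, call]


# Tune parameters on one or two samples first, then reuse them for the rest.
for sample in ["sample01", "sample02"]:  # hypothetical sample names
    for command in build_commands(sample, threads=8):
        print(command)
```

The printed lines can be written to a script or handed to a job scheduler once the parameters look right on the test samples.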

Best,

Agata

ADD COMMENT • link 5.2 years ago by agata88 ▴ 870

0

Please note that variant calling with samtools mpileup is now deprecated; this functionality has been moved to bcftools mpileup.

ADD REPLY • link 5.2 years ago by ATpoint 82k

0

Thanks for the workflow advice, @agata88. I will also take @ATpoint's point about samtools mpileup into account.

ADD REPLY • link 5.1 years ago by robjohn70000 ▴ 150

score 3 · Accepted Answer · 2019-03-01

3

5.2 years ago

Pierre Lindenbaum 161k

Use the GATK GVCF workflow: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf and https://software.broadinstitute.org/gatk/documentation/article.php?id=3893. You can create a *.g.vcf file for each sample and each chromosome in parallel, then call GenotypeGVCFs for each chromosome, and at the end merge the per-chromosome VCFs into the final VCF.
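The three stages can be sketched as a small command generator. This is an illustration under stated assumptions, not a tested pipeline: it uses the GATK 3.x command-line style, the jar path `GenomeAnalysisTK.jar`, the reference `ref.fa`, and the file naming scheme are hypothetical, and `bcftools concat` is swapped in as one possible tool for the final merge. GATK 4 uses a different command-line syntax, so check the documentation for the version you have installed.

```python
# Assumed location of the GATK 3.x jar; adjust to your installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def haplotypecaller_cmd(sample, chrom, ref="ref.fa"):
    """Stage 1: one GVCF per sample and per chromosome (can run in parallel)."""
    return GATK + ["-T", "HaplotypeCaller", "-R", ref, "-I", f"{sample}.bam",
                   "--emitRefConfidence", "GVCF", "-L", chrom,
                   "-o", f"{sample}.{chrom}.g.vcf"]


def genotype_cmd(samples, chrom, ref="ref.fa"):
    """Stage 2: joint genotyping of all samples for one chromosome."""
    cmd = GATK + ["-T", "GenotypeGVCFs", "-R", ref, "-L", chrom]
    for sample in samples:
        cmd += ["--variant", f"{sample}.{chrom}.g.vcf"]
    return cmd + ["-o", f"joint.{chrom}.vcf"]


def merge_cmd(chroms):
    """Stage 3: concatenate the per-chromosome VCFs, in chromosome order."""
    return (["bcftools", "concat", "-o", "joint.vcf"]
            + [f"joint.{chrom}.vcf" for chrom in chroms])


samples = ["sample01", "sample02"]  # hypothetical sample names
chroms = ["chr1", "chr2"]           # must match the contig names in ref.fa

for chrom in chroms:
    for sample in samples:
        print(" ".join(haplotypecaller_cmd(sample, chrom)))
for chrom in chroms:
    print(" ".join(genotype_cmd(samples, chrom)))
print(" ".join(merge_cmd(chroms)))
```

Each command is built as an argument list, so it can be passed directly to `subprocess.run` or joined into a line for a cluster job script.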

ADD COMMENT • link 5.2 years ago by Pierre Lindenbaum 161k

0

I really appreciate the info, @Pierre Lindenbaum. Do I need to chunk the files by chromosome at some stage? I am not really sure about this. Thanks.

ADD REPLY • link 5.1 years ago by robjohn70000 ▴ 150

0

No, there is a -L parameter that lets you restrict the analysis to a given region: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_engine_CommandLineGATK.php#--intervals
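Because -L accepts a region in `contig:start-end` form as well as a whole contig name, the input files can stay intact while the work is divided into regions. The sketch below generates such interval strings from a table of contig lengths. The contig names and lengths are toy values, and the window size is arbitrary; real lengths would come from the reference's .fai or .dict file.

```python
def make_intervals(contig_lengths, window):
    """Split each contig into 1-based, inclusive windows of at most `window` bp."""
    intervals = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            intervals.append(f"{contig}:{start}-{end}")
    return intervals


# Toy contig lengths for illustration only.
toy_lengths = {"chrA": 2500, "chrB": 1000}

for interval in make_intervals(toy_lengths, window=1000):
    print("-L", interval)
```

Each printed line is one value that could be passed to a separate job, so no BAM or FASTQ file has to be physically split.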

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 161k

0

Thanks @Pierre Lindenbaum

ADD REPLY • link 5.1 years ago by robjohn70000 ▴ 150

0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.