Question

Technical challenges inherent to WES processing - any recommended review sources?

0

4.4 years ago

asg • 0

Hello everyone!

I am a beginner in bioinformatics, working on getting to understand and build WES data analysis pipelines (we work with human DNA in a clinical setting). I am overwhelmed with a torrent of new terms and esoteric-looking tools.

I realized that before I start experimenting with building a pipeline, I must find a solid way to gauge the validity of whatever VCF data the pipeline would produce.

Unfortunately, my supervisor (not a bioinformatician himself) claims that bioinformaticians, quote, "tend to complicate things unnecessarily". He claims that variant calling from WES data "can't be difficult" since "manual BLAST jobs produce a great alignment for any long enough sequence". So, according to him, all that a WES pipeline should do is run algorithms like BLAST for every read, no problem.

Even from my brief exposure to the literature on WES and NGS in general, I get the feeling that it is not nearly this simple. So I need to study sources to, first, understand the complexities myself and, second, be able to articulate them to my boss.

Could anyone kindly point me to some at least remotely accessible literature on the technical challenges that WES pipelines typically solve?

Best wishes,

— Alex.

EDIT: Wording.

next-gen wes sequencing • 911 views

updated 4.4 years ago by Nicolas Rosewick 10k • written 4.4 years ago by asg • 0

0

Your supervisor is partially right: you can take one of the hundreds of pipelines for WES data processing and just run it. However, keep in mind that 1) it requires serious computational power, and I mean SERIOUS, and 2) it takes quite a long time to fully analyse one human WES sample.

There are many technical challenges, mainly caused by misalignments, but in general, if you take a ready-to-use solution from a respected source such as the Broad Institute, I'd say you can forget about them.

4.4 years ago by German.M.Demidov ★ 2.9k

1

We might have different definitions of serious computational power, but for WES any normal workstation will do if you have only a few samples, and any standard server node will do if you have more than that.

But I definitely agree with using available pipelines instead of reinventing the wheel. In the end you will need to verify the variants you want to focus on by independent experiments anyway.

By the way, you might tell your supervisor that bioinformatics suffers from the same challenges as the wet lab. Just because someone in the world does something routinely (after a lot of optimization and fine-tuning, and with the necessary experience) doesn't mean that it will work right away in your lab once you start setting it up.

4.4 years ago by ATpoint 82k

1

Yep, agreed. I was reacting to this part of the description: "my supervisor (not a bioinformatician himself) claims that bioinformaticians, quote, 'tend to complicate things unnecessarily'. He claims that variant calling from WES data 'can't be difficult'". I made the statement about serious power because I have seen discussions like this end with no computing budget at all being planned for a large computational project. For bioinformaticians with some money invested in servers or workstations, WES analysis is not such a big deal.

4.4 years ago by German.M.Demidov ★ 2.9k

0

Thank you for the encouragement! Yes, the analogy with “birth pangs” in a wet lab setting is helpful.

4.2 years ago by asg • 0

score 1 · Answer 1 · 2019-12-05

BLAST is indeed a good alignment tool, but it was not designed (at all) to handle millions of sequences (reads). It would take weeks to finish. You need to use tools designed specifically for NGS, e.g. BWA to align reads and GATK or samtools/bcftools to call variants.
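To put a rough number on "weeks", here is a back-of-the-envelope sketch in Python. Every constant in it is an illustrative assumption rather than a measurement: the ~50 Mb target size, the 100x mean coverage, the 150 bp read length, and especially the hypothetical 0.1 s cost per BLAST-style query. Plug in the values for your own kit and hardware.

```python
def estimate_read_count(target_bases, mean_coverage, read_length):
    """Approximate number of reads needed for a given mean coverage of a target."""
    return int(target_bases * mean_coverage / read_length)


def serial_runtime_days(n_reads, seconds_per_read):
    """Wall-clock days if every read is aligned one after another."""
    return n_reads * seconds_per_read / 86_400  # 86,400 seconds per day


# All values below are assumptions for illustration only.
TARGET_BASES = 50_000_000   # assumed size of the exome capture target
MEAN_COVERAGE = 100         # assumed mean depth over the target
READ_LENGTH = 150           # assumed read length in bases
SECONDS_PER_READ = 0.1      # hypothetical cost of one BLAST-style query

n_reads = estimate_read_count(TARGET_BASES, MEAN_COVERAGE, READ_LENGTH)
days = serial_runtime_days(n_reads, SECONDS_PER_READ)

print(f"{n_reads:,} reads, roughly {days:.0f} days if aligned one query at a time")
```

Even if the per-read cost assumed here is off by an order of magnitude, the read count alone (tens of millions per sample) shows why per-read BLAST jobs do not scale and why dedicated short-read aligners exist.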

You may start by looking at the GATK best practices (https://software.broadinstitute.org/gatk/best-practices/), which use:

BWA to align the reads
GATK MarkDuplicates and BQSR to pre-process the alignment file (BAM file)
GATK HaplotypeCaller to call variants on each individual sample (using the associated BAM file)
GATK GenomicsDBImport and GenotypeGVCFs to joint-genotype your whole cohort (using all the per-sample VCFs, in GVCF format, from the previous step)
GATK VQSR to filter variants

For WES you must provide the interval file of your WES kit (the genomic intervals targeted by the kit) at each GATK step (except MarkDuplicates).
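The steps above can be sketched as a small Python script that only builds and prints the command lines, so you can inspect them before running anything. This is a minimal sketch under assumptions, not the official workflow: all file names are hypothetical placeholders, the helper functions are mine, the reference is assumed to be already indexed (BWA index, .fai and .dict), and a coordinate sort with samtools is added between alignment and MarkDuplicates. VQSR is left out because it needs training resource files that depend on your setup.

```python
import shlex


def gatk(tool, *args, intervals=None):
    """Build one GATK4 command line; append -L when an interval file is given."""
    cmd = ["gatk", tool, *args]
    if intervals is not None:
        cmd += ["-L", intervals]
    return cmd


def build_sample_commands(sample, ref, fq1, fq2, known_sites, intervals):
    """Per-sample steps: align, sort, mark duplicates, BQSR, call a GVCF."""
    # Read group with literal "\t" separators, as bwa mem -R expects.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam, sorted_bam = f"{sample}.sam", f"{sample}.sorted.bam"
    dedup_bam, recal_bam = f"{sample}.dedup.bam", f"{sample}.recal.bam"
    recal_table, gvcf = f"{sample}.recal.table", f"{sample}.g.vcf.gz"
    return [
        ["bwa", "mem", "-R", read_group, "-o", sam, ref, fq1, fq2],
        ["samtools", "sort", "-o", sorted_bam, sam],
        # MarkDuplicates is the one GATK step that gets no interval file.
        gatk("MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
             "-M", f"{sample}.dup_metrics.txt", "--CREATE_INDEX", "true"),
        gatk("BaseRecalibrator", "-R", ref, "-I", dedup_bam,
             "--known-sites", known_sites, "-O", recal_table,
             intervals=intervals),
        gatk("ApplyBQSR", "-R", ref, "-I", dedup_bam,
             "--bqsr-recal-file", recal_table, "-O", recal_bam,
             intervals=intervals),
        gatk("HaplotypeCaller", "-R", ref, "-I", recal_bam,
             "-O", gvcf, "-ERC", "GVCF", intervals=intervals),
    ]


def build_cohort_commands(gvcfs, ref, intervals, workspace="cohort_db"):
    """Cohort steps: import all GVCFs, then joint-genotype them."""
    variant_args = [arg for g in gvcfs for arg in ("-V", g)]
    return [
        gatk("GenomicsDBImport", *variant_args,
             "--genomicsdb-workspace-path", workspace, intervals=intervals),
        gatk("GenotypeGVCFs", "-R", ref, "-V", f"gendb://{workspace}",
             "-O", "cohort.vcf.gz", intervals=intervals),
    ]


# Hypothetical inputs; replace with your own files.
sample_cmds = build_sample_commands(
    "sampleA", "ref.fa", "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
    "known_sites.vcf.gz", "kit_targets.interval_list")
cohort_cmds = build_cohort_commands(
    ["sampleA.g.vcf.gz", "sampleB.g.vcf.gz"], "ref.fa",
    "kit_targets.interval_list")

for cmd in sample_cmds + cohort_cmds:
    print(shlex.join(cmd))
```

Printing instead of executing is deliberate: you can paste the output into a shell script, or swap the final loop for `subprocess.run(cmd, check=True)` once the paths are real. In practice you would probably use a workflow manager rather than a hand-rolled script, but the order of the steps and the places where the interval file goes stay the same.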