How to validate the OVERALL quality of a VCF/BCF file (Data munging)
rightmirem ▴ 70 · 6.4 years ago

I've run samtools mpileup and bcftools call against some whole-genome data I was handed.

I'm looking to validate the data. I've been reading a lot about how to validate individual SNP calls (using QUAL, GQ, and so on).

One thing I'm still unclear on is how to validate the overall data quality, the software run itself, and the filters that were applied.

For example, I've been working from assumptions like:

  • If the BAM file, or even the VCF/BCF file, shows very low depth or a very broad/uneven depth distribution, the entire data file may be suspect and throwing it out should be considered.

  • If the INDEL lengths, QUAL, or GQ values do not follow a roughly normal (bell-curve) distribution, the data may be suspect.
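To illustrate what I mean by "looking at the distributions", this is roughly the kind of summary I pull out of the VCF (a minimal sketch in plain Python; it assumes the DP INFO tag and GQ FORMAT tag that bcftools writes, which may be named differently in other files):

    #!/usr/bin/env python3
    """Rough per-file summaries of DP, QUAL and GQ from a plain-text VCF(.gz)."""
    import gzip
    import statistics
    import sys

    def open_vcf(path):
        # Handle both plain and gzipped VCFs
        return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

    depths, quals, gqs = [], [], []
    with open_vcf(sys.argv[1]) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if fields[5] != ".":                      # QUAL column
                quals.append(float(fields[5]))
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            if "DP" in info:                          # site depth, if the caller wrote it
                depths.append(int(info["DP"]))
            if len(fields) > 9:                       # GQ of the first sample, if defined
                fmt = fields[8].split(":")
                if "GQ" in fmt:
                    gq = fields[9].split(":")[fmt.index("GQ")]
                    if gq not in (".", ""):
                        gqs.append(float(gq))

    for name, values in (("DP", depths), ("QUAL", quals), ("GQ", gqs)):
        if values:
            print(f"{name}: n={len(values)} mean={statistics.mean(values):.1f} "
                  f"median={statistics.median(values):.1f} min={min(values):g} max={max(values):g}")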

QUESTION:

  1. Are those assumptions (above) reasonable?

  2. What other benchmarks/indicators should I be using to validate that the entire run is of high quality (as opposed to validating individual SNP calls)?

Tags: snp • next-gen • validation • munging

Adding to Kevin's comment, an unbiased evaluation would require large-scale confirmation by a second sequencing approach, e.g. traditional Sanger sequencing. This is of course not feasible for most researchers, especially if variant calling is only one of many tasks in the project (and it is only possible at all if you created the data yourself, i.e. have the DNA in your freezer, rather than having downloaded it from a database). Therefore, you'll need to rely on the variant caller's recommended settings (and only change the defaults if you have expert knowledge), as these should be derived from exactly such extensive validation efforts. Are you working on human data?


Thanks for the reply. I'm working on getting some of the data Kevin recommended. To answer your questions:

  • This is data from another lab (we don't have the DNA in our freezer).
  • It is human data (whole genomes from 70 individuals).
  • To the best of my knowledge, the tools were run with the "standard settings"...although it is surprisingly hard to find out what the standard recommendations actually are.

I'll include some of this info in my reply to Kevin :)


Yes, because there is standardisation in neither bioinformatics nor NGS. They must mean 'standard' in terms of their own in-house laboratory settings.


This is, I'm afraid, one of those open-ended questions whereby you'll get 2 different answers if you ask 2 different people.

From just looking at the VCF, it's next-to-impossible to relate quality to the entire sequencing run because a VCF may or may not be heavily filtered along the way. If you're lucky, and standard tools were used, then at least all filtering applied to the 'raw' VCF will be recorded in the VCF header, but even a 'raw' VCF may have been generated from a heavily filtered BAM/SAM and thus conceal much information about the overall run and quality of data.
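So the first thing I would check is simply what the header records. A minimal sketch for pulling out the provenance-related header lines (the ##FILTER definitions plus the ##bcftools_*/##samtools* command lines that bcftools and samtools usually write; other pipelines record different keys):

    #!/usr/bin/env python3
    """Print FILTER definitions and any tool command lines recorded in a VCF header."""
    import gzip
    import sys

    path = sys.argv[1]
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as vcf:
        for line in vcf:
            if not line.startswith("##"):
                break                                  # end of the meta-information header
            # The prefixes below are what bcftools/samtools usually write; adjust for other tools.
            if line.startswith(("##FILTER", "##bcftools", "##samtools")):
                print(line.rstrip())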

For example, I've been working from assumptions like:

  • If the BAM file, or even the VCF/BCF file, shows very low depth or a very broad/uneven depth distribution, the entire data file may be suspect and throwing it out should be considered.
  • If the INDEL lengths, QUAL, or GQ values do not follow a roughly normal (bell-curve) distribution, the data may be suspect.

'Very broad curve' is vague, but I guess that you mean generally uneven depth of coverage? It would be incorrect to automatically assume that a sample was of poor quality just by looking at this. The depth of coverage profile can be influenced by one or more of the following:

  • target depth of coverage (obvious)
  • difficulty in priming due to high GC content
  • sequence similarity [to other regions of the genome]
  • outdated reagents
  • degraded DNA
  • delays in the wet-laboratory processing of the sample
  • et cetera.

Thus, many 'parameters' go into the depth of coverage 'equation', and some variation in depth of coverage is expected. You haven't elaborated on whether what you're observing amounts to extreme variation in the profile or not(?).
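If you want to look at the profile itself rather than a single summary number, one option (a minimal sketch, assuming per-base depths in the three-column chrom/pos/depth format that `samtools depth` prints) is to average depth in fixed-size bins and eyeball the result per chromosome:

    #!/usr/bin/env python3
    """Mean depth in fixed-size genomic bins, from 'samtools depth'-style input.

    Illustrative usage:  samtools depth -a sample.bam | python3 bin_depth.py
    """
    import sys
    from collections import defaultdict

    BIN = 50_000                                       # bin size in bp

    totals = defaultdict(int)                          # (chrom, bin index) -> summed depth
    for line in sys.stdin:
        chrom, pos, depth = line.split("\t")
        totals[(chrom, int(pos) // BIN)] += int(depth)

    for (chrom, b), total in sorted(totals.items()):
        print(f"{chrom}\t{b * BIN + 1}\t{(b + 1) * BIN}\t{total / BIN:.1f}")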

I'm not sure that the indel length profile should necessarily follow a bell curve, nor that the QUAL or GQ distributions should. A lot goes, indirectly, into the calculation of QUAL and GQ, and sometimes the assigned values don't even make sense.


The sense of 'quality' for a sample and run is more a human feeling that should come from looking at a whole host of parameters. In order to make an honest decision on whether a run failed or not, I would love to see:

Wet lab

  • DNA concentration

  • Gel electrophoresis of DNA

  • Length of time DNA was in transit

  • Date the reagent kit was produced (and expiry date)

Sequencing

  • Sequencer type and maintenance records

  • Target depth of coverage

  • Passed filter reads

  • % Q10 bases

  • % Q20 bases

  • % Q30 bases
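If the run reports aren't available, the Q-score fractions can at least be recomputed from the FASTQ; a minimal sketch, assuming standard Phred+33 quality encoding:

    #!/usr/bin/env python3
    """Fraction of bases at or above Q20/Q30 in a FASTQ(.gz), assuming Phred+33."""
    import gzip
    import sys

    path = sys.argv[1]
    opener = gzip.open if path.endswith(".gz") else open
    total = q20 = q30 = 0
    with opener(path, "rt") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 3:                             # the quality line of each 4-line record
                for ch in line.rstrip("\n"):
                    q = ord(ch) - 33
                    total += 1
                    q20 += q >= 20
                    q30 += q >= 30

    print(f"bases={total}  %Q20={100 * q20 / total:.1f}  %Q30={100 * q30 / total:.1f}")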

Bioinformatics

  • All programs and versions used to process the data, including the base-caller in the sequencer

  • Total reads

  • Min/Max/Mean/Median/Upper-/Lower-quartile read length

  • Any QC or trimming applied to reads

  • Alignment % to reference genome

  • Genome version used for alignment

  • Mate-pairs mapped together

  • Reads aligned to >1 location

  • Singletons / Lone mates
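Most of the alignment-level numbers come straight out of `samtools flagstat`; if you prefer to compute them yourself, here is a minimal sketch with pysam (an assumption on my part, and the accounting is illustrative rather than an exact flagstat replacement):

    #!/usr/bin/env python3
    """Rough alignment summary from a BAM, in the spirit of 'samtools flagstat'.

    Requires pysam; counts are illustrative, not a drop-in flagstat replacement.
    """
    import sys
    import pysam

    total = mapped = proper = secondary = supplementary = 0
    with pysam.AlignmentFile(sys.argv[1], "rb") as bam:
        for read in bam:                               # sequential pass over all records
            total += 1
            secondary += read.is_secondary
            supplementary += read.is_supplementary
            mapped += not read.is_unmapped
            proper += read.is_paired and read.is_proper_pair

    print(f"total records   : {total}")
    print(f"mapped          : {mapped} ({100 * mapped / total:.1f}%)")
    print(f"properly paired : {proper}")
    print(f"secondary       : {secondary}")
    print(f"supplementary   : {supplementary}")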

Bioinformatics coverage and other QC

  • Number of reads off target (targeted sequencing only)

  • Min/Max/Mean/Median/Upper-/Lower-quartile read-depth per chromosome and genome-wide

  • Plot of depth of coverage profile in bins (e.g. 50,000 bp) per chromosome and genome-wide

  • Bases with 0, <5, et cetera read-depth (and then summarised into regions that share the same read-depth at each level)

  • Overall % genome covered at read depth 1, 2, 3, 4, 5, 10, 18, 20, 30, et cetera.
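For the last two items, a minimal sketch that turns `samtools depth -a` output (one chrom/pos/depth line per base) into '% of positions covered at ≥ N×' figures:

    #!/usr/bin/env python3
    """Percent of reported positions covered at or above various depths.

    Illustrative usage:  samtools depth -a sample.bam | python3 breadth.py
    """
    import sys

    thresholds = (1, 2, 3, 4, 5, 10, 18, 20, 30)
    covered = {t: 0 for t in thresholds}
    positions = 0

    for line in sys.stdin:
        depth = int(line.rsplit("\t", 1)[1])
        positions += 1
        for t in thresholds:
            if depth >= t:
                covered[t] += 1

    for t in thresholds:
        print(f">= {t:2d}x : {100 * covered[t] / positions:.2f}% of {positions} positions")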

Variant calling (to produce a VCF)

  • MAPQ filters

  • MAPQ bias

  • Phred-scaled base quality filters

  • Base quality bias

  • Strand bias

  • Downsampling performed (to what level?)

  • Min number of variant bases required to make a variant call

  • Min allelic fraction at which to call heterozygous/homozygous variant

  • Read-end bias

  • Min total (ref+alt) read depth at which a variant is even reported in the VCF
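Whichever of those were actually applied should show up per record in the FILTER column (and often as INFO annotations), so a quick tally is a reasonable first check; a minimal sketch over a plain-text VCF:

    #!/usr/bin/env python3
    """Tally the FILTER column of a VCF to see which filters were actually applied."""
    import gzip
    import sys
    from collections import Counter

    path = sys.argv[1]
    opener = gzip.open if path.endswith(".gz") else open
    filters = Counter()
    with opener(path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            filters[line.split("\t")[6]] += 1          # FILTER is the 7th column

    for value, n in filters.most_common():
        print(f"{value}\t{n}")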


This is just off the top of my head; the full list would be far longer...

Kevin


Hi Kevin,

I appreciate your thoughtful answer, and I'm in the process of obtaining some of that requested information. I'll touch base again when I have it. I just wanted to let you know I appreciated the comments.

Best! Mike
