Question: How to validate the OVERALL quality of a VCF/BCF file (Data munging)
rightmirem wrote (2.3 years ago):

I've run samtools mpileup and bcftools call against some whole-genome data I was handed.

I'm looking to validate the data. I've been reading a lot on how to validate individual SNP calls (using QUAL, GQ, etc.).

One thing I'm unclear on is how to validate that the overall data quality, software run, and filters are really valid.

For example, I've been working from assumptions like:

  • If the BAM file or even the VCF/BCF files have a very low, or a very broad curve for the DEPTH...the entire data file may be invalid and throwing it out should be considered.

  • If the INDEL lengths, QUAL, or GQ do not fall in a relatively normal bell curve, the data may be invalid.

QUESTION:

  1. Are those assumptions (above) useful?

  2. What other benchmarks/indicators should I be using to validate that the entire run is of high quality (versus validating individual SNP quality)?

validation munging snp next-gen • 902 views
modified 2.3 years ago by Kevin Blighe • written 2.3 years ago by rightmirem

Adding to Kevin's comment: an unbiased evaluation would require large-scale confirmation by a second sequencing approach, e.g. traditional Sanger. That is of course not feasible for most researchers, especially if variant calling is only one of many tasks in the project (and it is only possible if you created the data yourself, i.e. have the DNA in your freezer, rather than having downloaded it from a database). Therefore, you'll need to rely on the variant caller's recommended settings (and only change the defaults if you have expert knowledge), as these should be derived from exactly these extensive validation efforts. Are you working on human data?

modified 2.3 years ago • written 2.3 years ago by ATpoint

Thanks for the reply. I'm working on getting some of the data Kevin recommended. To answer your questions:

  • This is data from another lab (we don't have the DNA in our freezer).

  • It is human data (whole genomes of 70 individuals).

  • The tools were run (to the best of my knowledge) using the "standard settings", although it seems a bit hard to find out what the standard recommendations even are.

I'll include some of this info in my reply to Kevin :)

written 2.3 years ago by rightmirem

Yes, because there is standardisation in neither bioinformatics nor NGS. They must mean 'standard' in terms of their own in-house laboratory settings.

modified 2.3 years ago • written 2.3 years ago by Kevin Blighe
Kevin Blighe wrote (2.3 years ago):

This is, I'm afraid, one of those open-ended questions whereby you'll get 2 different answers if you ask 2 different people.

From just looking at the VCF, it's next-to-impossible to relate quality to the entire sequencing run because a VCF may or may not be heavily filtered along the way. If you're lucky, and standard tools were used, then at least all filtering applied to the 'raw' VCF will be recorded in the VCF header, but even a 'raw' VCF may have been generated from a heavily filtered BAM/SAM and thus conceal much information about the overall run and quality of data.
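
As a rough illustration of checking what the header records, here is a minimal Python sketch that pulls FILTER definitions and any tool version/command lines out of a VCF header. It assumes the tools wrote meta lines whose keys contain "Version" or "Command"; the exact key names vary by tool, and the header lines in the example are made up for illustration.

```python
import re

def header_provenance(vcf_lines):
    """Collect FILTER IDs and tool version/command records from a VCF header."""
    filters, tools = [], []
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta lines end at the #CHROM column header
        key, _, value = line[2:].rstrip("\n").partition("=")
        if key == "FILTER":
            match = re.search(r"ID=([^,>]+)", value)
            if match:
                filters.append(match.group(1))
        elif re.search(r"(command|version)", key, re.IGNORECASE):
            # Assumption: provenance keys contain "Command" or "Version".
            tools.append((key, value))
    return filters, tools

# Hypothetical header, for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    '##FILTER=<ID=PASS,Description="All filters passed">\n',
    '##FILTER=<ID=LowQual,Description="Hypothetical low-quality filter">\n',
    "##exampleToolVersion=0.0.0\n",
    "##exampleToolCommand=exampletool call --hypothetical-flag in.bam\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
filters, tools = header_provenance(example)
print(filters)
print(tools)
```

An empty result does not prove that no filtering happened; it only means nothing was recorded in the header.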

For example, I've been working from assumption like:

  • If the BAM file or even the VCF/BCF files have a very low, or a very broad curve for the DEPTH...the entire data file may be invalid and throwing it out should be considered.
  • If the INDEL lengths, QUAL, or GQ do not fall in a relatively normal bell curve, the data may be invalid.

'Very broad curve' is vague but I guess that you mean generally uneven depth of coverage? It would be incorrect to automatically assume that a sample was poor quality by just looking at this. The depth of coverage profile can be influenced by one or more of the following:

  • target depth of coverage (obvious)
  • difficulty in priming due to high GC content
  • sequence similarity [to other regions of the genome]
  • outdated reagents
  • degraded DNA
  • delays in the wet-laboratory processing of the sample
  • et cetera.

Thus, there are many 'parameters' that go into the depth of coverage 'equation', and I believe that variations in depth of coverage are expected. You haven't said whether the variation you're observing in the profile is extreme or not.

I'm not sure that the indel length profile should necessarily follow a bell curve, nor that QUAL or GQ should. A lot goes indirectly into the calculation of QUAL and GQ, and sometimes the assigned values don't even make sense.
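
Rather than assuming a particular shape, one can simply look at the spread of these values. The Python sketch below (standard library only) collects QUAL and the INFO DP value from uncompressed VCF text and prints a five-number summary. It assumes the caller wrote a DP tag in INFO, and the records are invented for illustration.

```python
import statistics

def qual_and_depth(vcf_lines):
    """Return lists of QUAL and INFO/DP values from VCF data lines."""
    quals, depths = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if fields[5] != ".":  # QUAL may be missing
            quals.append(float(fields[5]))
        for entry in fields[7].split(";"):
            if entry.startswith("DP="):  # assumption: DP is present in INFO
                depths.append(int(entry[3:]))
    return quals, depths

def five_number_summary(values):
    """Min, quartiles and max; needs at least two values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

# Made-up records, for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t10\t.\tDP=12\n",
    "chr1\t200\t.\tC\tT\t20\t.\tDP=25\n",
    "chr1\t300\t.\tG\tA\t30\t.\tDP=31\n",
    "chr1\t400\t.\tT\tC\t40\t.\tDP=8\n",
    "chr1\t500\t.\tA\tC\t50\t.\tDP=40\n",
    "chr1\t600\t.\tT\tC\t.\t.\tMQ=60\n",
]
quals, depths = qual_and_depth(records)
print(five_number_summary(quals))
print(five_number_summary(depths))
```

A summary like this describes the distribution; it does not by itself say whether the run was good or bad.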


The sense of 'quality' for a sample and run is more a human feeling that should come from looking at a whole host of parameters. In order to make an honest decision on whether a run failed or not, I would love to see:

Wet lab

  • DNA concentration

  • Gel electrophoresis of DNA

  • Length of time DNA was in transit

  • Date the reagent kit was produced (and expiry date)

Sequencing

  • Sequencer type and maintenance records

  • Target depth of coverage

  • Passed filter reads

  • % Q10 bases

  • % Q20 bases

  • % Q30 bases
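
If only the FASTQ files are available, the % Q10/Q20/Q30 figures can be approximated from the reads themselves. This is a minimal Python sketch, assuming four-line FASTQ records and Phred+33 quality encoding; the two reads are made up.

```python
def fastq_qualities(lines):
    """Yield the quality string of each record (assumes 4-line FASTQ records)."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")

def percent_at_or_above(quality_strings, thresholds=(10, 20, 30), offset=33):
    """Percentage of bases with Phred score >= each threshold."""
    counts = {t: 0 for t in thresholds}
    total = 0
    for qual in quality_strings:
        for char in qual:
            score = ord(char) - offset  # assumption: Phred+33 encoding
            total += 1
            for t in thresholds:
                if score >= t:
                    counts[t] += 1
    if total == 0:
        return {}
    return {t: 100.0 * counts[t] / total for t in thresholds}

# Two made-up reads: 'I' = Q40, '5' = Q20, '+' = Q10, '#' = Q2.
fastq = [
    "@read1\n", "ACGT\n", "+\n", "II5+\n",
    "@read2\n", "ACGT\n", "+\n", "I5+#\n",
]
percentages = percent_at_or_above(fastq_qualities(fastq))
print(percentages)  # {10: 87.5, 20: 62.5, 30: 37.5}
```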

Bioinformatics

  • All programs and versions used to process the data, including the base-caller in the sequencer

  • Total reads

  • Min/Max/Mean/Median/Upper-/Lower-quartile read length

  • Any QC or trimming applied to reads

  • Alignment % to reference genome

  • Genome version used for alignment

  • Mate-pairs mapped together

  • Reads aligned to >1 location

  • Singletons / Lone mates
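
Several of the alignment figures above can be tallied from the FLAG column of SAM text. The Python sketch below uses the standard SAM flag bits; it treats a singleton as a mapped read whose mate is unmapped, and counts secondary records only as a rough, aligner-dependent proxy for reads aligned to more than one location. The example records are hypothetical.

```python
from collections import Counter

# Standard SAM FLAG bits.
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8
SECONDARY, SUPPLEMENTARY = 0x100, 0x800

def tally_flags(sam_lines):
    """Count primary, mapped, properly paired, singleton and secondary records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        flag = int(line.split("\t")[1])
        if flag & SECONDARY:
            counts["secondary"] += 1
            continue
        if flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
            continue
        counts["primary"] += 1
        if not flag & UNMAPPED:
            counts["mapped"] += 1
            if flag & PAIRED and flag & PROPER_PAIR:
                counts["properly_paired"] += 1
            if flag & PAIRED and flag & MATE_UNMAPPED:
                counts["singletons"] += 1
    return counts

def fake_record(name, flag):
    """Build a placeholder SAM line; only the FLAG field matters here."""
    return "\t".join([name, str(flag), "chr1", "100", "60", "4M",
                      "=", "200", "0", "ACGT", "IIII"]) + "\n"

# Hypothetical records: 99/147 proper pair, 73 singleton,
# 133 unmapped mate, 355 secondary alignment.
sam = ["@HD\tVN:1.6\n"] + [
    fake_record("r%d" % i, f) for i, f in enumerate([99, 147, 73, 133, 355])
]
counts = tally_flags(sam)
print(dict(counts))
print(100.0 * counts["mapped"] / counts["primary"])  # alignment %
```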

Bioinformatics coverage and other QC

  • Number of reads off target (targeted sequencing only)

  • Min/Max/Mean/Median/Upper-/Lower-quartile read-depth per chromosome and genome-wide

  • Plot of depth of coverage profile in bins (e.g. 50,000 bp) per chromosome and genome-wide

  • Bases with 0, <5, et cetera read-depth (and then summarised into regions that have same read-depth at each level)

  • Overall % genome covered at read depth 1, 2, 3, 4, 5, 10, 18, 20, 30, et cetera.
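
The coverage items above could be computed along these lines. This Python sketch assumes per-base depth as three tab-separated columns (chromosome, 1-based position, depth), such as `samtools depth -a` produces; without zero-depth positions in the input, the percentages are relative to the reported positions only. The bin size and thresholds are parameters, and the tiny input is invented.

```python
from collections import defaultdict

def coverage_summary(depth_lines, thresholds=(1, 5, 10, 20, 30), bin_size=50_000):
    """Return (% of positions at >= each depth, mean depth per (chrom, bin))."""
    total = 0
    at_least = {t: 0 for t in thresholds}
    bin_sum = defaultdict(int)
    bin_count = defaultdict(int)
    for line in depth_lines:
        chrom, pos, depth = line.rstrip("\n").split("\t")[:3]
        pos, depth = int(pos), int(depth)
        total += 1
        for t in thresholds:
            if depth >= t:
                at_least[t] += 1
        key = (chrom, (pos - 1) // bin_size)  # 0-based bin index
        bin_sum[key] += depth
        bin_count[key] += 1
    if total == 0:
        return {}, {}
    percent = {t: 100.0 * n / total for t, n in at_least.items()}
    bin_means = {key: bin_sum[key] / bin_count[key] for key in bin_sum}
    return percent, bin_means

# Made-up depths; bin_size=2 only to keep the example small.
depth_lines = ["chr1\t1\t0\n", "chr1\t2\t4\n", "chr1\t3\t10\n", "chr1\t4\t30\n"]
percent, bin_means = coverage_summary(depth_lines, bin_size=2)
print(percent)
print(bin_means)
```

The per-bin means are what one would plot to inspect the depth of coverage profile along each chromosome.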

Variant calling (to produce a VCF)

  • MAPQ filters

  • MAPQ bias

  • Phred-scaled base quality filters

  • Base quality bias

  • Strand bias

  • Downsampling performed (to what level?)

  • Min number of variant bases required to make a variant call

  • Min allelic fraction at which to call heterozygous/homozygous variant

  • Read-end bias

  • Min total (ref+alt) read depth at which a variant is even reported in the VCF
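
To get a feel for read depth, allelic fraction and strand balance per call, something like the following Python sketch could be used. It assumes the caller wrote a DP4 INFO tag (ref-forward, ref-reverse, alt-forward, alt-reverse counts), as samtools/bcftools mpileup output commonly does. The thresholds are arbitrary illustrations, not recommendations, and the INFO strings are made up.

```python
def parse_dp4(info):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev) from an INFO string, or None."""
    for entry in info.split(";"):
        if entry.startswith("DP4="):
            return tuple(int(x) for x in entry[4:].split(","))
    return None

def describe_call(info, min_depth=10, min_alt_fraction=0.2):
    """Summarise depth, alt fraction and alt strand balance for one record.

    min_depth and min_alt_fraction are hypothetical example thresholds.
    """
    dp4 = parse_dp4(info)
    if dp4 is None:
        return None
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    alt = alt_fwd + alt_rev
    depth = ref_fwd + ref_rev + alt
    alt_fraction = alt / depth if depth else 0.0
    return {
        "depth": depth,
        "alt_fraction": alt_fraction,
        # 0.0 or 1.0 means every alt read came from a single strand.
        "alt_forward_fraction": alt_fwd / alt if alt else None,
        "low_depth": depth < min_depth,
        "low_alt_fraction": depth > 0 and alt_fraction < min_alt_fraction,
    }

# Made-up INFO fields, for illustration only.
balanced = describe_call("DP=30;DP4=10,10,5,5;MQ=60")
one_strand = describe_call("DP=8;DP4=6,0,2,0")
print(balanced)
print(one_strand)
```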


This is just off the top of my head. The list is by no means exhaustive...

Kevin

modified 2.3 years ago • written 2.3 years ago by Kevin Blighe

Hi Kevin,

I appreciate your thoughtful answer, and I'm in the process of obtaining some of that requested information. I'll touch base again when I have it. I just wanted to let you know I appreciated the comments.

Best! Mike

written 2.3 years ago by rightmirem