Question: Filters/Quality For Vcf Generated By Freebayes
gravatar for Adrian Pelin
6.8 years ago by
Adrian Pelin2.4k
Adrian Pelin2.4k wrote:


I was reading a few papers citing FreeBayes because I am curious as to what people treat as real variants, and what they do not treat as such.

I myself use illumina and am interested in finding out: A) Is it best to trim your raw reads on the edges for bad quality? B) Remove reads below a certain overall quality? C) Remove PCR duplicates? D) Normally when variants are called, coverage also matters. What are the coverage thresholds (minimum and maximum) to use when determining if a call is real or artefactual? E) Lastly, what is the quality threshold one should consider when treating calls? Where do we draw that line where below it the call is an artifact and above it, it's potentially real?

My own thoughts: For A and B, I am usually against trimming and filtering out RAW reads, because I am afraid of the bias it might introduce. As far as removing PCR duplicates, I think that's a good idea, but it is tricky when dealing with paired end data, because both reads would need to be identical for it to be a PCR duplicate. These guys (1) seemed to remove PCR duplicated, filter and trim. These other guys (2) only trimmed Ns and adaptors, and fitlered only reads with a length smaller than 50, which I think is fine. The authors of the third paper (3) did not preprocess their data in any way.

For D, most people I see are using a minimum DP of 10 or 5 (1), which I am fine with, I think 10 is solid. But what about maximum DP? Either we are sequencing a population and we want to see allele frequency or we are sequencing an individual and want to see heterozygous sites, in any case we should be cautious about gene duplications, so a coverage too high will indicate multiple copies of the gene in the genome, making variant analysis tricky. I am interested in heterozygosity, so what I intend to do is take the average DP for all alleles and setting the upper DP threshold at 1.5x that amount. Opinions on this?

For E I see (1) and (2) used 80 and 100 as the magic line. Here I am really not sure, is 100 stringent enough, or do we need more?

Thank you for your time, Adrian

References: (1) (2) (3)

vcf quality • 15k views
ADD COMMENTlink modified 6.7 years ago • written 6.8 years ago by Adrian Pelin2.4k

A&B) Bad quality will affect the mapping and also calling, personally, I will perform trimming and filtering if the read quality is bad (or request for re-sequencing from the company)

C) Remove PCR duplicats are important About paired-end sequencing

D) Many people use 250, but I am not really sure about that.

ADD REPLYlink modified 8 months ago by RamRS28k • written 6.8 years ago by Sam3.0k

Have you looked into using Variant Quality Score Recalibration (VQSR)?

ADD REPLYlink written 6.8 years ago by Sean Davis26k

Just looked into it. Very nice idea, many criteria by which to filter, strand bias, alternate allele position... but it needs a prior, needs a training/truth data set. There would be no place where I could find that unfortunately.

The only thing I have is 6 populations, so some sites will be common among them. Maybe try and use that as a training/truth set.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Adrian Pelin2.4k

With what species are you working, just out of curiosity?

ADD REPLYlink written 6.8 years ago by Sean Davis26k
gravatar for Erik Garrison
6.7 years ago by
Erik Garrison2.3k
Napoli, IT / UCSC
Erik Garrison2.3k wrote:

minimal pipeline and filtering

You should get pretty far simply by marking duplicates. You may not need to do this if you didn't use PCR amplification in your library preparation, but it shouldn't hurt.

Brad Chapman has worked up a "minimal" pipeline, and he shows that it works basically as well as the much more compute-intensive GATK best-practices

He's filtering using DP and QUAL. See the comments in the post and referenced code for examples. In my experience, these should be sufficient. However, if you have some sense of what's true and false from an orthogonal method, you may get better results with a careful tuning of parameters. The allele balance (reported under AB) and strand bias counts (SRF, SRR, SAF, SAR) and bias estimate (SAP) can be used as well, as they are strongly correlated with certain classes of error.

The QUAL estimate provides the phred-scaled probability that the locus is not polymorphic provided the data and the model. This is reasonably-well calibrated, so you can specify that you want things where we expect error rates of no more than 1/100 (QUAL > 20) or 1/1000 (QUAL > 30).


I would be cautious about aggressively trimming low-quality bases from the ends of the reads. Provided the base quality values represent likely errors, you should do just as well leaving them in the analysis.

Moreover, detecting variants on haplotypes resolves many of the problems with using these lower-quality regions of the reads. Although true sequence variants cluster, errors cluster even more strongly. You can see this if you compare the variance to mean ratio (VMR) of distances between true variants (e.g. 1000G variants) to the distribution of distances of any mismatch between the reference and reads in a set of reads.

Errors being somewhat independent, we should expect that the error-ful ends of reads will generate haplotype observations that do not recur and thus rarely propagate up to the level of detection. We can observe this effect here: In this figure, we've run the same 191 1000G samples with different --haplotype-length settings for freebayes, and measured the indel length-frequency spectrum to observe the rate of a particular 1bp insertion artifact caused by bubbles in flowcells during the Illumina sequencing process. At smaller window sizes, the chance of recurrence of this artifact in multiple reads is high enough that we detect more 1bp insertions than deletions (we expect slightly fewer given results from de novo assembly). As we increase the window size, the artifacts cluster with nearby errors, generating unique haplotype observations. This improves the signal-to-noise ratio.

In summary, I wouldn't trim unless you observe serious problems with the last cycles in your run, and these generate lots of false calls.

removing low-quality reads and bases

Again, I wouldn't remove reads with low mapping quality, or observations with "low" base quality unless you observe very strong systematic problems with your sequencing data. The basic idea is that the method benefits from additional information. Even low-quality information is helpful.

In the pipelines I'm running now, for Illumina data I typically provide the following to freebayes:

--min-base-quality 3
--min-mapping-quality 1

This is because (as I understand) base qualities below 3 act as flags rather than true qualities. Also, mapping quality 0 may mean "multiply mapped." I noticed these settings helped our results slightly.


Your specific experiment may require high sensitivity or high specificity. To that extent, you should choose appropriate filter settings based on your understanding of the problem. Hopefully this helps shed some light on your options.

ADD COMMENTlink written 6.7 years ago by Erik Garrison2.3k

Thank you, this is very helpful.

Since we have 6 samples, variants that are shared at least between 2 samples, I would not filter them out based on anything, because you would not expect the same error to appear in the same locus in 2 independent samples. What are your thoughts?

I could expand my analysis by including an addition 7th sample that was sequenced by 454 in low coverage, about 30x. Since this is not illumina like my 6 other samples, using your mentioned parameters should help.

For PCR duplicates, in some samples it found 20% duplicates... so was not a bad idea to run that.

I am going to take your advice and avoid trimming/fitlering reads, since FastQC analysis does not show any systematic bias.

ADD REPLYlink written 6.7 years ago by Adrian Pelin2.4k

I am a bit confused about the way to filter PCR duplicates. You seem to advice filtering by QUAL but not with the DUP marker. Similarly, in the Freebayes readme page, on paragraph "Calling variants: from fastq to VCF", the tutorial advice to mark duplicates, then to run Freebayes directly.

Is it really the case? Is Freebayes taking in account the marked duplicates? Or should we filter clonality (remove duplicates from the BAM file)?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Begonia_pavonina20

Ah, found the answer to my question:

Reads marked as duplicates in the BAM file are ignored, but this can be disabled for testing purposes by providing --use-duplicate-reads.
ADD REPLYlink written 4 weeks ago by Begonia_pavonina20
gravatar for Adrian Pelin
6.7 years ago by
Adrian Pelin2.4k
Adrian Pelin2.4k wrote:

When mapping paired-end data to a genome, would anyone consider only mapping one read if the paired mate as well maps? In other words, Read/1 can only map if Read/2 also maps within the required distance?

ADD COMMENTlink written 6.7 years ago by Adrian Pelin2.4k

Most read mappers do something like this, but rather than not aligning anything, the aligner will typically use flags to let the user know the status of the mapping pair.

ADD REPLYlink written 6.7 years ago by Sean Davis26k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 668 users visited in the last hour