Filters/Quality for VCF Generated by FreeBayes
7.5 years ago
Adrian Pelin ★ 2.4k

Hello,

I was reading a few papers citing FreeBayes because I am curious as to what people treat as real variants, and what they do not treat as such.

I myself use Illumina and am interested in finding out:
A) Is it best to trim your raw reads at the edges for bad quality?
B) Should reads below a certain overall quality be removed?
C) Should PCR duplicates be removed?
D) Normally when variants are called, coverage also matters. What coverage thresholds (minimum and maximum) should be used when determining whether a call is real or artefactual?
E) Lastly, what quality threshold should one consider when filtering calls? Where do we draw the line below which a call is an artifact and above which it is potentially real?

My own thoughts: for A and B, I am usually against trimming and filtering raw reads, because I am afraid of the bias it might introduce. As for removing PCR duplicates, I think that's a good idea, but it is tricky when dealing with paired-end data, because both reads would need to be identical for a pair to count as a PCR duplicate. The authors of (1) seemed to remove PCR duplicates, filter and trim. The authors of (2) only trimmed Ns and adaptors, and filtered only reads shorter than 50 bp, which I think is fine. The authors of the third paper (3) did not preprocess their data in any way.

For D, most people I see are using a minimum DP of 10 or 5 (1), which I am fine with; I think 10 is solid. But what about maximum DP? Whether we are sequencing a population and want to see allele frequencies, or sequencing an individual and want to see heterozygous sites, in either case we should be cautious about gene duplications: coverage that is too high can indicate multiple copies of the gene in the genome, making variant analysis tricky. I am interested in heterozygosity, so what I intend to do is take the average DP for all alleles and set the upper DP threshold at 1.5x that amount. Opinions on this?
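The upper-threshold heuristic described above (minimum DP plus a cap at 1.5x the mean) can be sketched as follows. This is a simplified stand-in, not a real VCF parser: the record tuples, file-free input, and threshold values are illustrative assumptions, and a real pipeline would read the DP field from an actual VCF.

```python
# Sketch of the DP window heuristic: compute the mean DP across all called
# sites, then keep only sites with min_dp <= DP <= 1.5 * mean.
# INFO strings like "DP=23;AF=0.5" mimic the VCF INFO column.

def site_depth(info_field):
    """Extract the integer DP value from a VCF-style INFO string."""
    for entry in info_field.split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return None

def filter_by_depth(records, min_dp=10, max_dp_factor=1.5):
    """records: list of (chrom, pos, info) tuples. Keep sites inside the DP window."""
    depths = [d for _, _, info in records if (d := site_depth(info)) is not None]
    max_dp = max_dp_factor * (sum(depths) / len(depths))
    return [(c, p, i) for (c, p, i) in records
            if (d := site_depth(i)) is not None and min_dp <= d <= max_dp]

sites = [("chr1", 100, "DP=12;AF=0.5"),
         ("chr1", 200, "DP=4;AF=0.5"),    # below the minimum DP
         ("chr1", 300, "DP=60;AF=0.5")]   # far above the mean: possible duplication
print(filter_by_depth(sites))  # only chr1:100 survives
```

The mean DP here is (12+4+60)/3 ≈ 25.3, so the cap is ≈ 38: the shallow site fails the minimum and the deep site fails the cap.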

For E, I see that (1) and (2) used 80 and 100 as the magic line. Here I am really not sure: is 100 stringent enough, or do we need more?

Thank you for your time, Adrian

References:
(1) http://www.biomedcentral.com/1471-2164/14/41
(2) http://www.biomedcentral.com/1471-2164/14/64
(3) http://www.biomedcentral.com/1471-2164/14/467


A&B) Bad quality will affect both mapping and calling. Personally, I perform trimming and filtering if the read quality is bad (or request re-sequencing from the company).

C) Removing PCR duplicates is important, including for paired-end sequencing.

D) Many people use a maximum DP of 250, but I am not really sure about that.


Have you looked into using Variant Quality Score Recalibration (VQSR)?


Just looked into it. Very nice idea, with many criteria by which to filter: strand bias, alternate allele position... but it needs a prior, i.e. a training/truth data set. Unfortunately, there is nowhere I could find one.

The only thing I have is 6 populations, so some sites will be common among them. Maybe try and use that as a training/truth set.


With what species are you working, just out of curiosity?

7.4 years ago
Erik Garrison ★ 2.3k

minimal pipeline and filtering

You should get pretty far simply by marking duplicates. You may not need to do this if you didn't use PCR amplification in your library preparation, but it shouldn't hurt.

Brad Chapman has worked up a "minimal" pipeline, and he shows that it works basically as well as the much more compute-intensive GATK best-practices http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/.

He's filtering using DP and QUAL. See the comments in the post and referenced code for examples. In my experience, these should be sufficient. However, if you have some sense of what's true and false from an orthogonal method, you may get better results with a careful tuning of parameters. The allele balance (reported under AB) and strand bias counts (SRF, SRR, SAF, SAR) and bias estimate (SAP) can be used as well, as they are strongly correlated with certain classes of error.

The QUAL estimate provides the phred-scaled probability that the locus is not polymorphic given the data and the model. This is reasonably well calibrated, so you can specify that you want calls with an expected error rate of no more than 1/100 (QUAL > 20) or 1/1000 (QUAL > 30).
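The phred scaling above is a simple log transform, so any QUAL cutoff maps directly to a tolerated error rate. A quick sketch of the conversion (the function names are mine, not from FreeBayes):

```python
# QUAL = -10 * log10(P(no variant | data, model)), so a QUAL threshold
# corresponds to a bound on the per-call error probability.
import math

def qual_to_error_prob(qual):
    """Probability that the site is NOT polymorphic, from phred-scaled QUAL."""
    return 10 ** (-qual / 10)

def error_prob_to_qual(p):
    """Inverse transform: error probability back to a phred-scaled QUAL."""
    return -10 * math.log10(p)

print(qual_to_error_prob(20))    # 0.01  -> at most 1 error in 100 calls
print(qual_to_error_prob(30))    # 0.001 -> at most 1 error in 1000 calls
print(error_prob_to_qual(0.01))  # 20.0
```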

trimming

I would be cautious about aggressively trimming low-quality bases from the ends of the reads. Provided the base quality values represent likely errors, you should do just as well leaving them in the analysis.

Moreover, detecting variants on haplotypes resolves many of the problems with using these lower-quality regions of the reads. Although true sequence variants cluster, errors cluster even more strongly. You can see this if you compare the variance to mean ratio (VMR) of distances between true variants (e.g. 1000G variants) to the distribution of distances of any mismatch between the reference and reads in a set of reads.
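The VMR comparison described above can be sketched like this; the position lists are made-up illustrative numbers, not real 1000G data. Roughly Poisson (independent) spacing gives a VMR near or below 1, while clustered mismatches push it well above 1:

```python
# Variance-to-mean ratio (VMR) of the distances between consecutive
# positions. Independent events give near-Poisson spacing (VMR ~ 1);
# clustered errors inflate the variance and drive the VMR up.

def vmr_of_distances(positions):
    positions = sorted(positions)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return var / mean

# Evenly spread "true variant" positions vs. tightly clustered "error" positions:
spread = [100, 210, 305, 412, 508, 615]
clustered = [100, 102, 104, 500, 502, 504]
print(vmr_of_distances(spread))              # small: near-even spacing
print(vmr_of_distances(clustered))           # large: two tight clusters
print(vmr_of_distances(spread) < vmr_of_distances(clustered))  # True
```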

Errors being somewhat independent, we should expect that the error-ful ends of reads will generate haplotype observations that do not recur and thus rarely propagate up to the level of detection. We can observe this effect here: http://clavius.bc.edu/~erik/freebayes/figures/indel_error.png. In this figure, we've run the same 191 1000G samples with different --haplotype-length settings for freebayes, and measured the indel length-frequency spectrum to observe the rate of a particular 1bp insertion artifact caused by bubbles in flowcells during the Illumina sequencing process. At smaller window sizes, the chance of recurrence of this artifact in multiple reads is high enough that we detect more 1bp insertions than deletions (we expect slightly fewer given results from de novo assembly). As we increase the window size, the artifacts cluster with nearby errors, generating unique haplotype observations. This improves the signal-to-noise ratio.

In summary, I wouldn't trim unless you observe serious problems with the last cycles in your run, and these generate lots of false calls.

removing low-quality reads and bases

Again, I wouldn't remove reads with low mapping quality, or observations with "low" base quality unless you observe very strong systematic problems with your sequencing data. The basic idea is that the method benefits from additional information. Even low-quality information is helpful.

In the pipelines I'm running now, for Illumina data I typically provide the following to freebayes:

--min-base-quality 3
--min-mapping-quality 1

This is because (as I understand) base qualities below 3 act as flags rather than true qualities. Also, mapping quality 0 may mean "multiply mapped." I noticed these settings helped our results slightly.
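For reference, a minimal invocation using these settings might look like the sketch below. The file names (`ref.fa`, `aln.bam`, `raw_calls.vcf`) are placeholders, not from the thread; only the two quality flags come from the answer above.

```shell
# Call variants with freebayes, keeping near-zero-quality observations out
# but otherwise using as much of the data as possible.
freebayes \
  --fasta-reference ref.fa \
  --min-base-quality 3 \
  --min-mapping-quality 1 \
  aln.bam > raw_calls.vcf
```

Downstream filtering on QUAL and DP (as in Brad Chapman's minimal pipeline) is then applied to `raw_calls.vcf` rather than baked into the calling step.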

caveats

Your specific experiment may require high sensitivity or high specificity. To that extent, you should choose appropriate filter settings based on your understanding of the problem. Hopefully this helps shed some light on your options.


Thank you, this is very helpful.

Since we have 6 samples, for variants that are shared between at least 2 samples I would not filter them out at all, because you would not expect the same error to appear at the same locus in 2 independent samples. What are your thoughts?
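The "shared between at least 2 samples" rescue idea above could be sketched as follows; the variant keys and sample sets are illustrative assumptions, not the poster's actual data:

```python
# Keep any variant, keyed by (chrom, pos, ref, alt), that is called in at
# least min_samples independent samples: the same sequencing error is
# unlikely to recur at the same locus in two separate libraries.
from collections import Counter

def shared_variants(per_sample_calls, min_samples=2):
    """per_sample_calls: list of sets of (chrom, pos, ref, alt) tuples."""
    counts = Counter(v for calls in per_sample_calls for v in calls)
    return {v for v, n in counts.items() if n >= min_samples}

s1 = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}
s2 = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "C")}
s3 = {("chr2", 40, "G", "C")}
print(shared_variants([s1, s2, s3]))  # the two variants seen in 2+ samples
```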

I could expand my analysis by including an additional 7th sample that was sequenced with 454 at low coverage, about 30x. Since this is not Illumina like my 6 other samples, using your mentioned parameters should help.

For PCR duplicates, in some samples it found 20% duplicates... so it was not a bad idea to run that.

I am going to take your advice and avoid trimming/filtering reads, since FastQC analysis does not show any systematic bias.


I am a bit confused about the way to filter PCR duplicates. You seem to advise filtering by QUAL but not by the duplicate marker. Similarly, in the FreeBayes README, in the paragraph "Calling variants: from fastq to VCF", the tutorial advises marking duplicates and then running FreeBayes directly.

https://github.com/ekg/freebayes/blob/master/README.md

Is that really the case? Does FreeBayes take the marked duplicates into account? Or should we filter clonality ourselves (remove duplicates from the BAM file)?


Ah, found the answer to my question:

Reads marked as duplicates in the BAM file are ignored, but this can be disabled for testing purposes by providing --use-duplicate-reads.
7.4 years ago
Adrian Pelin ★ 2.4k

When mapping paired-end data to a genome, would anyone consider only mapping one read if the paired mate as well maps? In other words, Read/1 can only map if Read/2 also maps within the required distance?


Most read mappers do something like this, but rather than not aligning anything, the aligner will typically use flags to let the user know the status of the mapping pair.
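Concretely, paired-end aligners record this status in the SAM FLAG field: bit 0x2 means "read mapped in proper pair" (both mates mapped within the expected distance and orientation), so you can filter on it downstream, e.g. with `samtools view -f 2`. A small sketch of the flag check (the example flag values are illustrative):

```python
# SAM FLAG bit 0x2 = "read mapped in proper pair". Rather than refusing to
# align a read whose mate fails, aligners report both and let the user
# filter on this bit.

PROPERLY_PAIRED = 0x2

def is_properly_paired(flag):
    """True if the SAM FLAG has the proper-pair bit set."""
    return bool(flag & PROPERLY_PAIRED)

print(is_properly_paired(99))  # True:  99 = 0x1 + 0x2 + 0x20 + 0x40
print(is_properly_paired(73))  # False: 73 = 0x1 + 0x8 + 0x40 (mate unmapped)
```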
