Question: Hard filtering results of vcf file for multiple samples
14 months ago, seta (Sweden) wrote:

Hi all,

Assume there are multiple samples, say 500, that were used for variant calling with HaplotypeCaller (GATK) and then joint genotyping to produce the final vcf file. Now hard-filtering can be applied; following hard-filtering of variants, the "Filter column" is added to the vcf file, indicating whether each variant is PASS or FILTER, and as I read, only PASS variants should be used for downstream analysis. I think FILTER means that the variant is filtered in all 500 samples, is that right? However, I cannot understand how that can happen: how is it possible that variant X has low quality in all samples? Could you please clarify this for me?

Many thanks in advance

modified 14 months ago by finswimmer • written 14 months ago by seta

Does hard filtering not mean that non-PASS entries will be removed? Is VCF filter not a soft filter which marks but doesn't remove?

I think your filter defines what it means. The way I understand FILTER is: you can have a filter F = N% of samples have attribute X that is of the sort Y, and samples that fail that will be marked with that filter flag.

written 14 months ago by RamRS

Hi Ram,

Thanks for the comment. So it is logical to use variants marked with the FILTER flag for further analysis, as you mentioned, yes? Could you please tell me why this filtering is applied and what its benefit is? I'm looking for a guide to filtering variants derived from whole-genome sequencing of a given population. Any suggestion would be highly appreciated.

written 14 months ago by seta

I think finswimmer's answer below covers your questions. Please check out the links and let us know if you have any further questions.

modified 14 months ago • written 14 months ago by RamRS
14 months ago, finswimmer (Germany) wrote:

Hello seta,

you sound a little bit confused about this topic. Let's try to solve this :)

Filtering after variant calling is done to remove false-positive variants. If you follow the GATK pipeline there are two ways:

  1. VQSR: This is only applicable in larger sequencing projects
  2. manual filtering based on specified criteria

Manual filtering can be divided into hard-filtering (meaning variants will be removed) and soft-filtering (meaning variants will be kept and flagged).

> following hard-filtering of variants, the "Filter column" is added to the vcf file

Strictly speaking, this is not correct. The FILTER column is already there, as it is a mandatory column (see specs). Depending on the variant caller, the value in this column is set to . or PASS. You can now define criteria by which you believe a variant in your vcf file is a false positive. Using soft-filtering, you add a name to the FILTER column to record why you believe the variant isn't true. Normally you would then continue your downstream analysis only with those variants that have the flag PASS (meaning they don't match any soft-filtering criteria).
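As a minimal sketch of "continuing only with PASS", assuming an uncompressed VCF held in a string: the helper names `tally_filters` and `pass_records` and the three example records are made up for illustration and do not come from any library.

```python
from collections import Counter

# Hypothetical two-sample VCF, invented for this example.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
chr1\t100\t.\tA\tG\t50\tPASS\tQD=12.0\tGT\t0/1\t1/1
chr1\t200\t.\tC\tT\t20\tLowQD\tQD=1.5\tGT\t0/1\t0/0
chr1\t300\t.\tG\tA\t35\t.\tQD=8.0\tGT\t0/0\t0/1
"""


def tally_filters(vcf_text):
    """Count the values found in the FILTER column (7th column) of each record."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.split("\t")
        counts[fields[6]] += 1
    return counts


def pass_records(vcf_text):
    """Return only the data lines whose FILTER value is exactly PASS."""
    return [
        line
        for line in vcf_text.splitlines()
        if line and not line.startswith("#") and line.split("\t")[6] == "PASS"
    ]


counts = tally_filters(VCF_TEXT)
kept = pass_records(VCF_TEXT)
print(dict(counts))
print(len(kept))
```

Note that a `.` in the FILTER column means no filter has been applied yet, so whether such records count as usable is a decision for the analyst; this sketch keeps only explicit PASS records.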

Which criteria should you use to find false positives? That's the holy grail of variant calling. Nevertheless, GATK has a few recommendations on where to start. But don't treat them as a gold standard: you will still have false positives in your list and may have filtered out some true positives.

If you already have something other than PASS or . in your vcf, look into the header. There should be a description of what each filter name means.
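A small sketch of looking up filter names in the header, assuming the `##FILTER` lines follow the common `ID=...,Description="..."` layout with no extra keys. The function name `filter_descriptions` and the header content are hypothetical.

```python
import re

# Hypothetical header lines, invented for this example.
HEADER = """\
##fileformat=VCFv4.2
##FILTER=<ID=LowQD,Description="QD < 2.0">
##FILTER=<ID=HighFS,Description="FS > 60.0">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
"""

# Assumes ID comes first and Description is the last key on the line.
FILTER_LINE = re.compile(r'^##FILTER=<ID=([^,>]+),Description="(.*)">$')


def filter_descriptions(header_text):
    """Map each filter ID declared in the header to its description."""
    found = {}
    for line in header_text.splitlines():
        match = FILTER_LINE.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


descriptions = filter_descriptions(HEADER)
for name, text in descriptions.items():
    print(name, "->", text)
```

Headers written by other tools may order the keys differently or add more of them, in which case the regular expression here would need to be relaxed.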

fin swimmer

modified 14 months ago by RamRS • written 14 months ago by finswimmer

Hi fin swimmer,

Many thanks for your nice explanation. As I am working with a highly diverse population, I think it's better to use manual filtering rather than VQSR. As I read about hard-filtering, the variants that do not pass the pre-defined threshold will be tagged as FILTER and not removed, is that right? As I mentioned in my post, I want to know whether FILTER means that the variant is filtered in all samples, is that right? So they really should be ignored for further analysis, as you also mentioned in your reply, yes? I'm still confused about how it is possible that a variant has low quality in all samples.

written 14 months ago by seta

> As I read about hard-filtering, the variants that do not pass the pre-defined threshold will be tagged as FILTER and not removed, is that right?

Whether "hard-filtering" means "tagged with a flag in the FILTER column" or "removed" depends on whom you talk to. GATK uses the term to distinguish it from VQSR, and there it means tagging. bcftools removes variants that do not pass the threshold unless you set the option for soft-filtering; then it tags the variant with a flag in the FILTER column as well.
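The two behaviours can be sketched in a few lines of plain Python. This is only an illustration of tagging versus removing, not how GATK or bcftools implement it; the function `apply_filter`, the filter name `LowQD` and the `QD < 2.0` threshold are assumptions chosen for the example.

```python
def parse_info(info):
    """Turn 'QD=1.5;FS=3.2' into {'QD': '1.5', 'FS': '3.2'}; bare flags map to True."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def apply_filter(records, min_qd=2.0, name="LowQD", soft=True):
    """Apply a QD threshold to tab-split VCF records (lists of strings).

    soft=True  -> failing records are kept and tagged in the FILTER column.
    soft=False -> failing records are dropped from the output.
    """
    result = []
    for fields in records:
        qd = parse_info(fields[7]).get("QD")
        fails = qd is not None and qd is not True and float(qd) < min_qd
        updated = list(fields)  # copy so the input is not modified
        if not fails:
            if updated[6] == ".":
                updated[6] = "PASS"  # evaluated and nothing failed
            result.append(updated)
        elif soft:
            # Replace '.' or 'PASS', otherwise append to the existing filter names.
            if updated[6] in (".", "PASS"):
                updated[6] = name
            else:
                updated[6] = updated[6] + ";" + name
            result.append(updated)
        # hard filtering: a failing record is simply not appended
    return result


# Hypothetical records, invented for this example.
records = [
    "chr1\t100\t.\tA\tG\t50\t.\tQD=12.0\tGT\t0/1\t1/1".split("\t"),
    "chr1\t200\t.\tC\tT\t20\t.\tQD=1.5\tGT\t0/1\t0/0".split("\t"),
]
soft_result = apply_filter(records, soft=True)
hard_result = apply_filter(records, soft=False)
print([r[6] for r in soft_result])
print(len(hard_result))
```

With tagging, both records survive and the reason for failure stays visible; with removal, the failing record is gone and cannot be re-examined later without the original file.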

> As I mentioned in my post, I want to know whether FILTER means that the variant is filtered in all samples, is that right?

This depends on what your vcf(s) look like and what your filtering criteria are. If you have one vcf per sample, then a variant may be filtered in one file but not in another. If you have multiple samples in one file, then the flag is set for the variant line and therefore for all samples.

> So they really should be ignored for further analysis, as you also mentioned in your reply, yes?

There are different strategies, and they depend on the size of your dataset, the goal, .... One can first remove variants that are suspected false positives and then look for candidate variants. Or one can go the other way round: first look for candidate variants and then filter out those that might be false positives.

> I'm still confused about how it is possible that a variant has low quality in all samples.

As always: it depends. For example, due to pseudogenes or other homologous regions, reads could have a low mapping quality. This is independent of the sample. There could be technical biases that lead to low base quality and wrong bases towards the end of reads. You might filter out low-coverage regions (think of GC-rich regions). These are all examples of things you see in all samples.
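In a multi-sample vcf the FILTER value belongs to the whole line, while the per-sample columns can still differ. A rough sketch of inspecting those columns for one flagged record, assuming the FORMAT column contains integer `DP` and `GQ` fields; the helper names and the example record are hypothetical.

```python
def per_sample_values(header_line, record_line, key):
    """Return {sample name: raw value} for one FORMAT key of one VCF record."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if key not in format_keys:
        return {}
    idx = format_keys.index(key)
    values = {}
    for name, column in zip(samples, fields[9:]):
        parts = column.split(":")
        # Trailing FORMAT fields may be omitted; treat them as missing (".").
        values[name] = parts[idx] if idx < len(parts) else "."
    return values


def fraction_below(values, threshold):
    """Fraction of samples with a non-missing integer value below the threshold."""
    numeric = [int(v) for v in values.values() if v != "."]
    if not numeric:
        return None
    return sum(v < threshold for v in numeric) / len(numeric)


# Hypothetical three-sample record flagged with a site-level filter.
HEADER_LINE = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
RECORD_LINE = "chr1\t200\t.\tC\tT\t20\tLowQD\tQD=1.5\tGT:DP:GQ\t0/1:4:12\t0/0:30:99\t./.:.:."

gq = per_sample_values(HEADER_LINE, RECORD_LINE, "GQ")
low_gq_fraction = fraction_below(gq, 20)
print(gq)
print(low_gq_fraction)
```

Site-level annotations in the INFO column, such as QD, are computed across all samples, so a flagged line reflects a judgment about the site as a whole and does not require every individual genotype to look poor.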

fin swimmer

modified 14 months ago • written 14 months ago by finswimmer

Thank you very much, fin swimmer, for getting back to the post. I use GATK for the work. Actually, there are about 1200 samples in one vcf file, so a filtered variant is filtered for all samples, as you kindly explained. Based on your post, there are several reasons why a variant may not pass the quality filters; however, how can someone check whether the filtered variants are really low-quality in all samples, say 1200 samples? Given the sample size, in addition to hard-filtering in GATK and considering just PASS variants, could you please tell me which criteria (filters) should be applied to the vcf file (one vcf file for all samples) containing PASS variants before further analysis?

written 14 months ago by seta

Hello seta,

do these two documents help?

written 14 months ago by finswimmer

Thanks, I will read them and let you know.

written 14 months ago by seta