Hard filtering results of VCF file for multiple samples
5.7 years ago
seta ★ 1.9k

Hi all,

Suppose there are multiple samples, say 500, that were used for variant calling with HaplotypeCaller (GATK), followed by joint genotyping to produce the final VCF file. Hard filtering can now be applied. From what I read, following hard filtering of variants, the "Filter column" is added to the VCF file, indicating whether each variant is PASS or filtered, and only the PASS variants should be used for downstream analysis. I think a filter flag means that the variant was filtered in all 500 samples, is that right? However, I cannot understand how that can happen: how is it possible for variant X to have low quality in all samples? Could you please clarify this for me?

Many thanks in advance

variant calling GATK hard filtering • 4.1k views

Does hard filtering not mean that non-PASS entries will be removed? Is the VCF FILTER column not a soft filter, which marks but doesn't remove?

I think your filter definition determines what it means. The way I understand FILTER is: you can define a filter F such as "N% of samples have attribute X that is of the sort Y", and variants that fail it will be marked with that filter flag.


Hi Ram,

Thanks for the comment. So it is reasonable to use variants marked with a FILTER flag for further analysis, as you mentioned, yes? Could you please tell me why this filtering is applied and what its benefit is? I'm looking for a guide to filtering variants derived from whole-genome sequencing of a given population. Any suggestions would be highly appreciated.


I think finswimmer's answer below covers your questions. Please check out the links and let us know if you have any further questions.

5.7 years ago

Hello seta,

You sound a little confused about this topic. Let's try to sort it out :)

Filtering after variant calling is done to remove false-positive variants. If you follow the GATK pipeline there are two ways:

  1. VQSR: this is only applicable to larger sequencing projects
  2. manual filtering based on specified criteria

Manual filtering can be divided into hard filtering (meaning variants will be removed) and soft filtering (meaning variants will be kept and flagged).

following hard-filtering of variants, the "Filter column" is added to the vcf file

Strictly speaking, this is not correct. The FILTER column is already there, as it is a mandatory column (see specs). Depending on the variant caller, the value in this column is set to . or PASS. You can now define criteria under which you believe a variant in your VCF file is a false positive. With soft filtering you add a name to the FILTER column, so you can see why you believe the variant isn't real. Normally you would then continue your downstream analysis only with the variants that have the flag PASS (meaning they don't match any soft-filtering criterion).
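To make the soft-filtering idea concrete, here is a minimal Python sketch. It is not how GATK or bcftools implement it; it only works on plain tab-separated VCF record lines and writes the names of the failed filters into the FILTER column. The filter names `LowQD` and `LowMQ`, the thresholds, and the helper names `parse_info` and `soft_filter` are all made up for illustration.

```python
def parse_info(info_field):
    """Turn an INFO string like 'DP=100;QD=1.5;DB' into a dict."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        # Flag-type entries such as 'DB' have no value; store True for them.
        info[key] = value if value else True
    return info


def soft_filter(record_line, filters):
    """Return the record with failed filter names written to FILTER.

    `filters` maps a filter name to a function that takes the INFO dict
    and returns True when the variant FAILS that filter.
    """
    fields = record_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])
    failed = [name for name, fails in filters.items() if fails(info)]
    # FILTER is the 7th column (index 6): PASS or a ';'-separated list.
    fields[6] = ";".join(failed) if failed else "PASS"
    return "\t".join(fields)


# Hypothetical filter names and thresholds, for illustration only.
FILTERS = {
    "LowQD": lambda info: "QD" in info and float(info["QD"]) < 2.0,
    "LowMQ": lambda info: "MQ" in info and float(info["MQ"]) < 40.0,
}

records = [
    "chr1\t100\t.\tA\tG\t50\t.\tQD=12.3;MQ=60.0",
    "chr1\t200\t.\tC\tT\t30\t.\tQD=1.1;MQ=35.0",
]
tagged = [soft_filter(record, FILTERS) for record in records]
for line in tagged:
    print(line.split("\t")[6])
```

Both records are kept; only the FILTER value changes, so you can still see afterwards which criterion a variant failed.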

Which criteria should you use to find false positives? That's the holy grail of variant calling. Nevertheless, GATK has a few recommendations on where to start. But don't take these as a gold standard: you will still have false positives in your list and may have filtered out some true positives.

If you already have something other than PASS or . in your VCF, look at the header. There should be a description of what each filter name means.
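As a rough sketch of how to look those names up, the following Python snippet collects the `##FILTER` lines from a list of header lines. It assumes the simple `##FILTER=<ID=...,Description="...">` form; the function name `filter_descriptions` and the example header are hypothetical.

```python
import re


def filter_descriptions(header_lines):
    """Map each FILTER ID found in the header to its Description text."""
    # Assumes the simple form: ##FILTER=<ID=name,Description="text">
    pattern = re.compile(r'^##FILTER=<ID=([^,>]+),Description="(.*)">\s*$')
    found = {}
    for line in header_lines:
        match = pattern.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


# Made-up header lines for illustration.
header = [
    '##fileformat=VCFv4.2',
    '##FILTER=<ID=LowQD,Description="QD < 2.0">',
    '##FILTER=<ID=LowMQ,Description="MQ < 40.0">',
    '##INFO=<ID=QD,Number=1,Type=Float,Description="Variant confidence by depth">',
]
descriptions = filter_descriptions(header)
for name, text in descriptions.items():
    print(name, "->", text)
```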

fin swimmer


Hi fin swimmer,

Many thanks for your nice explanation. As I am dealing with a highly diverse population, I think it's better to use manual filtering rather than VQSR. From what I read about hard filtering, variants that do not pass the pre-defined thresholds are tagged in the FILTER column and not removed, is that right? As I mentioned in my post, I want to know whether a FILTER flag means that the variant is filtered in all samples, is that right? So those variants really should be ignored in further analysis, as you also mentioned in your reply, yes? I'm still confused about how it is possible for a variant to have low quality in all samples.


As I read about hard-filtering, the variants do not pass the pre-defined threshold will be tagged as FILTER and does not remove, is it right?

Whether "hard filtering" means "tagged with a flag in the FILTER column" or "removed" depends on whom you talk to. GATK uses the term to distinguish it from VQSR, and means tagging. bcftools removes variants that do not pass the threshold unless you set the option for soft filtering; then it tags the variant with a flag in the FILTER column as well.
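The two meanings can be put side by side in a small Python sketch. This is only an illustration of "remove" versus "tag" on plain VCF record lines, not the behaviour of any particular tool; the function names, the `LowQual` filter name and the QUAL threshold of 30 are assumptions.

```python
def apply_filter(records, fails, name, soft=False):
    """Apply one site-level filter to tab-separated VCF records.

    soft=False: drop failing records (filtering in the "remove" sense).
    soft=True:  keep every record and write `name` or PASS to FILTER.
    Note that this simple sketch overwrites any existing FILTER value.
    """
    kept = []
    for line in records:
        fields = line.split("\t")
        failed = fails(fields)
        if soft:
            fields[6] = name if failed else "PASS"
            kept.append("\t".join(fields))
        elif not failed:
            kept.append(line)
    return kept


def low_qual(fields):
    """Hypothetical criterion: QUAL (6th column) below 30."""
    return float(fields[5]) < 30.0


records = [
    "chr1\t100\t.\tA\tG\t50\t.\t.",
    "chr1\t200\t.\tC\tT\t12\t.\t.",
]
removed = apply_filter(records, low_qual, "LowQual")
tagged = apply_filter(records, low_qual, "LowQual", soft=True)
print(len(removed), len(tagged))
```

With tagging you can always get the "removed" result later by keeping only PASS lines, but not the other way round.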

as I mentioned in my post, I want to know FILTER means that variant(s) filtered in all samples, is it right?

This depends on what your VCF(s) look like and what your filtering criteria are. If you have one VCF per sample, then a variant might be filtered in one file but not in another. If you have multiple samples in one file, then the flag is set for the variant line and therefore for all samples.
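A short Python sketch of what "set for the variant line" means in a multi-sample file: each record has exactly one FILTER value, while the genotype fields are per sample. The sample names `S1` to `S3`, the record and the function name are invented for illustration.

```python
def site_filter_and_genotypes(header_line, record_line):
    """Return the site-level FILTER value and the per-sample genotype fields."""
    columns = header_line.lstrip("#").split("\t")
    fields = record_line.split("\t")
    samples = columns[9:]      # sample names start at the 10th column
    site_filter = fields[6]    # one FILTER value for the whole line
    genotypes = dict(zip(samples, fields[9:]))
    return site_filter, genotypes


# Made-up three-sample example.
header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
record_line = "chr1\t200\t.\tC\tT\t30\tLowQD\tQD=1.1\tGT:GQ\t0/1:99\t0/0:45\t1/1:12"

site_filter, genotypes = site_filter_and_genotypes(header_line, record_line)
print(site_filter)
print(genotypes)
```

There is no per-sample FILTER column here, so dropping or keeping the line drops or keeps the site for every sample at once.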

so, they really should ignore for further analysis as you also mentioned in your reply, yes?

There are different strategies, and they depend on the size of your dataset, your goal, and so on. One can first remove variants suspected to be false positives and then look for candidate variants. Or one can go the other way round: first look for candidate variants, then filter out those that might be false positives.

I'm still confused how is possible the variant(s) had the low quality in all samples?

As always: it depends. For example, due to pseudogenes or other homologous regions, the reads could have a low mapping quality. This is independent of the sample. There could be technical biases that lead to low base quality and wrong bases towards the ends of reads. You may filter out low-coverage regions (think of GC-rich regions). These are all examples of things you see in all samples.
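If you want to see how the individual samples look at a flagged site, one option is to summarise the per-sample FORMAT fields yourself. The Python sketch below computes the fraction of samples with a low genotype quality (GQ) at one record. It is only a rough illustration: the GQ cutoff of 20, the function name and the example record are assumptions, and it expects GQ to be present as an integer in the FORMAT column.

```python
def low_gq_fraction(record_line, min_gq=20):
    """Fraction of samples with a GQ value that is below `min_gq` at one site.

    Returns None if the record has no GQ field or no sample reports a GQ.
    """
    fields = record_line.split("\t")
    keys = fields[8].split(":")
    if "GQ" not in keys:
        return None
    gq_index = keys.index("GQ")
    with_gq = 0
    low = 0
    for sample in fields[9:]:
        values = sample.split(":")
        # Trailing FORMAT fields may be dropped, and missing values are '.'.
        if gq_index >= len(values) or values[gq_index] == ".":
            continue
        with_gq += 1
        if int(values[gq_index]) < min_gq:
            low += 1
    return low / with_gq if with_gq else None


# Made-up record with four samples; the last one has no genotype call.
record_line = (
    "chr1\t200\t.\tC\tT\t30\tLowMQ\tMQ=35.0\tGT:DP:GQ"
    "\t0/1:20:99\t0/0:15:45\t1/1:3:12\t./.:0:."
)
fraction = low_gq_fraction(record_line)
print(fraction)
```

Here only one of the three samples with a GQ value falls below the cutoff, although the site as a whole carries a filter flag, because the flag comes from a site-level annotation and not from the individual genotypes.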

fin swimmer


Thank you very much, fin swimmer, for getting back to the post. I use GATK for this work. Actually, there are about 1200 samples in one VCF file, so a filtered variant is filtered for all samples, as you kindly explained. Based on your post, there are several reasons why a variant might not pass the quality filters. However, how can someone check whether the filtered variants are really low quality in all samples, say 1200 samples? Given the sample size, in addition to hard filtering in GATK and keeping just the PASS variants, could you please tell me which criteria (filters) should be applied to the VCF file (one VCF file for all samples) containing PASS variants before further analysis?


Thanks, I will read them and let you know.
