Question: Bcftools view. FREQ inclusion error.
0
gravatar for tcom111
4 months ago by
tcom11110
tcom11110 wrote:

Dear all, I designed a scipt (running from Jupyter notebook, Python3) that takes vcf files of choice (by some string filters) and generates their gzipped version after filtering using BCFTOOL view -i * -t** supplied by the user. When using an inclusion of depth e.g DP > 100, the script works and filters all instances <=100.

Example:

bcftools view -Oz /path/file.vcf -o /path/file.vcf.gz  -i 'DP>100' -t 5:25000000-25001000

*This yield the expected outcome. A filtered VCF file.

However I get errors when trying FREQ > 0.1 instead:

Example

bcftools view -Oz /path/file.vcf -o /path/file.vcf.gz   -i 'FREQ>0.1' -t 5:25000000-25001000
Wrong operator in string comparison: FREQ > 0.3  [(null),0.39%]
[E::main_vcfindex] unknown filetype; expected bgzip compressed VCF or BCF
[E::main_vcfindex] was the VCF/BCF compressed with bgzip?

I've tried many instances of this line but nothing worked (e.g. -i FMT/FREQ>0.1). Does someone know how to solve this issue? (I originally planned to run both inclusion together with an "&" operator.

Thanks!

ADD COMMENTlink modified 4 months ago • written 4 months ago by tcom11110

Does your VCF even have the FREQ variable encoded? Do you not mean AF (allele frequency)?

ADD REPLYlink written 4 months ago by Kevin Blighe28k

Yes, Here is the format description in the vcf file: GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR

ADD REPLYlink written 4 months ago by tcom11110

Can you follow-up on the answer given by sm.hashemin? My time is currently very limited.

Is FREQ defined in your VCF header? Can you paste a few sample variants?

ADD REPLYlink written 4 months ago by Kevin Blighe28k
3
gravatar for sm.hashemin
4 months ago by
sm.hashemin60
sm.hashemin60 wrote:

Hi there. You must define FREQ as

FORMAT= ID=FREQ,Number=.,Type=Float,Description=...

in the VCF header! As long as FREQ is an integer you can not do arithmetic stuff with it! Besto

ADD COMMENTlink modified 4 months ago by Kevin Blighe28k • written 4 months ago by sm.hashemin60

Dear sm.hashemin, Thanks for your reply. Currently the header looks as follows:

##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">

I switched the "String" to "Float" and also tried replacing the "1" to "." I doesn't work and also generates this error during the generation of the filtered file using the bcftools view command:

[E::vcf_parse_format] Invalid character '%' in 'FREQ' FORMAT field at  ....

Should I omit the "<" as well? Thanks again.

ADD REPLYlink modified 4 months ago by Kevin Blighe28k • written 4 months ago by tcom11110
1

Hello tcom111,

please use the code button (the one with 101010) to format code. That makes your post more readable.

To your problem:

  • There are to many " in your header line. Look at the value of description. That looks weird.
  • Take care of upper and lower cases. freq is not the same as FREQ.

So the correct header line should be like this:

##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="variant allele frequency">

fin swimmer

ADD REPLYlink modified 4 months ago • written 4 months ago by finswimmer5.4k

tcom111, can you please try the suggestions by both finswimmer and sm.hashemin and then please report back here about what has actually solved your problem?

ADD REPLYlink written 4 months ago by Kevin Blighe28k

Hello Kevin,

have you edited tcomm111 post to correct maybe some wrong parsing through to forum software? At the time of writing my comment, the header line that tcom111 presented was written in lower cases and there were to many " Now it isn't anymore and my comment made little sense.

fin swimmer

ADD REPLYlink written 4 months ago by finswimmer5.4k

I did edit some posts in this thread in response to what you had said, finswimmer, in relation to wrapping text/code via the 101 010 button. When we use this function, however, some formatting can become hidden or lost, including quotation marks and parentheses positioned in a certain way.

I would not worry about some of your comments becoming redundant. This is common on this website. Comments frequently look out of place.

ADD REPLYlink written 4 months ago by Kevin Blighe28k

To clarify: I only edited tcom111's posts by wrapping certain text with the 101 010 button. tcom111 her/himself may have edited further things.

ADD REPLYlink written 4 months ago by Kevin Blighe28k

Hi there try using the

 bcftools filter --include 'FREQ>0.1' inputfile.vcf --output outputfile.vcf

hope in works Best Mo

ADD REPLYlink modified 4 months ago by Kevin Blighe28k • written 4 months ago by sm.hashemin60
2
gravatar for tcom111
4 months ago by
tcom11110
tcom11110 wrote:

[[Problem solved]]

Dear all,

Same as before, I switched to Float in accordance with finswimmer's suggestion:

##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="variant allele frequency">

The problem remained. However I figured out that the "%" makes the calculation (<,>,=) faulty. To cope this, I modified the VCF file where the "FREQ" data appears, and removed the "%" character from the entire file. This solved the issue (done in a short python script).

NOTE: It should be considered however when writing the inclusion/exclusion command: Consider the number in the FREQ position is already in it percentage value (e.g. write -i FREQ>50) for VAF>50%. I guess another way to handle it is by calculation of the VAF on the fly with the read integers of REF and ALT (I didn't try it but it should work)

Thanks all for your help!

ADD COMMENTlink written 4 months ago by tcom11110
1

Great - thanks for coming back with a detailed answer. Feel free to 'Accept' (your own answer) and/or up-vote other comments and answers that have helped. Note that sm.hashemin also suggested to use Float, even though that appeared to be just part of the issue.

ADD REPLYlink written 4 months ago by Kevin Blighe28k

Hello tcom111,

I should have asked this already, but the problem seems to me so obvious that I didn't...

How was your vcf generated? Even if you solved your problem now, could you please post the complete header of the vcf and the first few variants, that we can see how your original, unmodified file looks like?

fin swimmer

ADD REPLYlink written 4 months ago by finswimmer5.4k

I didnt get how your FREQ field looks like in your vcf file. Is there a % after the number?

ADD REPLYlink modified 4 months ago • written 4 months ago by sm.hashemin60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 957 users visited in the last hour