Question: Subset VCF file by comparing CHROM,POS,REF and ALT
1
gravatar for RamRS
15 months ago by
RamRS20k
Houston, TX
RamRS20k wrote:

Hello everyone,

I've looked at vcftools, bcftools, bedops, bedtools and GATK SelectVariants, but I don't see how I can address my requirement - I hope the community can help.

I want to subset a VCF file based on 4 attributes: CHROM, POS, REF and ALT. I have 2 VCF files, and I wish to exclude all entries from VCF1 where it matches those 4 values in VCF2. Most of the above-mentioned tools work only on CHROM and POS, even if they accept a VCF file to filter by. There is no way they compare an exact match to ALT allele, even if both VCF inputs were processed to split all multi-allelic sites.

The closest I can get to is by using bcftools annotate, and by copying over an INFO attribute (say, INFO/AC) with a new name (say, INFO/DUMMY_AC) so I can filter by that new name. The manual on bcftools annotate states:

When REF and ALT are present, only matching VCF records will be annotated.

which works for me when my ALT alleles are split, but does not help me filter, only mark them.

Is there any subset tool that will help me compare by custom attributes or do I have to write my own script for it?

Thank you!

--
Ram

subset vcf • 811 views
ADD COMMENTlink modified 15 months ago by Pierre Lindenbaum118k • written 15 months ago by RamRS20k
1

I'd say your requirements are sufficiently complicated to go to a scripting approach, for which I would use cyvcf2 (python).

ADD REPLYlink written 15 months ago by WouterDeCoster37k

HI Ram,

Extract and write records from A shared by both A and B using exact allele match

bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1

This should work !!

Thanks
Najeeb

ADD REPLYlink modified 6 months ago by RamRS20k • written 6 months ago by always_learning960

Thank you! It's been a while since I did this exercise so I am not sure if I checked out bcftools isec (I most probably did and either didn't see this example or excluded it for a reason). Would you happen to know what exact allele match means? I don't see its definition anywhere in the documentation. Also, I'd like to exclude those positions, not pick the intersect (but maybe I could have piped it to another command that does the actual exclusion)

ADD REPLYlink written 6 months ago by RamRS20k

Just run bcftools isec on terminal and it gives exact "Extract and write records from A shared by both A and B using exact allele match" message.

ADD REPLYlink written 6 months ago by always_learning960

How does this add to the conversation exactly? This is neither an explanation of the term "exact allele match" nor is it a pointer to excluding overlaps as opposed to picking them.

ADD REPLYlink written 6 months ago by RamRS20k
2
gravatar for Pierre Lindenbaum
15 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

I've written http://lindenb.github.io/jvarkit/VcfIn.html with the option:

  -A, --allalt
      ALL user ALT must be found in VCF-database ALT

however, i've not much used this tool.

ADD COMMENTlink written 15 months ago by Pierre Lindenbaum118k
2

Of course you've written a tool for this, Pierre! And referenced this post there. How do you do it so fast?

ADD REPLYlink modified 15 months ago • written 15 months ago by RamRS20k
1

that's what she said.

ADD REPLYlink written 15 months ago by Pierre Lindenbaum118k
1

And reference this post there.

@Pierre is cross-referencing Biostars threads on individual tool pages so their practical use can be documented.

This tool was already written and waiting for you :)

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax64k

I made a typo - I meant "referenced", not "reference". He'd already cross-ref'd it and I was wondering how he did that so quickly! And yeah, I guessed the tool was already ready (given he didn't call it biostars287815 or some such, so it was the cross-referencing speed that I was shocked by :-)

ADD REPLYlink modified 6 months ago • written 15 months ago by RamRS20k
1

Just a note for future visitors: I'm accepting this answer because it makes sense, but I haven't tested it either. I wrote my own algorithm using bcftools annotate and a grep.

ADD REPLYlink written 15 months ago by RamRS20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2330 users visited in the last hour