Question: Subset VCF file by comparing CHROM,POS,REF and ALT
1
gravatar for RamRS
20 months ago by
RamRS23k
Houston, TX
RamRS23k wrote:

Hello everyone,

I've looked at vcftools, bcftools, bedops, bedtools and GATK SelectVariants, but I don't see how I can address my requirement - I hope the community can help.

I want to subset a VCF file based on 4 attributes: CHROM, POS, REF and ALT. I have 2 VCF files, and I wish to exclude all entries from VCF1 where it matches those 4 values in VCF2. Most of the above-mentioned tools work only on CHROM and POS, even if they accept a VCF file to filter by. There is no way they compare an exact match to ALT allele, even if both VCF inputs were processed to split all multi-allelic sites.

The closest I can get to is by using bcftools annotate, and by copying over an INFO attribute (say, INFO/AC) with a new name (say, INFO/DUMMY_AC) so I can filter by that new name. The manual on bcftools annotate states:

When REF and ALT are present, only matching VCF records will be annotated.

which works for me when my ALT alleles are split, but does not help me filter, only mark them.

Is there any subset tool that will help me compare by custom attributes or do I have to write my own script for it?

Thank you!

--
Ram

subset vcf • 1.0k views
ADD COMMENTlink modified 20 months ago by Pierre Lindenbaum122k • written 20 months ago by RamRS23k
1

I'd say your requirements are sufficiently complicated to go to a scripting approach, for which I would use cyvcf2 (python).

ADD REPLYlink written 20 months ago by WouterDeCoster40k

HI Ram,

Extract and write records from A shared by both A and B using exact allele match

bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1

This should work !!

Thanks
Najeeb

ADD REPLYlink modified 12 months ago by RamRS23k • written 12 months ago by always_learning980

Thank you! It's been a while since I did this exercise so I am not sure if I checked out bcftools isec (I most probably did and either didn't see this example or excluded it for a reason). Would you happen to know what exact allele match means? I don't see its definition anywhere in the documentation. Also, I'd like to exclude those positions, not pick the intersect (but maybe I could have piped it to another command that does the actual exclusion)

ADD REPLYlink written 12 months ago by RamRS23k

Just run bcftools isec on terminal and it gives exact "Extract and write records from A shared by both A and B using exact allele match" message.

ADD REPLYlink written 12 months ago by always_learning980

How does this add to the conversation exactly? This is neither an explanation of the term "exact allele match" nor is it a pointer to excluding overlaps as opposed to picking them.

ADD REPLYlink written 12 months ago by RamRS23k
2
gravatar for Pierre Lindenbaum
20 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

I've written http://lindenb.github.io/jvarkit/VcfIn.html with the option:

  -A, --allalt
      ALL user ALT must be found in VCF-database ALT

however, i've not much used this tool.

ADD COMMENTlink written 20 months ago by Pierre Lindenbaum122k
2

Of course you've written a tool for this, Pierre! And referenced this post there. How do you do it so fast?

ADD REPLYlink modified 20 months ago • written 20 months ago by RamRS23k
1

that's what she said.

ADD REPLYlink written 20 months ago by Pierre Lindenbaum122k
1

And reference this post there.

@Pierre is cross-referencing Biostars threads on individual tool pages so their practical use can be documented.

This tool was already written and waiting for you :)

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax70k

I made a typo - I meant "referenced", not "reference". He'd already cross-ref'd it and I was wondering how he did that so quickly! And yeah, I guessed the tool was already ready (given he didn't call it biostars287815 or some such, so it was the cross-referencing speed that I was shocked by :-)

ADD REPLYlink modified 12 months ago • written 20 months ago by RamRS23k
1

Just a note for future visitors: I'm accepting this answer because it makes sense, but I haven't tested it either. I wrote my own algorithm using bcftools annotate and a grep.

ADD REPLYlink written 20 months ago by RamRS23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 793 users visited in the last hour