Question: Subset VCF file by comparing CHROM,POS,REF and ALT
1
gravatar for RamRS
2.7 years ago by
RamRS28k
Houston, TX
RamRS28k wrote:

Hello everyone,

I've looked at vcftools, bcftools, bedops, bedtools and GATK SelectVariants, but I don't see how I can address my requirement - I hope the community can help.

I want to subset a VCF file based on 4 attributes: CHROM, POS, REF and ALT. I have 2 VCF files, and I wish to exclude all entries from VCF1 where it matches those 4 values in VCF2. Most of the above-mentioned tools work only on CHROM and POS, even if they accept a VCF file to filter by. There is no way they compare an exact match to ALT allele, even if both VCF inputs were processed to split all multi-allelic sites.

The closest I can get to is by using bcftools annotate, and by copying over an INFO attribute (say, INFO/AC) with a new name (say, INFO/DUMMY_AC) so I can filter by that new name. The manual on bcftools annotate states:

When REF and ALT are present, only matching VCF records will be annotated.

which works for me when my ALT alleles are split, but does not help me filter, only mark them.

Is there any subset tool that will help me compare by custom attributes or do I have to write my own script for it?

Thank you!

--
Ram

subset vcf • 1.4k views
ADD COMMENTlink modified 2.7 years ago by Pierre Lindenbaum129k • written 2.7 years ago by RamRS28k
1

I'd say your requirements are sufficiently complicated to go to a scripting approach, for which I would use cyvcf2 (python).

ADD REPLYlink written 2.7 years ago by WouterDeCoster44k

HI Ram,

Extract and write records from A shared by both A and B using exact allele match

bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1

This should work !!

Thanks
Najeeb

ADD REPLYlink modified 23 months ago by RamRS28k • written 23 months ago by always_learning1.0k

Thank you! It's been a while since I did this exercise so I am not sure if I checked out bcftools isec (I most probably did and either didn't see this example or excluded it for a reason). Would you happen to know what exact allele match means? I don't see its definition anywhere in the documentation. Also, I'd like to exclude those positions, not pick the intersect (but maybe I could have piped it to another command that does the actual exclusion)

ADD REPLYlink written 23 months ago by RamRS28k

Just run bcftools isec on terminal and it gives exact "Extract and write records from A shared by both A and B using exact allele match" message.

ADD REPLYlink written 23 months ago by always_learning1.0k

How does this add to the conversation exactly? This is neither an explanation of the term "exact allele match" nor is it a pointer to excluding overlaps as opposed to picking them.

ADD REPLYlink written 23 months ago by RamRS28k
2
gravatar for Pierre Lindenbaum
2.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

I've written http://lindenb.github.io/jvarkit/VcfIn.html with the option:

  -A, --allalt
      ALL user ALT must be found in VCF-database ALT

however, i've not much used this tool.

ADD COMMENTlink written 2.7 years ago by Pierre Lindenbaum129k
2

Of course you've written a tool for this, Pierre! And referenced this post there. How do you do it so fast?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by RamRS28k
1

that's what she said.

ADD REPLYlink written 2.7 years ago by Pierre Lindenbaum129k
1

And reference this post there.

@Pierre is cross-referencing Biostars threads on individual tool pages so their practical use can be documented.

This tool was already written and waiting for you :)

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax87k

I made a typo - I meant "referenced", not "reference". He'd already cross-ref'd it and I was wondering how he did that so quickly! And yeah, I guessed the tool was already ready (given he didn't call it biostars287815 or some such, so it was the cross-referencing speed that I was shocked by :-)

ADD REPLYlink modified 23 months ago • written 2.7 years ago by RamRS28k
1

Just a note for future visitors: I'm accepting this answer because it makes sense, but I haven't tested it either. I wrote my own algorithm using bcftools annotate and a grep.

ADD REPLYlink written 2.7 years ago by RamRS28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1247 users visited in the last hour