Converting a VCF with SNPs and indels to BED format
2
4
Entering edit mode
7.4 years ago
gareth862 ▴ 90

Does anyone know of tools or have custom scripts (aside from "vcf2bed") that are able to convert between a VCF containing both indels and SNPs into BED format? The tricky part is creating the correct BED regions to capture the indel variants, and I haven't been able to find anything on the internet thus far.

next-gen • 23k views
ADD COMMENT
3
Entering edit mode

Hello gareth862!

It appears that your post has been cross-posted to another site: SEQanswers.

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode

What result are you expecting?

ADD REPLY
1
Entering edit mode

We are trying to get the "affected positions" from the vcf. Our plan was to use bed tools to intersect our bed file with the vcf affected positions to produce a bed file of reference ranges

  • If the allele matches the ref then it wouldn't write a range for it
  • If deletion the range would include all deleted positions
  • If a variation the range would be only the changed bases
  • 1 vcf line can generate multiple bed lines
ADD REPLY
8
Entering edit mode
7.4 years ago

The vcf2bed script will handle the last three issues:

  • Use --deletions to convert deletions to one set
  • Use --snvs to convert single-nucleotide variants to convert to a second set
  • Conversion of one VCF line will generate multiple BED lines if there are multiple alternate alleles

You can use bedops --everything to union the deletion and single-base variant sets:

$ vcf2bed --deletions < foo.vcf > foo_deletions.bed
$ vcf2bed --snvs < foo.vcf > foo_snvs.bed
$ bedops --everything foo_{deletions,snvs}.bed > foo.bed

Once you have your deletions and SNVs in one file, it's easy to deal with the first issue. The conversion script maps the REF and ALT fields of the VCF input to the sixth and seventh columns of the BED output. So you can filter out lines from the BED output where those columns are equivalent, using a simple awk statement:

$ awk '$6 != $7' foo.bed > foo_filtered.bed
ADD COMMENT
0
Entering edit mode

Alex, I get following error with the above command.

Error: Could not allocate space for VCF INFO string

I have updated the bedops to the latest version (2.4.28) as you mentioned here. This update resolves the same error from convert2bed command. But the issue with vcf2bed remains. I would like to use vcf2bed over convert2bed since the deletions and the SNVs can be processed separately using vcf2bed.

Any suggestions?

ADD REPLY
0
Entering edit mode

vcf2bed is just a wrapper that calls convert2bed. If the issue shows up with vcf2bed and does not happen with convert2bed, I would suggest there's a problem with your installation. I would look in /usr/local/bin and make sure that vcf2bed is pointing to the current convert2bed.

ADD REPLY
0
Entering edit mode

Oh yes. I am running the latest version locally (through directory). Its probably trying to access the old version which is installed globally.

ADD REPLY
1
Entering edit mode

The wrapper vcf2bed just uses whatever convert2bed it can find first in your environment PATH. You might try editing your PATH so that your local copy comes up first, or just upgrade the global copy.

Otherwise, you can use all the same functions in convert2bed as in vcf2bed, e.g.:

$ convert2bed --input=vcf --insertions < snps.vcf > insertions.bed

Etc.

You can run convert2bed --help-vcf for a listing of VCF-specific options, which should be exactly those offered by the wrapper script.

ADD REPLY
0
Entering edit mode

I used the same functions in the convert2bed as you mentioned. It works clean. Thanks!

ADD REPLY
1
Entering edit mode

Great to hear! Feel free to follow up if you have any more questions or problems.

ADD REPLY
0
Entering edit mode

Sorry,

I have some .vcf files contain only snp (I guess because I have separate vcf files for somatic indels and structural variants)

I used this command to convert vcf to bed

vcf2bed --snvs < foo.vcf > foo_snvs.bed

My bed does not have any column named strand

How I can extract the stand column from vcf please?

Thanks

ADD REPLY
1
Entering edit mode

I think, but am not 100% positive, that variant alleles are called against the forward strand-oriented reference base. So you could simply assume a "+".

Verifying this may be worth a separate question on biostars, and there are many people here who would likely be able to confirm this — or tell you that I'm wrong, which happens often. I'd double-check what I'm saying here.

If you must have a strand field in a BED file, and you can assume use of "+" (forward-oriented), and you don't mind losing the existing columns 6 and so on from what comes out of vcf2bed, then you could perhaps do something like the following:

$ cut -f1-5 foo_snvs.bed | awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, $4, $5, "+"; }' > foo_snvs.bed6

You could do some more work with awk if you need to insert the strand column and preserve the existing columns. I'll leave that as an exercise.

ADD REPLY
0
Entering edit mode

Sorry

I don't know I got error

[fi1d18@cyan01 ~]$ module load bedtools/2.21.0
[fi1d18@cyan01 ~]$ $ cut -f1-5 foo_snvs.bed | awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, $4, $5, "+"; }' > foo_snvs.bed6
bash: $: command not found
[fi1d18@cyan01 ~]$
ADD REPLY
1
Entering edit mode

Please try removing the leading "$" from:

$ cut -f1-5 foo_snvs.bed | awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, $4, $5, "+"; }' > foo_snvs.bed6
ADD REPLY
0
Entering edit mode
17 months ago
svp ▴ 450

Try the following code:

awk '! /\#/' input.vcf | awk '{if(length($4) > length($5)) print $1"\t"($2-1)"\t"($2+length($4)-1); else print $1"\t"($2-1)"\t"($2+length($5)-1)}' > output.bed

This should work!

ADD COMMENT
1
Entering edit mode

Offering a slight improvement on svp's answer:

bcftools norm -m- input.vcf | grep -v '^#' | awk -v OFS='\t' '{if(length($4) > length($5)) print $1,$2,$2+length($4)-1; else print $1,$2-1,$2+length($5)-1}' | bedtools merge -i - > output.bed

A few key points:

  • breaking multi-allelic variants so that length($4) and length($5) are correct
  • for deletions (length($4) > length($5)), the deleted base actually starts at the vcf coordinate +1 (e.g. this is a 1bp deletion at position 101: chr1 100 AT A) so the correct bed 0-based coordinate is $2,$2+length($4)-1 as opposed to $2-1,$2+length($4)-1
  • we can also merge overlapping intervals using bedtools merge
ADD REPLY

Login before adding your answer.

Traffic: 2630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6