Question: Extract Sub-Set Of Regions From Vcf File
8
gravatar for Rubal7
6.1 years ago by
Rubal7690
Rubal7690 wrote:

Hello all,

Sorry if this has been answered before (though I have not found an appropriate answer already) What is an effective way to extract mutliple chromosomal regions from a vcf.gz file? I see that it can be done for single regions using tabix, but I have a file with several regions in a text file in the format:

chromosome:startposition-endposition eg: 19:47366525-47380539

Perhaps there is a way to refer to a file of regions using tabix, or vcf-tools? If anyone could provide an example command line that would be really helpful.

Thanks in advance!

vcf genome tabix filter • 28k views
ADD COMMENTlink modified 9 months ago by geocarvalho80 • written 6.1 years ago by Rubal7690
$ echo -e "chrN\t123\t456" | bedops -e 1 <(vcf2bed variants.bed) - > answer.bed
ADD REPLYlink written 5 months ago by Alex Reynolds24k
14
gravatar for Zev.Kronenberg
6.1 years ago by
United States
Zev.Kronenberg11k wrote:

I find tabix to be much easier / much sorter command line arguments.

You don't need a fasta for this method. Tabix is screaming fast.

bgzip your.vcf
tabix -p vcf your.vcf
tabix your.vcf.gz chr1:10,000,000-20,000,000
ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Zev.Kronenberg11k

Thanks, I agree it works fast. Shame it doesn't take a file with a whole list of regions. But I can put it into a loop.

ADD REPLYlink written 6.1 years ago by Rubal7690
2

Two ways: 1) if there are many regions: tabix -fB my.vcf.gz reg.bed 2) if you only have a handful of regions: awk '{print $1":"($2+1)"-"$3}' reg.bed | xargs tabix my.vcf.gz.

ADD REPLYlink written 4.6 years ago by lh330k
5

For anyone arriving years later, the command appears to have changed. The command you probably want is: tabix -R reg.bed my.vcf.gz.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by SES8.1k

None of these work anymore.

ADD REPLYlink written 2.0 years ago by MAPK1.2k
2
tabix -R region.txt my.vcf.gz

worked for me

ADD REPLYlink written 21 months ago by xuanbonn20
12
gravatar for gourneau
4.6 years ago by
gourneau150
San Francisco, CA
gourneau150 wrote:

For others if you like to use BED files, one way to do this would be like so:

vcftools --vcf input.vcf --bed bed_file_describing_the_range.bed --out output_prefix --recode --keep-INFO-all

Where bed_file_describing_the_range.bed has the positions of interest.

ADD COMMENTlink modified 3.3 years ago • written 4.6 years ago by gourneau150

I am trying to extract specific regions from a VCF file that has SNP information for 16 individuals. My bed file has three columns (scaffold name, start of region, end of region).

When I run vcftools on these files, it runs through the entire vcf file and determines that there are no regions from the bed file contained in the vcf file. A recode vcf file is output, but it only contains the header information from the vcf file.

ADD REPLYlink written 23 months ago by eabowman0

Check to ensure your scaffold names match in the BED and VCF files. For example, if your BED file has "chr1" but your VCF file just has "1" for chromosome 1 it will return an empty file.

ADD REPLYlink written 23 months ago by alolex890
1
gravatar for Johan
6.1 years ago by
Johan790
Sweden
Johan790 wrote:

You could use the GATK SelectVariants Walker (http://www.broadinstitute.org/gsa/gatkdocs/release/orgbroadinstitutestinggatkwalkersvariantutilsSelectVariants.html), specifying the -L option to set the intervals your are interested in.

Here is an example of how to do that, taken from the link above:

#Select a sample and restrict the output vcf to a set of intervals:     
java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T SelectVariants \
       --variant input.vcf \
       -o output.vcf \
       -L /path/to/my.interval_list \
       -sn SAMPLE_1_ACTG

Note that you have to have a fasta reference file with the same names and coordinates as you use in our region file and vcf, and that you can remove the -sn option if you want to extract the variations for all samples.

Hope this helps.

ADD COMMENTlink written 6.1 years ago by Johan790

thanks, I can potentially use this, but having some issues with the reference fasta, so a method that does not require a fasta reference would be preferred.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Rubal7690

I'm not sure if it requires uncompressing the vcf files. I'm not that familiar with vcftools, so I couldn't say if there is an easier way to do it using that suite.

ADD REPLYlink written 6.1 years ago by Johan790
1
gravatar for geocarvalho
9 months ago by
geocarvalho80
Brazil/Recife
geocarvalho80 wrote:

bedtools intersect worked better to me

intersectBed -a input.vcf -b /path/to/my.interval.bed -header > output.vcf

ADD COMMENTlink written 9 months ago by geocarvalho80

For some reason by intersected out VCF have more line compare to input vcf. :( :( No idea what could be the possible reason for this !! Any guess ?

ADD REPLYlink written 5 months ago by always_learning890
0
gravatar for Jeremy Leipzig
4.6 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

I did something like what you describe today, but to create multiple VCF files. Uses bedtools:

<script src="&lt;a href=" leipzig="" 7452498"="">leipzig/7452498"></script>

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Jeremy Leipzig17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1573 users visited in the last hour