Iterating With Vcf Tools
2
1
Entering edit mode
11.9 years ago
Rubal7 ▴ 830

Hello all,

I am a not very experienced programmer analysing some data in VCF format. I have a .txt file, 'regions.txt' that contains a list of chromosomal positions in the format 1:3435245-34378385 (chromosome: then window coordinates).

I would like to create a loop that does the following: For each line in regions.txt, extract this region from vcf.gz file, then run vcftools --window-pi on the region (all regions are 10kb in size) and export results to fileofresults.txt So that in the end there is a single text file with results for all the regions in the regions.txt file.

If anyone knows how to do this or something similar your help is really appreciated. (Or if you know why this would not be possible please let me know!)

Thanks in advance for your help.

vcf genome • 4.3k views
ADD COMMENT
6
Entering edit mode
11.9 years ago

This is a nice example where you should explore if there is any tool available to do your task more effectively than raw-text parsing of VCF files. VCF files have position/variant centric data and you have a list of regions as interval data. You can write a program to do a filtering in your language of interest or use existing tools.

I would recommend BEDtools for this task. (Manuscript here, Download here)

  • Make a list of your region(s) of interest and generate region specific data in a single BED file using UCSC Genome Browser - Table Browser interface. ( Read the manuscript describing Table Browser here and help files here )

  • Once you have your BED file ready, use intersectBed to find overlapping regions in your VCF file and region specific BED file.

VCFtools also have an option vcf-isec to do similar task. To do this you will need a VCF file of your region of interest.

ADD COMMENT
0
Entering edit mode

Hmm thanks I will explore these suggestions. I should have made it clear I am trying to extract genotype information from multiple regions of a single vcf containing 1000 genomes data. I would then do some tests of diversity in these regions (one run of the test per region) then compare the results between regions. I suppose there is no simpler way such as specifying VCFtools only runs within a certain window in the VCF file?

ADD REPLY
0
Entering edit mode

If you are looking for a solution to extract multiple regions from a single VCF file, you can use TABIX as mentioned by Brent. http://biostars.org/post/show/47249/retrieve-subset-positions-vcf-file/. Have you tried that ? Also see this discussion on extracting genotype from VCF files here biostars.org/post/show/47153/how-to-get-the-genotype-from-sequence-data/

ADD REPLY
1
Entering edit mode
11.9 years ago

Convert your regions.txt and VCF files to BED files, and then use the bedops utility in the BEDOPS suite to perform an --element-of operation:

$ bedops --element-of -1 VCF.bed regions.bed > VCF_regions_of_interest.bed

This returns BED-formatted VCF elements that overlap regions by one or more bases. The bedops tool provides other set operations and overlap criteria (you can, for example, specify percentage of overlap, up to full coverage).

The helper script vcf2bed can help with the VCF-to-BED conversion step.

Once you have the BED-formatted VCF elements of interest, convert them back to VCF (essentially, you'd write a script that is the reverse of vcf2bed).

ADD COMMENT

Login before adding your answer.

Traffic: 2323 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6