I am a not very experienced programmer analysing some data in VCF format.
I have a .txt file, 'regions.txt' that contains a list of chromosomal positions in the format 1:3435245-34378385 (chromosome: then window coordinates).
I would like to create a loop that does the following:
For each line in regions.txt, extract this region from vcf.gz file, then run vcftools --window-pi on the region (all regions are 10kb in size) and export results to fileofresults.txt
So that in the end there is a single text file with results for all the regions in the regions.txt file.
If anyone knows how to do this or something similar your help is really appreciated. (Or if you know why this would not be possible please let me know!)
This is a nice example where you should explore if there is any tool available to do your task more effectively than raw-text parsing of VCF files. VCF files have position/variant centric data and you have a list of regions as interval data. You can write a program to do a filtering in your language of interest or use existing tools.
I would recommend BEDtools for this task. (Manuscript here, Download here)
Make a list of your region(s) of interest and generate region specific data in a single BED file using UCSC Genome Browser - Table Browser interface.
( Read the manuscript describing Table Browser here and help files here )
Once you have your BED file ready, use intersectBed to find overlapping regions in your VCF file and region specific BED file.
VCFtools also have an option vcf-isec to do similar task. To do this you will need a VCF file of your region of interest.
Hmm thanks I will explore these suggestions. I should have made it clear I am trying to extract genotype information from multiple regions of a single vcf containing 1000 genomes data. I would then do some tests of diversity in these regions (one run of the test per region) then compare the results between regions. I suppose there is no simpler way such as specifying VCFtools only runs within a certain window in the VCF file?
If you are looking for a solution to extract multiple regions from a single VCF file, you can use TABIX as mentioned by Brent. http://biostars.org/post/show/47249/retrieve-subset-positions-vcf-file/. Have you tried that ? Also see this discussion on extracting genotype from VCF files here biostars.org/post/show/47153/how-to-get-the-genotype-from-sequence-data/
This returns BED-formatted VCF elements that overlap regions by one or more bases. The bedops tool provides other set operations and overlap criteria (you can, for example, specify percentage of overlap, up to full coverage).
Hmm thanks I will explore these suggestions. I should have made it clear I am trying to extract genotype information from multiple regions of a single vcf containing 1000 genomes data. I would then do some tests of diversity in these regions (one run of the test per region) then compare the results between regions. I suppose there is no simpler way such as specifying VCFtools only runs within a certain window in the VCF file?
If you are looking for a solution to extract multiple regions from a single VCF file, you can use TABIX as mentioned by Brent. http://biostars.org/post/show/47249/retrieve-subset-positions-vcf-file/. Have you tried that ? Also see this discussion on extracting genotype from VCF files here biostars.org/post/show/47153/how-to-get-the-genotype-from-sequence-data/