Question: Converting Vcftools output to R readable format
1
gravatar for pifferdavide
3.8 years ago by
pifferdavide90
Italy
pifferdavide90 wrote:

I have used the following Vcftools code to calculate Fst distances from 1000 Genomes Chr 1 data.

c:/path --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2

I need to extract the 3 variance components, namely a,b,c (between populations; between individuals within populations; between gametes within individuals within populations) that this formula uses to get to the final result.

However, Vcftools does not provide this function. I have been suggested http://stackoverflow.com/questions/30122116/extrapolating-variance-components-from-weir-fst-on-vcftools to use Hierfstat R package. However, Hierfstat does not read .vcf files. Thus, 

  1. use vcftools with --012 to get genotypes
  2. convert 0/1/2/-1 to hierfstat format (eg., 11/12/22/NA)
  3. load the data into hierfstat and compute (see below)

I have carried out step 1. But now I need to carry out step 2. I am not familiar with Hierfstat package. Can someone tell me how to convert the Vcftools output files (out.012, out.012.indv, out.012.pos) to hierfstat format ?

 

 

 

snp vcftools R vcf • 3.8k views
ADD COMMENTlink modified 3.8 years ago by Chris F.10 • written 3.8 years ago by pifferdavide90
2

If you are familiar with R, you could consider reading the VCF files directly into R and manipulating them there.  The VariantAnnotation Bioconductor package has a readVCF function to do just that.

ADD REPLYlink written 3.8 years ago by Sean Davis25k

I installed VariantAnnotation.I followed instructions to run the readVcf command at: http://www.bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf

I set the working directory to the folder containing the vcf file and ran the following code. I guess it is telling me that my RAM memory (8gb) is too small to open Chr1? 

> library(VariantAnnotation)
> fl <-  ("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf ")
> vcf <- readVcf(fl, "hg19")
Error: scanVcf: 'Realloc' could not re-allocate memory (5128192000 bytes)
  path: C:\Users\Davide\vcf1\ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  Reached total allocation of 8077Mb: see help(memory.size)
2: In doTryCatch(return(expr), name, parentenv, handler) :
  Reached total allocation of 8077Mb: see help(memory.size)

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by pifferdavide90

Yes, you are correct about memory usage.  You can use params to control what is read in.  In particular, if you simply need genotypes and not the info() fields, check out readGT() or readGeno().

ADD REPLYlink written 3.8 years ago by Sean Davis25k

> fl <-  ("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
> vcf <- readGT(fl, "hg19")
Error: scanVcf: _DNAencode(): key 48 not in lookup table
  path: C:\Users\Davide\Documents\genomes\chr1\ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
In addition: Warning message:
In .vcf_usertag(map, tag, ...) :
  ScanVcfParam ‘geno’ fields not present: ‘GT’
> vcf
Error: object 'vcf' not found

ADD REPLYlink written 3.8 years ago by pifferdavide90

See my example in the original post at http://stackoverflow.com/questions/30122116/extrapolating-variance-components-from-weir-fst-on-vcftools.  Should get your nearly all the way there:

 

 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Chris F.10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1149 users visited in the last hour