Access data from very big vcf files in R
4.2 years ago
bisansamara ▴ 10

Hi, I have a very big vcf file (11.8 GB), the header and first row look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       13372   .       G       C       608.91  PASS    "AC=3;AC_AFR=0;AC_AMR=0

How can I access the #CHROM and POS columns?

Note that I cannot view it in Excel because it's too big. I have also tried the following, but neither attempt worked:

#1
> library(VariantAnnotation)
> vcfFile = system.file(package="VariantAnnotation", "extdata", "ExAC.r1.sites.vep.vcf.gz")
> scanVcfHeader(vcfFile)
Error in .io_check_exists(path(con)) : file(s) do not exist:
  ''

#2
> vcf<-readVcf("ExAC.r1.sites.vep.vcf.gz","hg19")
Error: cannot allocate vector of size 54 Kb

Any help is highly appreciated

R gene vcf chromosome ranges • 3.2k views

I would do this on the Linux command line, as shown below, but if you really need to read the file in R you can use fread() from the data.table package.

awk 'BEGIN{OFS="\t"} !/^#/ {print $1,$2}' <(gzip -dc yourfile.gz) | gzip > output.txt.gz

4.2 years ago
Ginsea Chen ▴ 130

You can extract the target columns with the following Linux shell command: zcat ExAC.r1.sites.vep.vcf.gz | tail -n +x | awk 'BEGIN{OFS="\t"}{print $1, $2}' > target.bed

Replace x with the line number of the first data line (the first line after the header); target.bed is the result file.
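
If you would rather not work out x by hand, it can be found from the file itself, because every VCF header line starts with #. Here is a minimal sketch, assuming bash, gzip, grep and cut, and a tab-delimited file as in standard VCF. The function names first_data_line and chrom_pos are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not part of any tool.

# Print the line number of the first non-header line of a gzipped VCF.
# grep -n prefixes each match with "<line number>:", -m 1 stops at the first match.
first_data_line() {
    gzip -dc "$1" | grep -n -m 1 -v '^#' | cut -d: -f1
}

# Print tab-separated CHROM and POS for every data line, skipping all header lines.
# cut splits on tabs by default, which matches the VCF column delimiter.
chrom_pos() {
    gzip -dc "$1" | grep -v '^#' | cut -f1,2
}
```

For example, first_data_line ExAC.r1.sites.vep.vcf.gz would print the value to use for x, and chrom_pos ExAC.r1.sites.vep.vcf.gz | gzip > target.txt.gz would write the two columns directly without needing x at all.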

This is a simple operation; you can contact me (cginsea@gmail.com) if you need any help with it.

4.2 years ago
d-cameron ★ 2.3k

You have insufficient memory to load the entire VCF at once. readVcf() takes an optional param argument (a ScanVcfParam object) that lets you specify not only the regions of the genome to load, but also which VCF fields to load. By requesting as few regions and as few fields as you need, you can reduce the memory footprint of the loaded VCF.

If it's still too big to load, you could shrink your problem by only considering a subset of the data at any point in time (e.g. performing your analysis per chromosome).
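
One way to get per-chromosome pieces before anything is loaded into R is to split the file on the CHROM column. This is a rough sketch, assuming bash, gzip and awk. The function name split_by_chrom and the chr<name>.txt output naming are made up for illustration, and only CHROM and POS are kept:

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and output layout are illustrative only.

# Split CHROM and POS from a gzipped VCF into one file per chromosome.
#   $1: gzipped VCF file
#   $2: output directory (created if missing)
# Files are opened in append mode, so start with an empty output directory.
split_by_chrom() {
    mkdir -p "$2"
    gzip -dc "$1" | awk -v dir="$2" 'BEGIN { OFS = "\t" }
        !/^#/ {
            # When the chromosome changes, close the previous file and switch.
            if ($1 != prev) {
                if (prev != "") close(out)
                prev = $1
                out = dir "/chr" $1 ".txt"
            }
            print $1, $2 >> out
        }'
}
```

Running split_by_chrom ExAC.r1.sites.vep.vcf.gz by_chrom would leave one small two-column file per chromosome in by_chrom/, each of which can then be read into R on its own.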

Alternatively, you can use a computer with more memory.
