Question

Problem importing a multiVCF with readData from the PopGenome R package

3.1 years ago

pavlo.maksimov • 0

Hi all, I am trying to import a multiVCF file (created with GATK, ~80 individuals, 4.3Gb) into R using the "readData" function of the "PopGenome" package. Unfortunately, the import always aborts with the error message: "R encountered a fatal error, session terminated". With a smaller data set it works fine. I also tried compressed VCFs (bgzip), which did not work either. Am I missing something? Does my PC not have enough computing resources? Has anyone had similar experiences, or does anyone know how to solve this problem? I would be very grateful for any advice.

Kind regards

Pavlo

My code:

    library(rtracklayer)  # provides readGFF() and export()

    # Write one GFF file per chromosome (`chromosomes` and `gff_path` are defined elsewhere)
    for (chr in chromosomes) {
      my_filter <- list(seqid = chr)
      gff3_out <- file.path(gff_path, paste0(chr, ".gff"))
      export(readGFF("/path/to/my/gff.gff", filter = my_filter), gff3_out)
    }

    # Split the multiVCF into one VCF per scaffold, then load the data
    PopGenome::VCF_split_into_scaffolds("my_multiVCF_from_GATK.vcf", "scaffoldVCFs2")
    allgenomes <- PopGenome::readData("path/to/data/with_VCFs", format = "VCF",
                                      gffpath = "path/to/data/gff_data", big.data = TRUE)

My PC: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz;

RAM: 32.0 GB (31.9 GB usable);

System type: 64-bit operating system, x64-based processor

R popgenome readData vcf multiVCF • 1.5k views

ADD COMMENT • link 3.1 years ago by pavlo.maksimov • 0

https://stackoverflow.com/q/66543754/680068

ADD REPLY • link 3.1 years ago by zx8754 11k

import a multiVCF file [...] (4.3Gb) into R

With a smaller data set it works fine

Does my PC not have enough computing resources?

Most likely that. Try checking memory usage while loading the data; that may help with the diagnosis. Anecdotally, I recently crashed my computer with 32 GB of memory when attempting a plotting operation on a smaller file.
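One way to check memory usage from inside R is sketched below. It is only a minimal sketch: `peak_memory_mb` is a hypothetical helper name, and the `numeric(1e6)` workload is a stand-in for the actual `readData` call. Note that `gc()` only reports memory managed by R itself; memory allocated by compiled code or by memory-mapped files would not show up here, so the operating system's task manager remains the more complete view.

```r
# Hypothetical helper: evaluate an expression and report R's peak heap usage.
peak_memory_mb <- function(expr) {
  invisible(gc(reset = TRUE))   # reset the "max used" counters
  result <- force(expr)         # arguments are lazy, so expr is evaluated here
  g <- gc()
  # The "(Mb)" column directly after "max used" holds the peak in megabytes
  max_col <- which(colnames(g) == "max used") + 1
  list(result = result, peak_mb = sum(g[, max_col]))
}

# Stand-in workload; replace with the PopGenome::readData(...) call
out <- peak_memory_mb(numeric(1e6))
cat(sprintf("Peak R heap usage: %.1f Mb\n", out$peak_mb))
```

If the session dies before the final `gc()` call, no number is printed, so this is mainly useful on subsets that still load successfully, to see how memory grows with input size.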

ADD REPLY • link 3.1 years ago by Carambakaracho ★ 3.2k

Hi Carambakaracho, thank you for your help.
Yes, you are right, it's probably because my 32 GB of RAM is not enough. It's funny, because the paper describing the package (Pfeifer et al., Mol. Biol. Evol. 31(7):1929-1936, doi:10.1093/molbev/msu136) says that even large data sets can be loaded without problems on normal PCs (8 GB RAM). I have now tried the same task on a Linux server and it works great. Who knows where exactly the problem is.

ADD REPLY • link 3.1 years ago by pavlo.maksimov • 0

I have now checked memory usage under Windows while the data set was being read by "readData", and it gave no indication that a lack of RAM is the cause of the crash.

ADD REPLY • link 3.1 years ago by pavlo.maksimov • 0

Unfortunately, the crash message is pretty generic, and it mostly points towards a memory issue.

Another approach is to hypothesize that one or more lines in your dataset contain an offending value. You could split the set in half and see whether you get the error with one half or the other, then split the failing half again, and so forth.

That said, I wouldn't expect the error message you got from faulty data.
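The halving strategy can be sketched as follows. This is a minimal illustration under two assumptions: exactly one item is faulty, and the full set is known to fail. `find_offender` and the `fails` predicate are hypothetical names. For a crash that terminates the whole R session, `fails` would have to run each attempt in a separate R process and inspect its exit status, since a dead session cannot report back.

```r
# Hypothetical helper: narrow down a single offending item by repeated halving.
# `fails` is an assumed predicate returning TRUE when a chunk triggers the error.
find_offender <- function(items, fails) {
  while (length(items) > 1) {
    half <- seq_len(length(items) %/% 2)
    first <- items[half]
    if (fails(first)) {
      items <- first          # the problem is in the first half
    } else {
      items <- items[-half]   # otherwise it must be in the second half
    }
  }
  items
}

# Toy example: pretend record 37 is the faulty one
bad <- 37
offender <- find_offender(1:100, function(chunk) bad %in% chunk)
print(offender)
```

In practice `items` could be the per-scaffold VCF files or ranges of lines within one file, and each call to `fails` would attempt the import on just that chunk.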

ADD REPLY • link 3.1 years ago by Carambakaracho ★ 3.2k