Question

Read multiple vcf files into R

1

5 months ago

sousapaulo16 ▴ 20

Hello all,

I am trying to import 100 VCF files (extension .recode.vcf) into R without reading them in one at a time. All the files are in the same directory (~/Projects/VCF_output/) and have similar names: Replicate_1.recode.vcf, Replicate_2.recode.vcf, etc. I have tried a for (variable in vector) {} loop together with the read.vcfR function, after setting my working directory, but without success. Does anybody have a suggestion?

Thanks in advance

R vcf • 829 views

ADD COMMENT • link 5 months ago by sousapaulo16 ▴ 20

1

You can do something like this with ~/Projects/VCF_output/ as your working directory. I have not tested it, though.

library(vcfR)
temp = list.files(pattern="\\.recode\\.vcf$")  # escape both dots so they match literally
myfiles = lapply(temp, read.vcfR)
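
A small variation on the snippet above, offered as an untested sketch: wrapping it in a function that takes the directory as an argument avoids depending on the working directory, and naming the list elements after the files makes each replicate easy to look up (note that list.files() sorts Replicate_10 before Replicate_2). The helper name read_vcf_dir and its reader argument are hypothetical and exist only for this illustration; by default it calls vcfR::read.vcfR.

```r
# Hypothetical helper (not part of vcfR): read every *.recode.vcf file in a
# directory into a named list. `reader` defaults to vcfR::read.vcfR.
read_vcf_dir <- function(vcf_dir, reader = vcfR::read.vcfR) {
  files <- list.files(vcf_dir, pattern = "\\.recode\\.vcf$", full.names = TRUE)
  vcfs <- lapply(files, reader, verbose = FALSE)
  # Name each element after its file, e.g. "Replicate_1"
  names(vcfs) <- sub("\\.recode\\.vcf$", "", basename(files))
  vcfs
}

# Example use (assumes the directory from the question):
# myfiles <- read_vcf_dir("~/Projects/VCF_output")
# myfiles[["Replicate_1"]]
```

Keeping all 100 objects in memory may or may not be practical, depending on the size of the files.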

ADD REPLY • link 5 months ago by bk11 ★ 2.4k

1

Thanks to Francisco Pina Martins, I was able to import the 100 VCF files and perform the statistics that I intended.

The solution he found uses an R script that is run from this bash loop:

for i in *.vcf
do
Rscript my_Script.R "$i"
done

The R script is:

library(vcfR)
library(dartR)
setwd("~/Projects/VCF_output/")

args <- commandArgs(trailingOnly = TRUE) # args[1] is the VCF file name passed in by the bash loop

i <- read.vcfR(args[1], verbose = FALSE) # read the VCF file

i <- vcfR2genlight(i) # convert from vcfR to genlight format
# Check that the genlight object complies with dartR expectations and amend it if necessary
i_genli <- gl.compliance.check(i)

# Calculate Ho, He and FIS
i_stats <- gl.report.heterozygosity(i_genli, method='pop')
# Save the output (paste0 avoids the spaces that paste would insert into the file name)
write.csv(i_stats, paste0("outputs/", args[1], ".out.csv"))
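
Since the goal is to compare the statistics across replicates, the per-file CSVs written by this script could be stacked into a single data frame afterwards. The following is a sketch in base R only; the function name combine_outputs and the added source column are hypothetical, and it assumes every CSV in outputs/ has the same columns.

```r
# Hypothetical helper: stack all *.out.csv files from `out_dir` into one
# data frame, tagging each row with the VCF file it came from.
combine_outputs <- function(out_dir = "outputs") {
  csv_files <- list.files(out_dir, pattern = "\\.out\\.csv$", full.names = TRUE)
  tables <- lapply(csv_files, function(f) {
    # Strip the suffix; trimws() guards against stray spaces in file names
    src <- trimws(sub("\\.out\\.csv$", "", basename(f)))
    # write.csv() stored the row names in the first column
    cbind(source = src, read.csv(f, row.names = 1))
  })
  do.call(rbind, tables)
}

# Example use, run from ~/Projects/VCF_output/:
# all_stats <- combine_outputs("outputs")
```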

ADD REPLY • link 5 months ago by sousapaulo16 ▴ 20

0

Can you explain what you are going to do with the 100 VCFs? Do you want to keep them as separate objects read into memory? Or do you want to append them into a single vcfR object?

ADD REPLY • link 5 months ago by dthorbur ★ 1.9k

0

I will use a second R package, dartR, to compute basic statistics such as expected heterozygosity.

ADD REPLY • link 5 months ago by sousapaulo16 ▴ 20

0

That doesn't answer the question, though. It would be more memory-efficient to read in just one VCF, calculate the stats, emit the results, and then move on to the next. Do you need all the VCFs at once to compute these stats? Just trying to understand the problem.

ADD REPLY • link 5 months ago by dthorbur ★ 1.9k

0

No worries, and sorry for the incomplete answer. I am afraid that I do need all the VCF files. Each one of them is an independent data set. The idea is to compute the same statistics for each file and save those values so that I can compare them across VCF files. So basically, each VCF file has to produce 2 or 3 statistics that ideally will be saved in a data frame.
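
For what it is worth, here is an untested sketch of how that could look entirely inside R, processing one file at a time and stacking the results into one data frame. The function names het_from_vcf and collect_stats are hypothetical; the vcfR and dartR calls are the ones used elsewhere in this thread, and the sketch assumes that gl.report.heterozygosity returns a table that as.data.frame() can handle and that the functions work when called with the :: prefix.

```r
# Hypothetical helper: heterozygosity table for a single VCF file.
# Requires the vcfR and dartR packages when it is actually called.
het_from_vcf <- function(vcf_file) {
  vcf <- vcfR::read.vcfR(vcf_file, verbose = FALSE)
  gl <- dartR::gl.compliance.check(vcfR::vcfR2genlight(vcf))
  dartR::gl.report.heterozygosity(gl, method = "pop")
}

# Hypothetical helper: apply `stat_fun` to each file in turn, so that only
# one VCF is in memory at a time, and stack the results.
collect_stats <- function(files, stat_fun = het_from_vcf) {
  per_file <- lapply(files, function(f) {
    stats <- as.data.frame(stat_fun(f))
    cbind(file = basename(f), stats)  # tag rows with their source file
  })
  do.call(rbind, per_file)
}

# Example use (assumes the directory from the question):
# files <- list.files("~/Projects/VCF_output", pattern = "\\.recode\\.vcf$",
#                     full.names = TRUE)
# all_stats <- collect_stats(files)
```

Passing the statistic as a function keeps the loop independent of dartR, so other per-file summaries could be swapped in as long as they return the same columns for every file.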

ADD REPLY • link 5 months ago by sousapaulo16 ▴ 20