Hello!
I have 90 samples as VCF files; together they are a few terabytes in size. I want to create a single multi-sample VCF file for downstream analysis. I am trying to use GenomicsDBImport for this, but it takes far too long (the cluster we run our analyses on allows a maximum runtime of 7 days, which is apparently nowhere near enough).
Our reference genome has 349 contigs (not human), and when running GenomicsDBImport I specify intervals corresponding to all chromosomes, all bases in every chromosome.
I set both thread options to 20, which is the maximum on our cluster.
After seven days, the database is around 15% finished.
Are there options other than specifying a smaller set of intervals? We have no idea whatsoever what intervals to keep or not, so I'd rather not mess with those unless it's the only way...
Big thanks in advance!
Run GenomicsDBImport in parallel for each chromosome...
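A minimal sketch of what "one import per chromosome" could look like, assuming the samples are listed in a tab-separated sample map and each contig gets its own workspace. The helper name `import_commands`, the contig names, the file `samples.map`, and the `workspaces` directory are made up for illustration; the script only prints the commands, so each one can be submitted as a separate cluster job.

```python
from pathlib import Path


def import_commands(contigs, sample_map, workspace_dir):
    """Build one GenomicsDBImport command per contig.

    Every path and contig name passed in is a placeholder chosen by the
    caller; nothing here is taken from the original poster's setup.
    """
    commands = []
    for contig in contigs:
        # One workspace per contig, so the jobs never write to the same database.
        workspace = Path(workspace_dir) / f"{contig}_db"
        commands.append([
            "gatk", "GenomicsDBImport",
            "--sample-name-map", str(sample_map),
            "--genomicsdb-workspace-path", str(workspace),
            "-L", contig,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical contig names; in practice they would come from the
    # reference's .fai or .dict file.
    for cmd in import_commands(["chr1", "chr2"], "samples.map", "workspaces"):
        print(" ".join(cmd))
```

With 349 contigs this gives 349 independent jobs, each covering a fraction of the genome, so no single job has to finish the whole import within the runtime limit.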
I could do that of course, but then the resulting big GVCF to rule them all will also be split on chromosome, won't it?
Yes, that's why you could then use gatk GatherVcfs ...
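A rough sketch of the gather step, under the assumption that each per-chromosome workspace is first genotyped with GenotypeGVCFs (GatherVcfs takes VCF files as input, not GenomicsDB workspaces). The helper names, `ref.fa`, the `workspaces` and `vcfs` directories, and the contig names are hypothetical. GatherVcfs expects its inputs in the same order as the reference's contigs, so the contig list should be kept in that order.

```python
def genotype_commands(contigs, reference, workspace_dir, out_dir):
    """Build one GenotypeGVCFs command per per-contig workspace."""
    return [
        [
            "gatk", "GenotypeGVCFs",
            "-R", reference,
            # gendb:// tells GATK to read from a GenomicsDB workspace.
            "-V", f"gendb://{workspace_dir}/{contig}_db",
            "-O", f"{out_dir}/{contig}.vcf.gz",
        ]
        for contig in contigs
    ]


def gather_command(contigs, out_dir, merged_vcf):
    """Build a single GatherVcfs command over the per-contig VCFs.

    The -I arguments follow the order of `contigs`, which is assumed to
    match the contig order of the reference.
    """
    cmd = ["gatk", "GatherVcfs"]
    for contig in contigs:
        cmd += ["-I", f"{out_dir}/{contig}.vcf.gz"]
    cmd += ["-O", merged_vcf]
    return cmd


if __name__ == "__main__":
    contigs = ["chr1", "chr2"]  # hypothetical names, in reference order
    for cmd in genotype_commands(contigs, "ref.fa", "workspaces", "vcfs"):
        print(" ".join(cmd))
    print(" ".join(gather_command(contigs, "vcfs", "all_samples.vcf.gz")))
```

The genotyping commands can run in parallel like the imports; only the final gather needs all of them to have finished.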
Ah. Seems like it was built for exactly this reason, doesn't it? ;-] Big thanks Pierre!