GATK's GenomicsDBImport takes forever...
2.7 years ago

Hello!

I have 90 samples as VCF files, which together take up a few terabytes. I want to create a single multi-sample VCF file for downstream analysis. I am trying to use GenomicsDBImport for this, but it takes too long: the cluster where we run our analyses allows a maximum runtime of 7 days, which is apparently not nearly enough.

Our reference genome has 349 contigs (not human), and when running GenomicsDBImport I specify intervals corresponding to all chromosomes, all bases in every chromosome.

I set both thread options to 20, which is the maximum on our cluster.

After seven days, the database is around 15% finished.

Are there options other than specifying a smaller set of intervals? We have no idea whatsoever what intervals to keep or not, so I'd rather not mess with those unless it's the only way...

Big thanks in advance!

Variant GATK Calling • 1.8k views

"... when running GenomicsDBImport I specify intervals corresponding to all chromosomes, all bases in every chromosome."

Run GenomicsDBImport in parallel, one job per chromosome...
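
A minimal sketch of how the per-chromosome jobs could be generated, assuming the contig names are read from a samtools-style `.fai` index and the input files are listed in a sample-name map. The file names (`samples.map`, `dbs/`) and the helper functions are hypothetical; the GATK options used are `--sample-name-map`, `--genomicsdb-workspace-path` and `-L`. Each printed command would be submitted as its own cluster job.

```python
import shlex


def read_contigs(fai_text):
    """Return contig names (first column) from the text of a .fai index."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def import_command(contig, sample_map="samples.map", workdir="dbs"):
    """Build one GenomicsDBImport call restricted to a single contig.

    sample_map and workdir are placeholder names, not taken from the thread.
    """
    return [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,
        "--genomicsdb-workspace-path", f"{workdir}/{contig}",
        "-L", contig,
    ]


# Toy two-contig index; a real run would read the reference's .fai file.
example_fai = "contig_1\t150000\t10\t60\t61\ncontig_2\t90000\t152520\t60\t61\n"

for contig in read_contigs(example_fai):
    # One independent job per contig, each writing its own workspace.
    print(shlex.join(import_command(contig)))
```

With 349 contigs this yields 349 small, independent jobs, so each one has a much better chance of finishing within the cluster's runtime limit.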


I could do that of course, but then the resulting big GVCF to rule them all will also be split by chromosome, won't it?


Yes, that's why you could then use gatk GatherVcfs ...
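
A rough sketch of that last step, under a few assumptions: GatherVcfs concatenates VCF files, so each per-chromosome workspace is first genotyped into its own VCF (here with GenotypeGVCFs reading a `gendb://` path), and the inputs are passed to GatherVcfs in the same order as the contigs in the reference. The paths (`ref.fasta`, `dbs/`, `vcfs/`, `cohort.vcf.gz`) and the helper functions are hypothetical.

```python
import shlex


def genotype_command(contig, reference="ref.fasta", workdir="dbs", outdir="vcfs"):
    """Build a GenotypeGVCFs call that reads one per-contig GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workdir}/{contig}",
        "-O", f"{outdir}/{contig}.vcf.gz",
    ]


def gather_command(contigs, outdir="vcfs", output="cohort.vcf.gz"):
    """Build a single GatherVcfs call; contigs must be in reference order."""
    cmd = ["gatk", "GatherVcfs"]
    for contig in contigs:
        cmd += ["-I", f"{outdir}/{contig}.vcf.gz"]
    return cmd + ["-O", output]


contigs = ["contig_1", "contig_2"]  # placeholder names, in reference order

for contig in contigs:
    print(shlex.join(genotype_command(contig)))
print(shlex.join(gather_command(contigs)))
```

The genotyping jobs can again run in parallel; only the final gather has to wait for all of them.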


Ah. Seems like it was built for exactly this reason, doesn't it? ;-] Big thanks Pierre!
