GATK's GenomicsDBImport takes forever...

2.7 years ago

Joel Wallenius ▴ 210

Hello!

I have 90 samples as VCF files, totalling a few terabytes. I want to create a single multi-sample VCF file for downstream analysis. I am trying to use GenomicsDBImport for this, but it takes too long: the cluster where we run our analyses allows a maximum runtime of 7 days, which is apparently not nearly enough.

Our reference genome has 349 contigs (not human), and when running GenomicsDBImport I specify intervals corresponding to all chromosomes, all bases in every chromosome.

I set both thread options to 20, which is the maximum on our cluster.

After seven days, the database is around 15% finished.

Are there options other than specifying a smaller set of intervals? We have no idea whatsoever what intervals to keep or not, so I'd rather not mess with those unless it's the only way...

Big thanks in advance!

Variant GATK Calling • 1.8k views

"[...] GenomicsDBImport I specify intervals corresponding to all chromosomes, all bases in every chromosome."

Run GenomicsDBImport in parallel, one job per chromosome.
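A minimal sketch of what the per-chromosome split could look like, assuming the samples are listed in a sample map file and each contig gets its own workspace directory. The helper name `build_import_commands`, the file names, and the contig names are hypothetical; `--sample-name-map`, `--genomicsdb-workspace-path` and `-L` are existing GenomicsDBImport options. The script only prints the commands, so each one can be submitted as a separate cluster job.

```python
import shlex


def build_import_commands(contigs, sample_map, workspace_dir):
    """Return one GenomicsDBImport command (as an argument list) per contig."""
    commands = []
    for contig in contigs:
        commands.append([
            "gatk", "GenomicsDBImport",
            "--sample-name-map", sample_map,
            # One workspace per contig, so the jobs do not share any state.
            # Assumes contig names are safe to use as directory names.
            "--genomicsdb-workspace-path", f"{workspace_dir}/{contig}",
            # Restrict this job to a single contig.
            "-L", contig,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical contig names; in practice read them from the reference .fai or .dict.
    contigs = ["contig_1", "contig_2", "contig_3"]
    for cmd in build_import_commands(contigs, "samples.map", "genomicsdb"):
        # Print a shell-safe command line for each job.
        print(shlex.join(cmd))
```

With 349 contigs this gives 349 independent jobs, each of which only has to fit within the 7-day limit on its own.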

ADD REPLY • link 2.7 years ago by Pierre Lindenbaum 162k

I could do that of course, but then the resulting big GVCF to rule them all will also be split on chromosome, won't it?

ADD REPLY • link 2.7 years ago by Joel Wallenius ▴ 210

Yes, that's why you could then use gatk GatherVcfs ...
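As a rough sketch, the gather step could be assembled like this, assuming one VCF per contig has already been produced from each workspace and named after its contig. The helper name `build_gather_command` and the file layout are hypothetical; `-I` and `-O` are the input and output options of GatherVcfs, which expects its inputs in the same order as the contigs in the reference.

```python
import shlex


def build_gather_command(contigs_in_reference_order, vcf_dir, output_vcf):
    """Build a single GatherVcfs command that concatenates per-contig VCFs."""
    cmd = ["gatk", "GatherVcfs"]
    for contig in contigs_in_reference_order:
        # Inputs are added in reference order, as GatherVcfs requires.
        cmd += ["-I", f"{vcf_dir}/{contig}.vcf.gz"]
    cmd += ["-O", output_vcf]
    return cmd


if __name__ == "__main__":
    # Hypothetical contig names, listed in the order they appear in the reference.
    contigs = ["contig_1", "contig_2", "contig_3"]
    print(shlex.join(build_gather_command(contigs, "per_contig_vcfs", "all_samples.vcf.gz")))
```

Taking the contig order from the reference index rather than from a directory listing avoids ordering problems when contig names do not sort the same way as the reference.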

ADD REPLY • link 2.7 years ago by Pierre Lindenbaum 162k

Ah. Seems like it was built for exactly this reason, doesn't it? ;-] Big thanks Pierre!

ADD REPLY • link 2.7 years ago by Joel Wallenius ▴ 210
