GATK GenomicsDBImport too slow

Hello,

I have 3264 g.VCFs and an interval list for the reference genome that contains 20000 contigs. The interval list looks like the following:

utg19_pilon_pilon:1-42237
utg22_pilon_pilon:1-49947
utg24_pilon_pilon:1-61707
utg30_pilon_pilon:1-459006
utg38_pilon_pilon:1-129173
utg40_pilon_pilon:1-101813
utg58_pilon_pilon:1-143918
utg93_pilon_pilon:1-186249
utg100_pilon_pilon:1-87875
utg104_pilon_pilon:1-49315

I am running the GATK GenomicsDBImport command as follows:

gatk --java-options "-Xmx220G" GenomicsDBImport \
    --genomicsdb-shared-posixfs-optimizations true \
    --genomicsdb-workspace-path PlantDB \
    --intervals intervals.list \
    --sample-name-map samples.sample_map \
    --batch-size 100 \
    --bypass-feature-reader \
    --merge-input-intervals \
    --overwrite-existing-genomicsdb-workspace true \
    --reader-threads 10

But the process is quite slow. How can I modify the command to speed it up?


> How can I modify the code to accelerate the process?

Parallelize: run one import per contig, then merge downstream.
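
For illustration, a minimal sketch with GNU parallel (assumptions: GNU parallel is available, the workspace names are made up, and the per-job heap is lowered because ten imports run at once; the other flags are taken from your command):

import_one() {
    interval=$1
    # one fresh workspace per interval so concurrent jobs never collide
    ws="gdb_$(printf '%s' "$interval" | tr ':-' '__')"
    gatk --java-options "-Xmx20G" GenomicsDBImport \
        --genomicsdb-shared-posixfs-optimizations true \
        --genomicsdb-workspace-path "$ws" \
        --intervals "$interval" \
        --sample-name-map samples.sample_map \
        --batch-size 100 \
        --bypass-feature-reader
}
export -f import_one
parallel -j 10 import_one :::: intervals.list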


So, I should loop over the interval list?


Use a workflow manager like Snakemake, Nextflow, etc.

In Nextflow, that would be something like:

workflow {
    // turn each BED line (0-based start) into a 1-based GATK interval string
    intervals_ch = Channel.fromPath(params.bed)
        .splitCsv(header: false, sep: '\t')
        .map { T -> T[0] + ":" + ((T[1] as int) + 1) + "-" + T[2] }
    CALL_GATK(intervals_ch)
}

process CALL_GATK {
    input:
        val(interval)
    script:
        // one workspace per interval, so parallel tasks never write to the same path
        def ws = interval.replaceAll(/[:-]/, '_')
        """
        gatk --java-options "-Xmx220G" GenomicsDBImport \\
            --genomicsdb-shared-posixfs-optimizations true \\
            --genomicsdb-workspace-path ${ws}.gdb \\
            --intervals ${interval} \\
            --sample-name-map ${params.sample_map} \\
            --batch-size 100 \\
            --bypass-feature-reader \\
            --reader-threads 10

        # (...) GenotypeGVCFs ....
        """
}
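
Assuming the snippet above is saved as main.nf and params.bed points at a 3-column BED file (whose 0-based starts the map{} step converts to 1-based GATK intervals), launching it would look something like:

# file names are examples, substitute your own
nextflow run main.nf --bed intervals.bed --sample_map samples.sample_map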

Ok, thank you. I understand.


If you have these in S3 you can use TileDB Cloud to run a distributed ingestion. You can't specify intervals at the ingestion stage, only chromosomes/contigs; from there you could export your intervals to a VCF, but more likely you would just do your analysis on the TileDB-VCF array. If these are large gVCFs (~200M variants per sample), the ingestion would take about 48 hours.

from tiledb.cloud import vcf  # TileDB Cloud Python client

# Location to create the TileDB-VCF dataset
dataset_uri = "s3://anikcropscience/pilonarray"

# Organization specific credentials
namespace="anik"
access_credentials_name="anik"

dag, samples = vcf.ingest(
    dataset_uri,
    namespace=namespace,
    access_credentials_name=access_credentials_name,

    # URI and pattern used to search for VCF files
    search_uri="s3://anikcropscience/sourcevcfs/",
    pattern="*g.vcf.gz",

    # Optional: skip index creation if indexes are already available
    create_index=False,

    # Optional: Recommended batch size to improve read performance
    vcf_batch_size=100,

    # Optional: Limit the number of VCF files ingested
    #max_files=100,

    # Optional: uncomment to declare the resources used for each ingestion node
    # ingest_resources={"cpu": "8", "memory": "16Gi"},

    register_name="pilon",  # change this to the name you want the array to have in TileDB Cloud

    # Optional: Limit the contigs ingested
    # contigs=["chr21"],

    consolidate_stats=False,
)