GATK GenomicsDBImport too slow

Hello,

I have 3264 g.VCFs and an interval list for the reference genome that contains 20000 contigs. The interval list looks like the following:

utg19_pilon_pilon:1-42237
utg22_pilon_pilon:1-49947
utg24_pilon_pilon:1-61707
utg30_pilon_pilon:1-459006
utg38_pilon_pilon:1-129173
utg40_pilon_pilon:1-101813
utg58_pilon_pilon:1-143918
utg93_pilon_pilon:1-186249
utg100_pilon_pilon:1-87875
utg104_pilon_pilon:1-49315

I am running the GATK GenomicsDBImport command as follows:

gatk --java-options "-Xmx220G" GenomicsDBImport \
    --genomicsdb-shared-posixfs-optimizations true \
    --genomicsdb-workspace-path PlantDB \
    --intervals intervals.list \
    --sample-name-map samples.sample_map \
    --batch-size 100 \
    --bypass-feature-reader \
    --merge-input-intervals \
    --overwrite-existing-genomicsdb-workspace true \
    --reader-threads 10

But the process is quite slow. How can I modify the command to speed it up?


> How can I modify the code to accelerate the process?

Parallelize: run one import per contig, then merge downstream.
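
For illustration, a minimal sketch with GNU parallel (assumptions: GNU parallel is available, the workspace names are made up, and the per-job heap is lowered because ten imports run at once; the other flags are taken from your command):

import_one() {
    interval=$1
    # one fresh workspace per interval so concurrent jobs never collide
    ws="gdb_$(printf '%s' "$interval" | tr ':-' '__')"
    gatk --java-options "-Xmx20G" GenomicsDBImport \
        --genomicsdb-shared-posixfs-optimizations true \
        --genomicsdb-workspace-path "$ws" \
        --intervals "$interval" \
        --sample-name-map samples.sample_map \
        --batch-size 100 \
        --bypass-feature-reader
}
export -f import_one
parallel -j 10 import_one :::: intervals.list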


So, I should loop over the interval list?


Use a workflow manager like Snakemake, Nextflow, etc.

In Nextflow, that would be something like:

workflow {
    // turn each BED line (0-based start) into a 1-based GATK interval string
    intervals_ch = Channel.fromPath(params.bed)
        .splitCsv(header: false, sep: '\t')
        .map { T -> T[0] + ":" + ((T[1] as int) + 1) + "-" + T[2] }
    CALL_GATK(intervals_ch)
}

process CALL_GATK {
    input:
        val(interval)
    script:
        // one workspace per interval, so parallel tasks never write to the same path
        def ws = interval.replaceAll(/[:-]/, '_')
        """
        gatk --java-options "-Xmx220G" GenomicsDBImport \\
            --genomicsdb-shared-posixfs-optimizations true \\
            --genomicsdb-workspace-path ${ws}.gdb \\
            --intervals ${interval} \\
            --sample-name-map ${params.sample_map} \\
            --batch-size 100 \\
            --bypass-feature-reader \\
            --reader-threads 10

        # (...) GenotypeGVCFs ....
        """
}
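
Assuming the snippet above is saved as main.nf and params.bed points at a 3-column BED file (whose 0-based starts the map{} step converts to 1-based GATK intervals), launching it would look something like:

# file names are examples, substitute your own
nextflow run main.nf --bed intervals.bed --sample_map samples.sample_map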

Ok, thank you. I understand.


If you have these in S3 you can use TileDB Cloud to run a distributed ingestion. You can't specify intervals at the ingestion stage, only chromosomes/contigs; from there you could export your intervals to a VCF, but more likely you would just do your analysis on the TileDB-VCF array. If these are large gVCFs (~200M variants per sample), the ingestion would take about 48 hours.

from tiledb.cloud import vcf  # TileDB Cloud Python client

# Location to create the TileDB-VCF dataset
dataset_uri = "s3://anikcropscience/pilonarray"

# Organization specific credentials
namespace="anik"
access_credentials_name="anik"

dag, samples = vcf.ingest(
    dataset_uri,
    namespace=namespace,
    access_credentials_name=access_credentials_name,

    # URI and pattern used to search for VCF files
    search_uri="s3://anikcropscience/sourcevcfs/",
    pattern="*g.vcf.gz",

    # Optional: skip index creation if indexes are already available
    create_index=False,

    # Optional: Recommended batch size to improve read performance
    vcf_batch_size=100,

    # Optional: Limit the number of VCF files ingested
    #max_files=100,

    # Optional: uncomment to declare the resources used for each ingestion node
    # ingest_resources={"cpu": "8", "memory": "16Gi"},

    register_name="pilon",  # change this to the name you want the array to have in TileDB Cloud

    # Optional: Limit the contigs ingested
    # contigs=["chr21"],

    consolidate_stats=False,
)