Question: GATK GenomicsDBImport - use list as input
11 days ago by
gabi30 wrote:


I am using GenomicsDBImport, but my input is a file containing a list of files, and I don't know how to pass that list to GATK with the -V option. I've looked for this everywhere, but I can't find the information in the GATK manual.


java -jar gatk-package- GenomicsDBImport \
        -R HG38/hs38.fa \
        --genomicsdb-workspace-path /MY_DATABASE/$newdir \
        -V $gvcf_list \
        -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list

Thank you in advance

written 11 days ago by gabi30

Hi Bari.ballew,

We could also do this as:

for i in *.vcf.gz; do echo `bcftools query -l $i`;echo $i;done | paste - -

Either of the scripts gives me output as below:

sample_11       sample_11_HCcalls.g.vcf.gz
sample_23      sample_23_HCcalls.g.vcf.gz
sample_9 sample_9_HCcalls.g.vcf.gz
sample_45       sample_45_HCcalls.g.vcf.gz

It is probably my OCD :) and pardon my basic question, but what is bothering me is this: why is the tab spaced so wide in all samples other than sample_9? I checked by opening the file in vi, and they are all tabs and no spaces; only sample_9 looks off! Even with a single-digit sample number, I don't think it should look that way, should it? I am worried that if I don't fix this and feed the file into GenomicsDBImport, I may eventually get an error or a wacky DB, and it would end up being more work. It is just bothering me!! Any input will help!! Thank you

written 1 day ago by geneart$$20

Oops! I placed this in the wrong place; it should be after Bari.ballew's comment. My bad! But you get it... :)

written 1 day ago by geneart$$20
4 days ago by
bari.ballew230 wrote:

Hi there! First, your sample file needs to be formatted like this: sample_name<tab>path/to/sample.vcf

If you have multi-sample VCFs (or gVCFs), you can generate a list of samples in each VCF using bcftools like this:

n=$(bcftools query -l <your.vcf>)
printf '%s\t%s\n' "${n}" "<your.vcf>" >

For multiple VCFs, just cat together your sample map files.
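As a rough sketch of the steps above, the per-file bcftools call and the concatenation can be folded into one loop. This assumes each gVCF holds exactly one sample; `build_sample_map` is a hypothetical helper name, not part of GATK or bcftools, and the file names in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: print one "sample_name<TAB>path" line per single-sample gVCF.
# build_sample_map is a hypothetical helper, not a GATK or bcftools command.
build_sample_map() {
    for vcf in "$@"; do
        # bcftools query -l prints one sample name per line
        sample=$(bcftools query -l "$vcf") || return 1
        # Count non-empty lines to detect zero or multiple samples
        n=$(printf '%s\n' "$sample" | grep -c . || true)
        if [ "$n" -ne 1 ]; then
            echo "skipping $vcf: expected exactly one sample, found $n" >&2
            continue
        fi
        # printf writes a real tab, unlike plain echo with "\t" in bash
        printf '%s\t%s\n' "$sample" "$vcf"
    done
}

# Usage (placeholder file names):
# build_sample_map *.g.vcf.gz > sample_map.txt
```

Using printf rather than echo avoids depending on whether the shell's echo interprets `\t`.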

Finally, you can run GenomicsDBImport like this (customize the options and resource allocation as needed):

gatk --java-options "-Xmx20G" GenomicsDBImport \
    --sample-name-map <> \
    --genomicsdb-workspace-path <output/path/for/database> \
    -L <interval>
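Before running the import, it may be worth checking the sample map mechanically rather than by eye, since the on-screen width of a tab depends on the viewer and on the length of the text before it. The sketch below counts tab-separated fields per line; `check_sample_map` is a hypothetical helper name, and flagging duplicate sample names is an assumption that each sample should appear only once in the map.

```shell
#!/usr/bin/env bash
# Sketch: verify that every line is "sample_name<TAB>path" and that
# sample names are not repeated. check_sample_map is a hypothetical helper.
check_sample_map() {
    awk -F'\t' '
        BEGIN      { bad = 0 }
        NF != 2    { printf "line %d: expected 2 tab-separated fields, found %d\n", NR, NF; bad = 1 }
        $1 in seen { printf "line %d: duplicate sample name %s\n", NR, $1; bad = 1 }
                   { seen[$1] = 1 }
        END        { exit bad }
    ' "$1"
}

# Usage (placeholder file name):
# check_sample_map sample_map.txt && echo "sample map looks well-formed"
```

The function exits non-zero if any line fails, so it can gate the GenomicsDBImport call in a script.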

Hope that helps!

written 4 days ago by bari.ballew230
