Question

GATK GenomicsDBImport - use list as input

3.7 years ago

gabi ▴ 30

Hello,

I am using GenomicsDBImport, but my input is a file that contains a list of gVCF files, and I don't know how to pass it to GATK with the -V option. I've looked everywhere, but I can't find this information in the GATK manual.

gvcf_list=gvcf.listaa

java -jar gatk-package-4.1.4.0-local.jar GenomicsDBImport \
        -R HG38/hs38.fa \
        --genomicsdb-workspace-path /MY_DATABASE/$newdir \
        -V $gvcf_list \
        -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list

Thank you in advance

GATK germline VQSR GenomicsDBImport • 4.3k views

ADD COMMENT • link updated 3.7 years ago by geneart$$ ▴ 50 • written 3.7 years ago by gabi ▴ 30

Hi Bari.ballew,

We could also do this as:

for i in *.vcf.gz; do echo `bcftools query -l $i`;echo $i;done | paste - -

Either of the scripts gives me output as below:

sample_11       sample_11_HCcalls.g.vcf.gz
sample_23      sample_23_HCcalls.g.vcf.gz
sample_9 sample_9_HCcalls.g.vcf.gz
sample_45       sample_45_HCcalls.g.vcf.gz

It is probably my OCD :) and pardon the basic question, but what is bothering me is this: why is the tab spaced so wide for all samples other than sample_9? I checked by opening the file in vi, and they are all tabs with no spaces; only sample_9 looks off! Even with a single-digit number I don't think it should look that way, should it? I am worried that if I don't fix this and feed the file into GenomicsDBImport, I may eventually get an error or a wacky DB, and it would end up being more work. It is just bothering me!! Any input will help!! Thank you

ADD REPLY • link 3.7 years ago by geneart$$ ▴ 50

Oops! I placed this in the wrong place; it should be after bari.ballew's comment. My bad! But you get it... :)

ADD REPLY • link 3.7 years ago by geneart$$ ▴ 50

Sure! As the Perl programmers say, TIMTOWTDI!

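It's because of the different number of characters. sample_9 has one fewer character than the other sample names, so the tab after it lands on the tab stops differently. That only affects how the line is displayed; since you confirmed in vi that every separator is a tab, the file itself is fine.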
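If you want to convince yourself, here is a minimal sketch that checks the separators directly rather than by eye. The two-line demo file and the variable names are made up for illustration; point the awk line at your real map instead.

```shell
# Hypothetical demo file; the names mirror the output shown in this thread.
map=$(mktemp)
printf '%s\t%s\n' \
    sample_11 sample_11_HCcalls.g.vcf.gz \
    sample_9  sample_9_HCcalls.g.vcf.gz > "$map"

# Count lines that do NOT have exactly two non-empty, tab-separated fields.
bad=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { n++ } END { print n + 0 }' "$map")
echo "malformed lines: $bad"

# Re-render with different tab stops: the gaps change, the file does not.
expand -t 12 "$map"
```

A count of zero means every line is a clean name/path pair, whatever the spacing looks like on screen.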

ADD REPLY • link 3.7 years ago by bari.ballew ▴ 460

score 1 · Answer 1 · 2020-08-08

Hi there! First, your sample file needs to be formatted like this: sample_name<tab>path/to/sample.vcf

If you have multi-sample VCFs (or gVCFs), you can generate a list of samples in each VCF using bcftools like this:

n=$(bcftools query -l <your.vcf>)
# printf expands \t reliably; plain echo in bash prints it literally
printf '%s\t%s\n' "${n}" "<your.vcf>" > sample.map

For multiple VCFs, just cat together your sample map files.
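As a rough sketch of doing that in one pass, the function below writes one line per file. It assumes bcftools is on your PATH and that each gVCF holds exactly one sample; build_sample_map is a name invented for this example, not a GATK or bcftools command.

```shell
# Sketch: write one "sample<TAB>path" line per single-sample gVCF.
# Assumes bcftools is on PATH and each file holds exactly one sample.
build_sample_map() {
    out=$1; shift
    : > "$out"                      # create or truncate the map file
    for vcf in "$@"; do
        name=$(bcftools query -l "$vcf") || return 1
        # Skip files that report zero or several sample names.
        if [ "$(printf '%s\n' "$name" | grep -c .)" -ne 1 ]; then
            echo "skipping $vcf: expected exactly one sample" >&2
            continue
        fi
        printf '%s\t%s\n' "$name" "$vcf" >> "$out"
    done
}

# Example call (the file pattern is illustrative):
# build_sample_map sample.map *_HCcalls.g.vcf.gz
```

If your paths already sit in a plain list file like the one in the question, something like `build_sample_map sample.map $(cat gvcf.listaa)` should work as long as the paths contain no spaces.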

Finally, you can run GenomicsDBImport like this (customize the options and resource allocation as needed):

gatk --java-options "-Xmx20G" GenomicsDBImport \
    --sample-name-map <sample.map> \
    --genomicsdb-workspace-path <output/path/for/database> \
    -L <interval> \
    --tmp-dir=<temp_directory>
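Before launching a long import, it can be worth sanity-checking the map first. The helper below is only a sketch (check_sample_map is a hypothetical name): it flags repeated sample names and paths that are not readable. I am assuming duplicate names would cause trouble in the import; either way they are usually a sign of a mix-up.

```shell
# Sketch: pre-flight check for a "sample<TAB>path" map file.
# Returns non-zero if a sample name repeats or a listed file is unreadable.
check_sample_map() {
    map=$1
    rc=0
    # Sample names (column 1) that appear more than once.
    dups=$(cut -f1 "$map" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate sample names: $dups" >&2
        rc=1
    fi
    # Paths (column 2) that are not readable files.
    while IFS="$(printf '\t')" read -r name path; do
        [ -r "$path" ] || { echo "missing file for $name: $path" >&2; rc=1; }
    done < "$map"
    return "$rc"
}

# Example: check_sample_map sample.map && echo "map looks OK"
```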

Hope that helps!