GATK GenomicsDBImport - use list as input
10 months ago
gabi ▴ 30

Hello,

I am using GenomicsDBImport, but my input is a file containing a list of gVCF files, and I don't know how to pass it to GATK with the -V option. I've looked for this everywhere, but I can't find it in the GATK manual.

gvcf_list=gvcf.listaa

java -jar gatk-package-4.1.4.0-local.jar GenomicsDBImport \
-R HG38/hs38.fa \
--genomicsdb-workspace-path /MY_DATABASE/$newdir \
-V $gvcf_list


GATK germline VQSR GenomicsDBImport • 951 views

Hi Bari.ballew,

We could also do this as:

for i in *.vcf.gz; do bcftools query -l "$i"; echo "$i"; done | paste - -


Either of the scripts gives me output as below:

sample_11       sample_11_HCcalls.g.vcf.gz
sample_23      sample_23_HCcalls.g.vcf.gz
sample_9 sample_9_HCcalls.g.vcf.gz
sample_45       sample_45_HCcalls.g.vcf.gz


It is probably my OCD :) and pardon the basic question, but what is bothering me is this: why is the tab spaced so wide for every sample except sample_9? I opened the file in vi and the separators are all tabs, no spaces, yet sample_9 looks off! Even with a single-digit sample number I don't think it should look that way, should it? I am worried that if I don't fix this and feed the file into GenomicsDBImport, I may end up with an error or a wacky DB, which would mean more work. It is just bothering me!! Any input will help!! Thank you


Oops! I placed this in the wrong spot. It should be after bari.ballew's comment! My bad, but you get it... :)


Sure! As the Perl programmers say, TIMTOWTDI!

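It's because of the different number of characters. A tab doesn't add a fixed width; it jumps to the next tab stop. sample_9 has one fewer character than the other samples, so it hits the default tab spacing differently. As long as the separator really is a single tab, GenomicsDBImport should read the file fine.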
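
If you want to convince yourself that the file really is tab-delimited regardless of how it looks on screen, a check along these lines might help. It's only a sketch: check_sample_map is a name I made up, and it assumes the map has exactly two tab-separated columns per line, as described in the answer below.

```shell
# Hypothetical helper (not part of GATK or bcftools): succeeds only if
# every line of the given file has exactly two tab-separated fields.
check_sample_map() {
    awk -F'\t' '
        NF != 2 { printf "line %d: expected 2 tab-separated fields, found %d\n", NR, NF; bad = 1 }
        END     { exit (bad ? 1 : 0) }
    ' "$1"
}
```

You would call it as `check_sample_map sample.map && echo "looks tab-delimited"`. A line that uses a space instead of a tab shows up as having one field and gets reported with its line number.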

10 months ago
bari.ballew ▴ 270

Hi there! First, your sample file needs to be formatted like this: sample_name<tab>path/to/sample.vcf

If you have multi-sample VCFs (or gVCFs), you can generate a list of samples in each VCF using bcftools like this:

n=$(bcftools query -l <your.vcf>); printf '%s\t%s\n' "${n}" "<your.vcf>" > sample.map


For multiple VCFs, just cat together your sample map files.
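
If you'd rather build the whole map in one go instead of concatenating separate files, something like the following might work. Treat it as a rough sketch: build_sample_map is a hypothetical helper name, and it assumes bcftools is on your PATH and that each gVCF holds a single sample, so you get one map line per file.

```shell
# Hypothetical helper: print one "sample<TAB>file" line for each sample
# found in each VCF/gVCF passed as an argument.
build_sample_map() {
    for vcf in "$@"; do
        # bcftools query -l prints one sample name per line
        bcftools query -l "$vcf" | while IFS= read -r sample; do
            printf '%s\t%s\n' "$sample" "$vcf"
        done
    done
}
```

Usage would be along the lines of `build_sample_map *.g.vcf.gz > sample.map`. Using printf rather than echo keeps the tab literal regardless of which shell you are in.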

Finally, you can run GenomicsDBImport like this (customize the options and resource allocation as needed):

gatk --java-options "-Xmx20G" GenomicsDBImport \
--sample-name-map <sample.map> \
--genomicsdb-workspace-path <output/path/for/database> \
-L <interval> \
--tmp-dir=<temp_directory>


Hope that helps!