Question: GATK GenomicsDBImport - use list as input
gabi wrote (5 months ago):

Hello,

I am using GenomicsDBImport, but my input is a file containing a list of gVCF files, and I don't know how to pass it to GATK with the -V option. I've been looking everywhere, but I can't find this information in the GATK manual.

gvcf_list=gvcf.listaa

java -jar gatk-package-4.1.4.0-local.jar GenomicsDBImport \
        -R HG38/hs38.fa \
        --genomicsdb-workspace-path /MY_DATABASE/$newdir \
        -V $gvcf_list \
        -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list

Thank you in advance


Hi Bari.ballew,

We could also do this as:

for i in *.vcf.gz; do echo `bcftools query -l $i`;echo $i;done | paste - -

Either of the scripts gives me output as below:

sample_11       sample_11_HCcalls.g.vcf.gz
sample_23      sample_23_HCcalls.g.vcf.gz
sample_9 sample_9_HCcalls.g.vcf.gz
sample_45       sample_45_HCcalls.g.vcf.gz

It is probably my OCD :) and pardon the basic question, but what bothers me is this: why is the tab so much wider for every sample other than sample_9? I checked by opening the file in vi, and they are all tabs with no spaces; only sample_9 is off! Even with a single-digit sample number I don't think it should look that way, should it? I am worried that if I don't fix this and feed the file into GenomicsDBImport, I may eventually get an error or a wacky DB, which would end up being more work. It is just bothering me! Any input will help. Thank you!

written 5 months ago by geneart

Oops! I placed this in the wrong place; it should come after Bari.ballew's comment. My bad, but you get it... :)

written 5 months ago by geneart

Sure! As the Perl programmers say, TIMTOWTDI!

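It's because of the different number of characters. A tab does not have a fixed width; it advances to the next tab stop, so the gap you see depends on how long the text before it is. sample_9 has one fewer character than the other samples, so it hits the default tab spacing differently. Since you confirmed in vi that the separators are all tabs, this is only a display difference.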
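
If you would rather check the file's structure than trust how it looks on screen, something like the following may help. It is only a minimal sketch: check_sample_map is a made-up helper name (not part of GATK or bcftools), and it assumes the map should have exactly two non-empty, tab-separated fields per line with no repeated sample names.

```shell
# check_sample_map is a hypothetical helper name, not part of GATK.
# It succeeds only if every line has exactly two non-empty tab-separated
# fields and no sample name (first field) appears more than once.
check_sample_map() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 tab-separated fields\n", NR
            bad = 1
        }
        seen[$1]++ {
            printf "line %d: duplicate sample name %s\n", NR, $1
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage (sample.map is an assumed file name):
#   check_sample_map sample.map && echo "sample map looks OK"
```

A line that is separated by a space instead of a tab is reported, because awk then sees a single field on that line.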

written 5 months ago by bari.ballew
bari.ballew (USA/NIH) wrote (5 months ago):

Hi there! First, your sample file needs to be formatted like this: sample_name<tab>path/to/sample.vcf

If you have multi-sample VCFs (or gVCFs), you can generate a list of samples in each VCF using bcftools like this:

n=$(bcftools query -l <your.vcf>)
# printf is used because plain echo in bash does not turn \t into a tab
printf '%s\t%s\n' "${n}" "<your.vcf>" > sample.map

For multiple VCFs, just cat together your sample map files.
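
To build the whole map in one pass instead of concatenating per-file maps, a loop along these lines might work. Treat it as a sketch under stated assumptions: build_sample_map is a hypothetical helper name, bcftools is assumed to be on your PATH, and each input file is assumed to hold exactly one sample (files with zero or several samples are skipped with a warning).

```shell
# build_sample_map is a hypothetical helper name, not a GATK or bcftools command.
# Assumes bcftools is on PATH and each input file contains exactly one sample.
build_sample_map() {
    for vcf in "$@"; do
        # List the sample names stored in the VCF header, one per line.
        name=$(bcftools query -l "$vcf") || return 1
        count=$(printf '%s\n' "$name" | awk 'END { print NR }')
        # One map line describes one sample, so skip anything else.
        if [ -z "$name" ] || [ "$count" -ne 1 ]; then
            echo "skipping $vcf: expected exactly one sample" >&2
            continue
        fi
        # Emit sample_name<TAB>path, the format the sample map needs.
        printf '%s\t%s\n' "$name" "$vcf"
    done
}

# Usage (the glob and output name are assumptions):
#   build_sample_map *.g.vcf.gz > sample.map
```

Using printf with an explicit \t keeps the separator a real tab regardless of how the shell's echo behaves.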

Finally, you can run GenomicsDBImport like this (customize the options and resource allocation as needed):

gatk --java-options "-Xmx20G" GenomicsDBImport \
    --sample-name-map <sample.map> \
    --genomicsdb-workspace-path <output/path/for/database> \
    -L <interval> \
    --tmp-dir=<temp_directory>

Hope that helps!
