Question: GATK GenomicsDBImport - use list as input
gabi30 wrote:

Hello,

I am using GenomicsDBImport, but my input is a file that contains a list of gVCF files, and I don't know how to pass that list to GATK with the -V option. I've looked everywhere, but I can't find this information in the GATK manual.

gvcf_list=gvcf.listaa

java -jar gatk-package-4.1.4.0-local.jar GenomicsDBImport \
        -R HG38/hs38.fa \
        --genomicsdb-workspace-path /MY_DATABASE/$newdir \
        -V $gvcf_list \
        -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list

Thank you in advance

modified 1 day ago by geneart$$20 • written 11 days ago by gabi30

Hi Bari.ballew,

We could also do this as:

for i in *.vcf.gz; do echo `bcftools query -l $i`;echo $i;done | paste - -

Either of the scripts gives me output as below:

sample_11       sample_11_HCcalls.g.vcf.gz
sample_23      sample_23_HCcalls.g.vcf.gz
sample_9 sample_9_HCcalls.g.vcf.gz
sample_45       sample_45_HCcalls.g.vcf.gz

It is probably my OCD :) and pardon the basic question, but what bothers me is this: why is the tab spacing so wide for every sample except sample_9? I checked by opening the file in vi, and the separators are all tabs with no spaces; only sample_9 looks off. Even with a single-digit sample number I don't think it should look that way, should it? I am worried that if I don't fix this and feed the file into GenomicsDBImport, I may eventually get an error or a wacky DB, which would mean more work. It is just bothering me!! Any input will help!! Thank you

modified 1 day ago • written 1 day ago by geneart$$20

Oops! I placed this in the wrong spot; it should be after bari.ballew's comment. My bad, but you get the idea :)

written 1 day ago by geneart$$20
bari.ballew230 (USA/NIH) wrote:

Hi there! First, your sample file needs to be formatted like this: sample_name<tab>path/to/sample.vcf
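
A quick way to confirm that a map file follows this layout, whatever an editor's tab stops make it look like, is to count tab-separated fields on each line. This is only a sketch: check_sample_map is a hypothetical helper name, and the checks (exactly two non-empty fields, no leading or trailing spaces in the sample name) are assumptions about what a well-formed map looks like, not GATK's own validation.

```shell
#!/usr/bin/env bash
# check_sample_map (hypothetical helper): report lines of the file in $1 that
# do not look like "sample_name<TAB>path". Returns non-zero if any are found.
check_sample_map() {
    awk -F'\t' '
        # Each line must split into exactly two non-empty fields on tabs.
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        # Assumption: a sample name should not start or end with a space.
        $1 ~ /^ | $/ {
            printf "line %d: sample name has leading or trailing space\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling check_sample_map sample.map prints nothing and returns 0 when every line passes; a line separated by spaces instead of a tab is reported with its line number.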

If you have multi-sample VCFs (or gVCFs), you can generate a list of samples in each VCF using bcftools like this:

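# Assumes one sample per file, so n holds a single name
n=$(bcftools query -l <your.vcf>)
# printf expands \t reliably; plain echo in bash would print it literally
printf '%s\t%s\n' "${n}" "<your.vcf>" > sample.map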

For multiple VCFs, just cat together your sample map files.
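
If the starting point is a plain list of gVCF paths, one per line, as in the original question, the per-file step can be looped to produce the whole map in one pass. A minimal sketch, assuming each gVCF contains exactly one sample; make_sample_map is a hypothetical helper name, not part of GATK or bcftools.

```shell
#!/usr/bin/env bash
# make_sample_map (hypothetical helper): read gVCF paths, one per line, from
# the file in $1 and print "sample_name<TAB>path" lines to stdout.
make_sample_map() {
    local gvcf name
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r gvcf || [ -n "$gvcf" ]; do
        [ -n "$gvcf" ] || continue   # skip blank lines
        # Assumption: one sample per gVCF, so this yields a single name.
        # stdin is redirected so the command cannot consume the path list.
        name=$(bcftools query -l "$gvcf" < /dev/null) || return 1
        printf '%s\t%s\n' "$name" "$gvcf"
    done < "$1"
}
```

With the list file from the question, this would be called as make_sample_map gvcf.listaa > sample.map, and the resulting file passed to --sample-name-map instead of giving the list to -V.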

Finally, you can run GenomicsDBImport like this (customize the options and resource allocation as needed):

gatk --java-options "-Xmx20G" GenomicsDBImport \
    --sample-name-map <sample.map> \
    --genomicsdb-workspace-path <output/path/for/database> \
    -L <interval> \
    --tmp-dir=<temp_directory>

Hope that helps!

written 4 days ago by bari.ballew230