Question

weird characters in GATK vcf tables

4.2 years ago

ziv_attia • 0

I created a VCF table with GATK using HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs.

The output I get is very different from the VCF 4.2 format.

For example:

0/1:8,3:11:36:36,0,233 from vcftools

0|1:2,4:6:72:0|1:4938136_T_C:162,0,72:4938136 #from GATK
               ^_____________^        ^_____^

0|1:2,4:6:72:0:162,0,72 #what it should look like...

This format breaks the downstream pipeline I usually work with.

Any idea what it means and how to get rid of it?

Thanks!

genomics • 2.0k views

ADD COMMENT • link updated 4.1 years ago by Ram 45k • written 4.2 years ago by ziv_attia • 0

Please show us the exact GATK commands you used. This looks like a Find & Replace operation gone wrong.

ADD REPLY • link 4.2 years ago by Ram 45k

#this is the code for converting the BAM files to g.vcf

# read the BAM paths one per line; -r keeps backslashes literal
while IFS= read -r file; do
/home/pogoda/software/gatk-4.1.6.0/gatk --java-options "-Xmx24g" HaplotypeCaller \
    -R /home/pogoda/Sunflower_sorted_bams/Han412-HO.fasta \
    -I "${file}" \
    -O "/home/pogoda/GATK_microbiome_95_geno/${file}.g.vcf.gz" \
    -ERC GVCF \
    && rm "${file}"   # delete the BAM only if HaplotypeCaller succeeded
done < RG_bam_list.txt

#this is the code for creating the database

reference=/home/pogoda/Sunflower_sorted_bams/Han412-HO.fasta
int=chr.intervals
DIR=GDBI_96_chr_complete

/home/pogoda/software/gatk-4.1.6.0/gatk --java-options "-Xmx200g -Xms200g" GenomicsDBImport \
-R $reference \
-V B2-18DNA_0010-18_0955_RG.sorted.bam.g.vcf.gz \
-V E1-18DNA_0005-18_0950_RG.sorted.bam.g.vcf.gz \
-V G6-18DNA_0047-18_0930_RG.sorted.bam.g.vcf.gz \
-V F2-18DNA_0014-18_0959_RG.sorted.bam.g.vcf.gz \
-V C6-18DNA_0043-18_0926_RG.sorted.bam.g.vcf.gz \
--genomicsdb-workspace-path /data5/nectar/usftp21.novogene.com/raw_data/GATK_nectar/${DIR} \
--intervals ${int} \
--reader-threads 66

#this is the code for making the final VCF table

reference=/home/pogoda/Sunflower_sorted_bams/Han412-HO.fasta

/home/pogoda/software/gatk-4.1.6.0/gatk --java-options "-Xmx166g -Xms116g" CombineGVCFs \
-R $reference \
--variant B2-18DNA_0010-18_0955_RG.sorted.bam.g.vcf.gz \
--variant E1-18DNA_0005-18_0950_RG.sorted.bam.g.vcf.gz \
--variant G6-18DNA_0047-18_0930_RG.sorted.bam.g.vcf.gz \
--variant F2-18DNA_0014-18_0959_RG.sorted.bam.g.vcf.gz \
--variant C6-18DNA_0043-18_0926_RG.sorted.bam.g.vcf.gz \
-O CombineGVCFs.g.vcf.gz

Hope this info helps.

ADD REPLY • link updated 4.1 years ago by Ram 45k • written 4.2 years ago by ziv_attia • 0

Thank you. For the example entries you've shown in your question, can you also show us the FORMAT field from the 2 VCF files for those entries?

ADD REPLY • link 4.2 years ago by Ram 45k

GATK format field - GT:AD:DP:GQ:PGT:PID:PL:PS

vcftools format field - GT:DP:GL
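
Lining each value up with its FORMAT key shows what the extra pieces in the GATK entry are. The sketch below uses only the two strings quoted in this thread; `decode_sample` is a made-up helper name, not something from GATK or bcftools. As far as I know, PGT, PID and PS are the tags GATK writes for physically phased variants, which would also explain the `|` in the genotype.

```shell
# Pair each FORMAT key with the matching value of one sample column.
# decode_sample is a hypothetical helper name used only for this illustration.
decode_sample() {
    # $1 = FORMAT string, $2 = sample string (both colon-separated)
    awk -v fmt="$1" -v smp="$2" 'BEGIN {
        n = split(fmt, keys, ":")
        split(smp, vals, ":")
        for (i = 1; i <= n; i++) printf "%s=%s\n", keys[i], vals[i]
    }'
}

# FORMAT and sample strings copied from the posts above
decode_sample "GT:AD:DP:GQ:PGT:PID:PL:PS" "0|1:2,4:6:72:0|1:4938136_T_C:162,0,72:4938136"
```

Read this way, `0|1:4938136_T_C` is the PGT and PID pair and the trailing `4938136` is PS, so the entry is valid VCF and simply carries more fields than the vcftools one.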

ADD REPLY • link updated 4.2 years ago by Ram 45k • written 4.2 years ago by ziv_attia • 0

This is probably the reason. How do I reformat the VCF so that the FORMAT column contains only the GT:DP:GL fields?

ADD REPLY • link 4.2 years ago by ziv_attia • 0

I don't think GATK giving you more information is necessarily a "problem". You can always extract the info you need from what GATK gives you. You should be able to use bcftools annotate to keep/remove FORMAT fields. Extract a small subset of your GATK VCF file and try processing it with bcftools annotate.
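
A minimal sketch of what that could look like, assuming bcftools is installed and on the PATH. The helper name `strip_phasing_tags` and the file names in the example call are placeholders, not files from this thread; the tags removed are the three that differ from the entry the poster expected.

```shell
# Drop the phasing-related FORMAT tags from a VCF with bcftools annotate.
# strip_phasing_tags is a hypothetical helper; file names are placeholders.
strip_phasing_tags() {
    in_vcf="$1"
    out_vcf="$2"
    # -x (--remove) takes a comma-separated list of annotations to delete;
    # -Oz writes bgzip-compressed VCF to the file given with -o.
    bcftools annotate -x FORMAT/PGT,FORMAT/PID,FORMAT/PS -Oz -o "$out_vcf" "$in_vcf"
}

# Example call on a small test file (uncomment once such a file exists):
# strip_phasing_tags subset.vcf.gz subset.stripped.vcf.gz
# bcftools view -H subset.stripped.vcf.gz | head -n 3
```

To keep an explicit set of tags instead of naming the ones to drop, `-x` also accepts an inverted list such as `'^FORMAT/GT,FORMAT/DP'`. Note that GATK writes PL rather than GL, so a pipeline that strictly requires GL would need a separate conversion step.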

ADD REPLY • link 4.2 years ago by Ram 45k

Thanks a ton! I will go through it and see how it works.

ADD REPLY • link 4.2 years ago by ziv_attia • 0

I'm sorry, but what should we see? Why should any output from GATK be similar to the 'old' vcftools? What are the weird characters? What is the FORMAT column associated with both outputs?

ADD REPLY • link 4.2 years ago by Pierre Lindenbaum 166k