Question: How to perform local realignment around indels without a known vcf file of indels with GATK?
0
gravatar for antonioggsousa
5 months ago by
antonioggsousa90 wrote:

Hi,

I work last few years on bioinformatics, but my knowledge is basically limited to the processing of amplicon and metagenomics NGS data. Recently, I started to work with genome assemblies and call variants but I lack the knowledge and experience.

So, I have a couple of genomes from different strains. I aligned these against the reference genome available with bwa-mem, convert the SAM to BAM and sorted the BAM files for each of the genomes that I have. Then, I want to call variants for each of the strain's genomes and then combine them in order to find the unique variants to each strain. What I want is to have a list of chromosome positions of SNPs that is unique for the strains 1, other that is unique for strain 2 and so on.

With this in mind, from what I read, I decided to use GATK to call variants. So until know, I realize that I need to use Picard to fix read groups, index the bam files (all together with

java -jar picard.jar AddOrReplaceReadGroups I=input.bam O=output.bam CREATE_INDEX=true RGID=$ RGSM=$RGSM RGLB=$RGLB RGPL=$RGPL RGPU=$RGPU #(the read groups parameters represent bash variables)

Then, I index and create a dictionary for the reference genome. After this I would like to call the variants with 'HaplotypeCaller' and I read here - https://software.broadinstitute.org/gatk/blog?id=7847 - that local realignment around indels was removed from the 'HaplotypeCaller' workflow, but still it is recommend to perform this step before 'HaplotypeCaller'.

So, I tried to run the 'RealignerTargetCreator' and 'IndelRealigner', the problem is that I don't have a vcf file with known indels for these strains.

So, now I'm running 'HaplotypeCaller' for each strain's genome:

gatk HaplotypeCaller -R $DB -I output.bam -O output.g.vcf -ERC GVCF # output.bam from the above picard command

Then, my intention is to merge all the variants in order to quantify the unique SNPs for each strain:

gatk GenotypeGVCFs -R $DB --variant ${1}.g.vcf --variant ${2}.g.vcf --variant ${3}.g.vcf --variant ${4}.g.vcf --variant ${5}.g.vcf --variant ${6}.g.vcf --variant ${7}.g.vcf --variant ${8}.g.vcf --variant ${9}.g.vcf --variant ${10}.g.vcf -o ${OUT}/merged.vcf

So, my question is: can I run 'HaplotypeCaller' in order to get the chromosome positions of indels and use these as reference for 'RealignerTargetCreator' and 'IndelRealigner'? Or I don't need them when I'm using the 'HaplotypeCaller' and, then, 'GenotypeGVCFs' and I still can get reliable chromosome positions for unique SNPs for each strain?

Thanks in advance!

Sincerely,

António

snp next-gen • 246 views
ADD COMMENTlink modified 5 months ago by Amar640 • written 5 months ago by antonioggsousa90
4
gravatar for Amar
5 months ago by
Amar640
Amar640 wrote:

GATK introduced the variant caller HaplotypeCaller which handles the preprocessing like indelrealignment: https://software.broadinstitute.org/gatk/blog?id=7847

In other words, no you do not need to use those realignment tools in GATK4. The documentation reflects this:

Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.

P.S This is an overview of all steps required to run the GATK best practices variant calling pipeline: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145

ADD COMMENTlink modified 5 months ago • written 5 months ago by Amar640

Thank you @Marks!

I read that to, but I found at this link - https://software.broadinstitute.org/gatk/blog?id=7847 - that we should still do the local realignment, and that's why I asked.

António

ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa90

I agree it's confusing. That blog is very detailed but from my reading, the take home message is that HaplotypeCaller is much better at resolving indels than UnifiedGenotyper+IndelRealigner. In some cases though, such as low quality libraries, it might be beneficial to use indelrealigner alongside haplotypecaller.

Have a read of the blog again, it's very well written.

ADD REPLYlink written 5 months ago by Amar640
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2001 users visited in the last hour