GATK Haplotype Caller
0
0
Entering edit mode
5.8 years ago

Hi -

I am encountering errors when using GATK haplotypecaller with UCSC annotated VCF (bcftools), and BAM files with defined read groups. I also tried using GATK haplotype caller with a standard UCSC HG38 VCF and got the same error message which can be seen below.

My question is, what should I do to avoid this error message: no overlapping contigs found Here is my code:

Creation of contig header: awk '{printf("##contig=<id=%s,length=%d>\n",$1,$2);}' ucsc.hg38.fasta.fai > contigs_header_ucsc_hg38.txt

BCFtools code: margaret@SII-T7500-01:~/Data/UCSC/hg38$bcftools annotate -h contigs_header_ucsc_hg38.txt All_20151104.vcf.gz -O v -o /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf Warning: The index file is older than the data file: All_20151104.vcf.gz.tbi GATK haplotype caller output (error message): margaret@SII-T7500-01:~/Programs$ java -jar GenomeAnalysisTK.jar -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,908 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,911 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 INFO 11:43:39,911 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 11:43:39,911 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk INFO 11:43:39,912 HelpFormatter - [Wed Nov 02 11:43:39 MST 2016] Executing on Linux 4.4.0-45-generic amd64 INFO 11:43:39,912 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13 JdkDeflater INFO 11:43:39,915 HelpFormatter - Program Args: -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,919 HelpFormatter - Executing as margaret@SII-T7500-01 on Linux 4.4.0-45-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13. INFO 11:43:39,920 HelpFormatter - Date/Time: 2016/11/02 11:43:39 INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,950 GenomeAnalysisEngine - Strictness is SILENT INFO 11:43:40,029 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 11:43:40,035 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:43:40,060 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 11:43:40,068 HCMappingQualityFilter - Filtering out reads with MAPQ < 20

##### ERROR ------------------------------------------------------------------------------------------
next-gen GATK haplotype caller contigs • 3.8k views
3
Entering edit mode

Closing until you demonstrate that you've actually read the error message.

1
Entering edit mode

In all fairness to the user why cant the tool just say one thing:

Your chromosome names do no match, you are calling them 1,2,3 the reference is chr1, chr2...

why does this all need to be fished out from dozens of lines?

come to think of it, why not just recognize this error and run it. It is such a little thing to fix. Load it up as chr and be done with it.

2
Entering edit mode

The ways of GATK will always remain a mystery to me. I agree on simply handling with/without "chr", if we can do that in deepTools then GATK should be able to do that with their actual professional developers.

0
Entering edit mode

Can't we just start over with this entire chromosome nomenclature thing and globally agree on names?

0
Entering edit mode

I actually have a theory of why the naming of 1,2,3 for chromosome is around and seems so difficult to stamp out.

I think it is perpetuated by the (lazy) programmers that would rather generate chromosome names with a construct like range(1,22)

0
Entering edit mode

I'll take the opposing view that the Ensembl nomenclature is the sensible way and that needlessly tacking chr onto things (regardless of whether they're chromosomes) is just a waste of bytes.

1
Entering edit mode

It is about usability not waste of bytes. A lot many more resources may be wasted from misunderstandings.

When we see chr1 in any context we know it is a chromosome.

When we see a 1 it looks like a number and suggests a numerical value rather than a label. Hence the confusion.

0
Entering edit mode

The waste of storage is probably the most compelling argument for just using 1,2,3; but to me using chr1 makes more sense because the name is a string, not an integer.

1
Entering edit mode

I think the "chr1" vs. "1" debate is the "vi" vs. "emacs" of bioinformatics :P

1
Entering edit mode

However I wouldn't really care which would be the outcome if a consensus was found.

1
Entering edit mode

Good luck with consensus - I work with several nonhuman model organism biologists who insist on using Roman numerals (which really mucks up the sort order when you get to IX).

1
Entering edit mode

It would also be fine for me if we would use the names of Renaissance painters or Greek philosophers as identifiers. Sir your son has trisomy Michelangelo.

1
Entering edit mode

Or Socrates syndrome...

Getting punchy from sleep depravation. Signing off!

0
Entering edit mode

I took several steps to get the contigs in the same format chr...(i.e. changing the contig header in the post-bcftools VCF file, aligning my samples to HG38, even changing the VCF file's chrom entries to chr# ) so that this error would not continue to show up. Unfortunately, I kept getting the same error. The point of this post was to get tips on how to successfully change the contig format so that this error no longer appears. There are several bioinformaticists that encounter this error when trying to prepare their data files for various programs.

That's why I think it would be worthwhile to re-open this post.

1
Entering edit mode

Just use the same reference as -R ref.fasta with HaplotypeCaller as you used when aligning the samples. Avoid manual tampering with headers/vcf files!

0
Entering edit mode

Thanks for the suggestion, but I tried that, its just hard to see in the command I used above.

1
Entering edit mode

Please use ADD REPLY when answering to earlier comments, as such this thread remains logically structured and easy to follow.

And if that's what you did then magically somehow the chromosomes got changed. So that's not what you did. You told us earlier that you manually changed things. I've used haplotypecaller hundreds of times and never had an error like that unless you combine an Ensembl reference for variant calling with reads mapped to a UCSC reference or vice versa. The error message is very clear. The contigs in your vcf are 1, 2, 3,... while those in the reference you supplied are chr1, chr2, chr3.

0
Entering edit mode

therefore, it doesn't matter whether your reads contain "chr1" or "1", as long as the reference contains exactly the same nomenclature...

Traffic: 1680 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.