Question: GATK Haplotype Caller
0
gravatar for plink_9857
23 months ago by
plink_985740
United States
plink_985740 wrote:

Hi -

I am encountering errors when using GATK haplotypecaller with UCSC annotated VCF (bcftools), and BAM files with defined read groups. I also tried using GATK haplotype caller with a standard UCSC HG38 VCF and got the same error message which can be seen below.

My question is, what should I do to avoid this error message: no overlapping contigs found Here is my code:

Creation of contig header: awk '{printf("##contig=<id=%s,length=%d>\n",$1,$2);}' ucsc.hg38.fasta.fai > contigs_header_ucsc_hg38.txt

BCFtools code: margaret@SII-T7500-01:~/Data/UCSC/hg38$ bcftools annotate -h contigs_header_ucsc_hg38.txt All_20151104.vcf.gz -O v -o /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf Warning: The index file is older than the data file: All_20151104.vcf.gz.tbi

GATK haplotype caller output (error message):


margaret@SII-T7500-01:~/Programs$ java -jar GenomeAnalysisTK.jar -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,908 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,911 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 INFO 11:43:39,911 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 11:43:39,911 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk INFO 11:43:39,912 HelpFormatter - [Wed Nov 02 11:43:39 MST 2016] Executing on Linux 4.4.0-45-generic amd64 INFO 11:43:39,912 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13 JdkDeflater INFO 11:43:39,915 HelpFormatter - Program Args: -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,919 HelpFormatter - Executing as margaret@SII-T7500-01 on Linux 4.4.0-45-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13. INFO 11:43:39,920 HelpFormatter - Date/Time: 2016/11/02 11:43:39 INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,950 GenomeAnalysisEngine - Strictness is SILENT INFO 11:43:40,029 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 11:43:40,035 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:43:40,060 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 11:43:40,068 HCMappingQualityFilter - Filtering out reads with MAPQ < 20

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Input files /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf and reference have incompatible contigs. Please see https://www.broadinstitute.org/gatk/guide/article?id=63for more information. Error details: No overlapping contigs found.
ERROR /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
ERROR ------------------------------------------------------------------------------------------
ADD COMMENTlink modified 21 months ago by Biostar ♦♦ 20 • written 23 months ago by plink_985740
3

Closing until you demonstrate that you've actually read the error message.

ADD REPLYlink modified 23 months ago • written 23 months ago by Devon Ryan84k
1

In all fairness to the user why cant the tool just say one thing:

Your chromosome names do no match, you are calling them 1,2,3 the reference is chr1, chr2...

why does this all need to be fished out from dozens of lines?

come to think of it, why not just recognize this error and run it. It is such a little thing to fix. Load it up as chr and be done with it.

ADD REPLYlink written 23 months ago by Istvan Albert ♦♦ 77k
2

The ways of GATK will always remain a mystery to me. I agree on simply handling with/without "chr", if we can do that in deepTools then GATK should be able to do that with their actual professional developers.

ADD REPLYlink written 23 months ago by Devon Ryan84k

Can't we just start over with this entire chromosome nomenclature thing and globally agree on names?

ADD REPLYlink written 23 months ago by WouterDeCoster32k

I actually have a theory of why the naming of 1,2,3 for chromosome is around and seems so difficult to stamp out.

I think it is perpetuated by the (lazy) programmers that would rather generate chromosome names with a construct like range(1,22)

ADD REPLYlink modified 23 months ago • written 23 months ago by Istvan Albert ♦♦ 77k

I'll take the opposing view that the Ensembl nomenclature is the sensible way and that needlessly tacking chr onto things (regardless of whether they're chromosomes) is just a waste of bytes.

ADD REPLYlink written 23 months ago by Devon Ryan84k
1

It is about usability not waste of bytes. A lot many more resources may be wasted from misunderstandings.

When we see chr1 in any context we know it is a chromosome.

When we see a 1 it looks like a number and suggests a numerical value rather than a label. Hence the confusion.

ADD REPLYlink modified 23 months ago • written 23 months ago by Istvan Albert ♦♦ 77k

The waste of storage is probably the most compelling argument for just using 1,2,3; but to me using chr1 makes more sense because the name is a string, not an integer.

ADD REPLYlink written 23 months ago by WouterDeCoster32k
1

I think the "chr1" vs. "1" debate is the "vi" vs. "emacs" of bioinformatics :P

ADD REPLYlink written 23 months ago by Devon Ryan84k
1

However I wouldn't really care which would be the outcome if a consensus was found.

ADD REPLYlink written 23 months ago by WouterDeCoster32k
1

Good luck with consensus - I work with several nonhuman model organism biologists who insist on using Roman numerals (which really mucks up the sort order when you get to IX).

ADD REPLYlink modified 22 months ago • written 22 months ago by harold.smith.tarheel4.2k
1

It would also be fine for me if we would use the names of Renaissance painters or Greek philosophers as identifiers. Sir your son has trisomy Michelangelo.

ADD REPLYlink modified 22 months ago • written 22 months ago by WouterDeCoster32k
1

Or Socrates syndrome...

Getting punchy from sleep depravation. Signing off!

ADD REPLYlink written 22 months ago by harold.smith.tarheel4.2k

I took several steps to get the contigs in the same format chr...(i.e. changing the contig header in the post-bcftools VCF file, aligning my samples to HG38, even changing the VCF file's chrom entries to chr# ) so that this error would not continue to show up. Unfortunately, I kept getting the same error. The point of this post was to get tips on how to successfully change the contig format so that this error no longer appears. There are several bioinformaticists that encounter this error when trying to prepare their data files for various programs.

That's why I think it would be worthwhile to re-open this post.

ADD REPLYlink written 22 months ago by plink_985740
1

Just use the same reference as -R ref.fasta with HaplotypeCaller as you used when aligning the samples. Avoid manual tampering with headers/vcf files!

ADD REPLYlink modified 22 months ago • written 22 months ago by WouterDeCoster32k

Thanks for the suggestion, but I tried that, its just hard to see in the command I used above.

ADD REPLYlink written 22 months ago by plink_985740
1

Please use ADD REPLY when answering to earlier comments, as such this thread remains logically structured and easy to follow.

And if that's what you did then magically somehow the chromosomes got changed. So that's not what you did. You told us earlier that you manually changed things. I've used haplotypecaller hundreds of times and never had an error like that unless you combine an Ensembl reference for variant calling with reads mapped to a UCSC reference or vice versa. The error message is very clear. The contigs in your vcf are 1, 2, 3,... while those in the reference you supplied are chr1, chr2, chr3.

ADD REPLYlink written 22 months ago by WouterDeCoster32k

therefore, it doesn't matter whether your reads contain "chr1" or "1", as long as the reference contains exactly the same nomenclature...

ADD REPLYlink written 22 months ago by TriS3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1480 users visited in the last hour