GATK BaseRecalibrator error: input files reference and features have incompatible contigs
12 months ago
ben.ponv • 0

I tried to run GATK BaseRecalibrator with the hg38 reference genome and known-sites files from https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0. I aligned the paired FASTQ files with bwa-mem against this same hg38 reference. The known-sites files included the following:

hg38/resources_broad_hg38_v0_1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf


The command was:

gatk BaseRecalibrator -I SP01c_marked_dup_sorted_RG.bam -R ../genomelib/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --known-sites hg38/resources_broad_hg38_v0_1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf --known-sites hg38/resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz --known-sites hg38/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf --known-sites hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz --known-sites hg38/resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites hg38/resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz -O SP01c_recal_data.table


The error was:

A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = chr15 / 101991189
contig features = chr15 / 90338345
reference contigs = [chr1, chr2, chr3, ..., HLA-DRB1*16:02:01]
features contigs = [chr1, chr2, chr3, ..., chr22, chrX, chrY]
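The message compares the sequence dictionary of the reference with the contig lines in the header of a features (known-sites) file. As a rough sketch of that comparison, the Python below parses `@SQ` lines (as found in a reference `.dict` file or a BAM header) and `##contig` lines from a VCF header, then lists contigs that share a name but differ in length. The function names are made up for illustration, and this is not how GATK itself performs the check; the chr15 values are the ones from the error above.

```python
import re

# Patterns for VCF header lines such as: ##contig=<ID=chr15,length=90338345>
ID_RE = re.compile(r"[<,]ID=([^,>]+)")
LEN_RE = re.compile(r"[<,]length=(\d+)")


def dict_contig_lengths(dict_lines):
    """Return {name: length} from '@SQ' lines of a .dict file or SAM header."""
    lengths = {}
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        tags = {}
        for field in line.rstrip("\n").split("\t")[1:]:
            # Split on the first colon only: names like HLA-DRB1*16:02:01 contain colons.
            key, _, value = field.partition(":")
            tags[key] = value
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def vcf_contig_lengths(header_lines):
    """Return {name: length} from '##contig' lines of a VCF header."""
    lengths = {}
    for line in header_lines:
        if not line.startswith("##contig=<"):
            continue
        name = ID_RE.search(line)
        length = LEN_RE.search(line)
        if name and length:
            lengths[name.group(1)] = int(length.group(1))
    return lengths


def length_mismatches(ref_lengths, feat_lengths):
    """List (name, reference length, features length) for same-name contigs that differ."""
    return sorted(
        (name, ref_lengths[name], feat_len)
        for name, feat_len in feat_lengths.items()
        if name in ref_lengths and ref_lengths[name] != feat_len
    )


# Tiny inline example using the chr15 lengths reported in the error message.
reference = dict_contig_lengths(["@SQ\tSN:chr15\tLN:101991189\n"])
features = vcf_contig_lengths(["##contig=<ID=chr15,length=90338345>\n"])
for name, ref_len, feat_len in length_mismatches(reference, features):
    print(f"{name}: reference={ref_len}, features={feat_len}")
# prints: chr15: reference=101991189, features=90338345
```

In practice the input lines would come from the reference `.dict` file and from the header of each known-sites VCF.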


There was no output file. I am running Windows 10 Enterprise Build 19042 with WSL2 (Ubuntu 18.04), openjdk version 11.0.9.1, bwa version 0.7.17-r1188, and gatk version 4.1.9.0.

I'd like to know what caused this incompatibility and how to fix it, since I used only resources from the official resource bundle.

gatk baserecalibrator • 1.3k views

using the reference hg38 genome

Did you use the reference genome from the very same GATK bundle?


Yes, I did.

Another thing I just noticed is that BaseRecalibrator ran smoothly after I removed resources_broad_hg38_v0_1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf from the known-sites list. So the incompatibility must somehow come from this file.
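Rather than removing known-sites files one at a time, the headers can be checked up front. The sketch below is one possible way to do that in Python: it reads only the header of each VCF (plain or gzip/bgzip-compressed) and reports the files whose contig lengths disagree with the reference. `read_vcf_contigs` and `conflicting_files` are illustrative names, and the reference lengths are assumed to be supplied as a plain dictionary, for example parsed from the reference `.dict` file.

```python
import gzip
import re

ID_RE = re.compile(r"[<,]ID=([^,>]+)")
LEN_RE = re.compile(r"[<,]length=(\d+)")


def read_vcf_contigs(path):
    """Return {contig: length} from the header of a plain or gzip-compressed VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    contigs = {}
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # header is over; the variant records are not needed
            if line.startswith("##contig=<"):
                name = ID_RE.search(line)
                length = LEN_RE.search(line)
                if name and length:
                    contigs[name.group(1)] = int(length.group(1))
    return contigs


def conflicting_files(ref_lengths, vcf_paths):
    """Map each VCF path to {contig: (reference length, VCF length)} where they differ."""
    report = {}
    for path in vcf_paths:
        bad = {
            name: (ref_lengths[name], length)
            for name, length in read_vcf_contigs(path).items()
            if name in ref_lengths and ref_lengths[name] != length
        }
        if bad:
            report[path] = bad
    return report
```

Calling `conflicting_files({"chr15": 101991189}, paths)` with the list of known-sites paths from the command would return an entry only for files whose header declares a different chr15 length. A file with no `##contig` lines at all yields no entry, so an empty report does not by itself prove the headers are complete.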


Pierre Lindenbaum I'm facing a similar situation. I only have .bam files and I'm not sure whether the GATK bundle was used to create them, but I can see from the BAM header that the reference version is GRCh37. Since the GATK resource bundle only has hg38, would it be OK to convert the resource files from hg38 to GRCh37 with the CrossMap tool so that I can use them in BaseRecalibrator?

I also downloaded dbSNP v37 from here, but I'm not sure whether this is the right file to use as known-sites (the README file there doesn't say much).

I'm really stuck because of this. I want to keep converting the BAM back to FASTQ as the last option. It would be of great help if you could comment on whether I should look only for v37 files (I'm struggling to find a proper site to download each of them) or whether I can lift the files over and use them. (I'm already trying this without much success: I keep getting the same "........incompatible contigs: No overlapping contigs found" error even after conversion.)

UPDATE: I found some GATK resource files for hg37, which are also giving me the same error!
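"No overlapping contigs found" means the BAM/reference and the known-sites file share no contig names at all. The thread does not say why that happens here. One common cause, assumed in this sketch and not confirmed for these files, is a naming difference such as `1` versus `chr1` between GRCh37-style and hg19-style resources. The Python below takes the text of a BAM header (for example the output of `samtools view -H`) and a list of VCF contig names, and reports whether the names overlap, differ only by a `chr` prefix, or share nothing. The function names are hypothetical.

```python
def sq_names(header_text):
    """Return contig names from the '@SQ' lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def strip_chr(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name


def diagnose_contig_names(bam_contigs, vcf_contigs):
    """Classify how two sets of contig names relate to each other."""
    bam, vcf = set(bam_contigs), set(vcf_contigs)
    if bam & vcf:
        return "overlap"
    if {strip_chr(n) for n in bam} & {strip_chr(n) for n in vcf}:
        return "chr-prefix mismatch"
    return "no shared names"


# Example with a GRCh37-style header line and hg19-style VCF contig names.
header = "@HD\tVN:1.4\n@SQ\tSN:1\tLN:249250621\n"
print(diagnose_contig_names(sq_names(header), ["chr1", "chr2"]))
```

Even when the names differ only by prefix, renaming contigs is not sufficient on its own: the contig lengths must also agree with the reference the BAM was aligned to.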


GenoMax Apologies for tagging you, but could you please suggest something here?


If you are able to reconstruct the FASTQ files and realign the data against the proper resource files from GATK, that may be the way to go. It will be more work, but you should have no trouble with downstream steps.