Hi,
I am building an NGS pipeline from scratch. The FASTQ files were aligned to the hg19 reference with BWA-MEM. Samtools was used for sorting and indexing, and Picard tools for marking duplicates and estimating library complexity.
At this point, I want to run GATK BaseRecalibrator. However, I get this error message:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [NC_000001.10, NT_113878.1, NT_167207.1, NC_000002.11, NC_000003.11, NC_000004.11, NT_113885.1, NT_113888.1, NC_000005.9, NC_000006.11, NC_000007.13, NT_113901.1, NC_000008.10, NT_113909.1, NT_113907.1, NC_000009.11, NT_113914.1, NT_113916.2, NT_113915.1, NT_113911.1, NC_000010.10, NC_000011.9, NT_113921.2, NC_000012.11, NC_000013.10, NC_000014.8, NC_000015.9, NC_000016.9, NC_000017.10, NT_113941.1, NT_113943.1, NT_113930.1, NT_113945.1, NC_000018.9, NT_113947.1, NC_000019.9, NT_113948.1, NT_113949.1, NC_000020.10, NC_000021.8, NT_113950.2, NC_000022.10, NC_000023.10, NC_000024.9, NT_113961.1, NT_113923.1, NT_167208.1, NT_167209.1, NT_167210.1, NT_167211.1, NT_167212.1, NT_113889.1, NT_167213.1, NT_167214.1, NT_167215.1, NT_167216.1, NT_167217.1, NT_167218.1, NT_167219.1, NT_167220.1, NT_167221.1, NT_167222.1, NT_167223.1, NT_167224.1, NT_167225.1, NT_167226.1, NT_167227.1, NT_167228.1, NT_167229.1, NT_167230.1, NT_167231.1, NT_167232.1, NT_167233.1, NT_167234.1, NT_167235.1, NT_167236.1, NT_167237.1, NT_167238.1, NT_167239.1, NT_167240.1, NT_167241.1, NT_167242.1, NT_167243.1, NW_004070864.2, NW_003571030.1, NW_003871056.3, NW_003871055.3, NW_003315905.1, NW_003315906.1, NW_003315907.1, NW_004070863.1, NW_003871057.1, NW_004070865.1, NW_003315903.1, NW_003315904.1, NW_003315908.1, NW_004504299.1, NW_003571032.1, NW_003571033.2, NW_003315909.1, NW_003571031.1, NW_003871060.1, NW_003871059.1, NW_003315910.1, NW_004775426.1, NW_003315911.1, NW_003871058.1, NW_003315912.1, NW_003315913.1, NW_004775427.1, NW_003315915.1, NW_003315916.1, NW_003571035.1, NW_003315914.1, NW_003571034.1, NW_003315920.1, NW_003571036.1, NW_003315917.2, NW_003315918.1, NW_003871061.1, NW_004775428.1, NW_003315919.1, NW_004070866.1, NW_003871063.1, NW_003315921.1, NW_004504300.1, NW_003871062.1, NW_004775429.1, NW_004166862.1, NW_003571039.1, NW_003571038.1, NW_004775430.1, NW_003871064.1, NW_003571041.1, NW_003571037.1, NW_003871065.1, NW_003315922.2, NW_003571040.1, 
NW_003571042.1, NW_004775431.1, NW_003871066.2, NW_003315923.1, NW_003315924.1, NW_003315928.1, NW_003871067.1, NW_003315929.1, NW_003315930.1, NW_003315931.1, NW_004504301.1, NW_004070869.1, NW_003315925.1, NW_004070867.1, NW_004070868.1, NW_003315926.1, NW_003315927.1, NW_003571043.1, NW_003871071.1, NW_003315932.1, NW_003315934.1, NW_003315935.1, NW_003871068.1, NW_004504302.1, NW_003871070.1, NW_004775432.1, NW_003871069.1, NW_003315933.1, NW_004070870.1, NW_003871075.1, NW_003871082.1, NW_003315936.1, NW_003571045.1, NW_003871073.1, NW_003871074.1, NW_003571046.1, NW_004070871.1, NW_003871081.1, NW_003871079.1, NW_003871077.1, NW_003871080.1, NW_003871078.1, NW_003871072.2, NW_003871076.1, NW_003571048.1, NW_003571049.1, NW_003871083.2, NW_003571047.1, NW_003571050.1, NW_003315938.1, NW_003315939.1, NW_003315941.1, NW_003315942.2, NW_004504303.2, NW_003315940.1, NW_003315937.1, NW_003571051.1, NW_004166863.1, NW_003315943.1, NW_003315944.1, NW_003871084.1, NW_003315945.1, NW_003871085.1, NW_003315946.1, NW_004070872.2, NW_003315952.2, NW_003315951.1, NW_003315950.2, NW_004775433.1, NW_003871090.1, NW_004166864.2, NW_003315949.1, NW_003315948.2, NW_003871091.1, NW_003871093.1, NW_003871092.1, NW_003315953.1, NW_003571052.1, NW_003871086.1, NW_003315947.1, NW_003871088.1, NW_003315954.1, NW_003315955.1, NW_003871089.1, NW_003871087.1, NW_003315956.1, NW_003315959.1, NW_003315960.1, NW_003315957.1, NW_003315958.1, NW_003315961.1, NW_003871094.1, NW_003571053.2, NW_003315962.1, NW_003315964.2, NW_003315965.1, NW_003315963.1, NW_004775434.1, NW_004166865.1, NW_003571054.1, NW_003571055.1, NW_003571056.1, NW_003571057.1, NW_003571058.1, NW_003571059.1, NW_003571060.1, NW_003571061.1, NW_003315966.1, NW_003871095.1, NW_004504304.1, NW_003571063.2, NW_003315967.1, NW_003315968.1, NW_003315969.1, NW_003315970.1, NW_004775435.1, NW_004070874.1, NW_004070873.1, NW_004070875.1, NW_003871096.1, NW_003315972.1, NW_003315971.2, NW_004504305.1, NW_004070876.1, NW_003571064.2, 
NW_003871098.1, NW_003871099.1, NW_004070879.1, NW_004166866.1, NW_004070880.2, NW_004070877.1, NW_004070881.1, NW_004070882.1, NW_003871100.1, NW_003871101.3, NW_004070883.1, NW_004070884.1, NW_004070885.1, NW_003871102.1, NW_004070878.1, NW_004070891.1, NW_004070892.1, NW_004070893.1, NW_004070886.1, NW_004070887.1, NW_004070888.1, NW_004070889.1, NW_004070890.2, NW_003871103.3, NT_167244.1, NT_113891.2, NT_167245.1, NT_167246.1, NT_167247.1, NT_167248.1, NT_167249.1, NT_167250.1, NT_167251.1, NC_012920.1]
features contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y]
After running the GATK command the first time, I saw that it needed an additional sequence dictionary file, reference.dict. To create the file, I ran gatk CreateSequenceDictionary -R reference.fasta (as recommended on this page https://gatk.broadinstitute.org/hc/en-us/articles/360035531652-FASTA-Reference-genome-format) on the same reference file that was used for all previous analysis steps.
Previously, the reference file was only processed by the bwa index reference.fasta command. I used the same reference.fasta file for the entire pipeline.
The reference files look fine to me; I assume the error arises due to the chromosome labels (features contigs) in the gnomAD.vcf file used as --known-sites in the command:
gatk BaseRecalibrator -I sample.sorted.bam -R reference.fasta --known-sites gnomad.genomes.r2.1.1.sites.vcf --known-sites gnomad.exomes.r2.1.1.sites.vcf -O recal_data.table
Am I supposed to edit these input files so that the contig labels match? Do you recommend using other population VCF files? Any other ideas on how to fix this issue?
Any help would be appreciated.
Thanks for the reply. I did use the same reference file for the entire pipeline. I had a look at the reference.fasta and the reference.dict files, and they look fine to me. Now, I think it is a mismatch between the reference.fasta contig labels and the chromosome nomenclature in the VCF files used for the BaseRecalibrator --known-sites option. I edited my post accordingly.
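A quick way to confirm a mismatch like this is to compare the sequence names declared in reference.dict (the SN: field of each @SQ line) with the names in the CHROM column of the VCF. The following is only a minimal sketch using awk; the function names are made up for illustration, and it assumes an uncompressed VCF.

```shell
# Hypothetical helper names; only POSIX awk is used.

# Print the contig names declared in a sequence dictionary (@SQ lines, SN: field).
dict_contigs() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }' "$1"
}

# Print each distinct contig name used in the CHROM column of an uncompressed VCF.
vcf_contigs() {
    awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }' "$1"
}

# Print the names that occur in both files; empty output means no overlap.
shared_contigs() {
    awk -F'\t' '
        FNR == NR {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) ref[substr($i, 4)] = 1
            next
        }
        /^#/ { next }
        ($1 in ref) && !seen[$1]++ { print $1 }
    ' "$1" "$2"
}

# Example call (file names taken from the thread):
# shared_contigs reference.dict gnomad.genomes.r2.1.1.sites.vcf
```

If shared_contigs prints nothing, the two files have no contig name in common, which is consistent with the "No overlapping contigs found" error.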
Yes, that might be the case. How was the VCF obtained? The reference.fasta has to remain constant.
I downloaded them from the official gnomAD download page.
Did you find a solution for this problem? My reference FASTA is from Ensembl and so has Ensembl chromosome naming (e.g. 1, 2, 3), but the VCF file I have uses e.g. chr1, chr2.
I eventually gave up on using the gnomAD VCF and settled for only using the dbSNP VCF, which used the same reference. You'll probably find a compatible dbSNP file, but converting the gnomAD file may be complicated. Have a look at this post.
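For anyone who still wants to try converting the VCF, one possible route is a rename map for bcftools annotate --rename-chrs, which takes a file with one "old_name new_name" pair per line. The sketch below derives such a map from reference.dict. It assumes the reference uses RefSeq chromosome accessions NC_000001 to NC_000024 (with 23 = X and 24 = Y, as in the error message above) and that the VCF uses 1-22, X, Y. The function name and output file names are hypothetical, and I have not run this on the gnomAD files.

```shell
# Hypothetical helper: print "old_name<TAB>new_name" pairs for the primary
# chromosomes, mapping VCF names (1..22, X, Y) to the RefSeq accessions in the dict.
refseq_rename_map() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:NC_0000[0-9][0-9]\./) {
                acc = substr($i, 4)          # e.g. NC_000001.10
                n = substr(acc, 8, 2) + 0    # last two digits of the accession number
                if (n < 1 || n > 24) continue
                name = n
                if (n == 23) name = "X"
                if (n == 24) name = "Y"
                printf "%s\t%s\n", name, acc
            }
        }
    }' "$1"
}

# Possible usage (assumes bcftools is installed; output names are placeholders):
# refseq_rename_map reference.dict > rename_chrs.txt
# bcftools annotate --rename-chrs rename_chrs.txt -O z \
#     -o gnomad.genomes.renamed.vcf.gz gnomad.genomes.r2.1.1.sites.vcf
# bcftools index -t gnomad.genomes.renamed.vcf.gz
```

The same map format should cover the chr1 versus 1 case, with lines such as "chr1 1". Unplaced scaffolds and the mitochondrial contig are deliberately left out of the map here, since the VCF in the question only contains 1-22, X and Y.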