Hello everyone! I've been trying to use GATK with an updated version of the human genome, because the GATK reference files are about ten years out of date.
I've downloaded the NCBI reference GCF_000001405.40.fna, which is GRCh38.p14.
For dbSNP, I've downloaded GCF_000001405.40.gz, which is also GRCh38.p14.
When extracting the contig names from my reference file, I found:
NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
0 252068378 NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary As etc...
Extracting the contig names:
reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3...
For the dbSNP file, I found:
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random...
This mismatch causes a bunch of errors with GATK and other annotation tools.
I'm not sure which option would be best: converting the contig names in all the BAMs and the reference file, or converting the contig names in the dbSNP VCF. I have no idea how to do either of them!
Duplicate of "dbSNP with RefSeq chromosome notation".
See bcftools annotate --rename-chrs
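To make that suggestion concrete: --rename-chrs expects a plain text file with one "old_name new_name" pair per line. Below is a minimal sketch of building that file from the NCBI assembly report for the same assembly (on the NCBI FTP site it is typically named GCF_000001405.40_GRCh38.p14_assembly_report.txt). It assumes the usual report layout: tab-separated, comment lines starting with "#", the last comment line holding the column names, including "UCSC-style-name" and "RefSeq-Accn". The function names and the sample rows are my own, for illustration only; check them against your real report.

```python
# Sketch: build a bcftools --rename-chrs map (chr-style -> RefSeq accession)
# from an NCBI assembly report. Function names here are hypothetical helpers.

def build_rename_map(report_lines, old_col="UCSC-style-name", new_col="RefSeq-Accn"):
    """Return a list of (old_name, new_name) pairs from assembly-report lines."""
    header = None
    pairs = []
    for line in report_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the last '#' line before the data holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        row = dict(zip(header, line.split("\t")))
        old, new = row.get(old_col, "na"), row.get(new_col, "na")
        # Sequences with no name on one side are reported as 'na'; skip them.
        if old != "na" and new != "na":
            pairs.append((old, new))
    return pairs


def format_rename_map(pairs):
    """Format the pairs as whitespace-separated lines, one pair per line."""
    return "".join(f"{old}\t{new}\n" for old, new in pairs)


def rename_map_from_file(report_path):
    """Convenience wrapper: read a report from disk and return the map text."""
    with open(report_path) as handle:
        return format_rename_map(build_rename_map(handle))


# Illustrative rows in the assembly-report layout (not a copy of the real file).
# The third row is made up to show how 'na' entries are skipped.
SAMPLE_REPORT = [
    "# Assembly name:  GRCh38.p14",
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name",
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1",
    "HSCHR1_CTG1_UNLOCALIZED\tunlocalized-scaffold\t1\tChromosome\tKI270706.1\t=\t"
    "NT_187361.1\tPrimary Assembly\t175055\tchr1_KI270706v1_random",
    "HYPOTHETICAL_PATCH\tfix-patch\t1\tChromosome\tXX000000.1\t=\t"
    "NW_000000000.1\tPATCHES\t1000\tna",
]

if __name__ == "__main__":
    print(format_rename_map(build_rename_map(SAMPLE_REPORT)), end="")
```

With the output of rename_map_from_file() saved as, say, chr_to_refseq.txt (a placeholder name), the dbSNP VCF could then be renamed with something like: bcftools annotate --rename-chrs chr_to_refseq.txt dbsnp.vcf.gz -Oz -o dbsnp.refseq.vcf.gz, followed by re-indexing the output. Renaming the one VCF is usually less work than renaming the reference and every BAM, but swapping the two column arguments would give the opposite mapping if you prefer chr-style names everywhere.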