Question

Converting a dbSNP VCF to work with RefSeq chromosome IDs

13 months ago

avelarbio46 ▴ 30

Hello everyone! I've been trying to use GATK with an updated version of the human genome, as the GATK files are ten years out of date.

I've downloaded the NCBI reference GCF_000001405.40.fna, which is GRCh38.p14.

For the dbSNP version, I've downloaded GCF_000001405.40.gz, which is also GRCh38.p14.

When extracting the contig names from my reference file, I found:

NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
0 252068378 NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary As etc...

Extracting the contig names:

reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3...

For the dbSNP file, I found:

features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random...

This mismatch causes a bunch of errors with GATK and other annotation tools.

I'm not sure which option would be best: converting the contig names in all BAMs and the reference file, or converting the contig names in the dbSNP VCF. I have no idea how to do either!
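The mismatch between the two naming schemes can be confirmed before running GATK. The following is a minimal Python sketch, not a tested pipeline: the function names `fasta_contigs` and `vcf_contigs` are made up for illustration, the FASTA headers are taken from the listing above, and the single VCF record (position, rsID, alleles) is a placeholder rather than real dbSNP data.

```python
import re

def fasta_contigs(lines):
    """Contig IDs from FASTA headers: the first word after '>'."""
    return [ln[1:].split()[0] for ln in lines
            if ln.startswith(">") and ln[1:].strip()]

def vcf_contigs(lines):
    """Contig IDs from ##contig header lines and the CHROM column, in first-seen order."""
    seen = {}
    for ln in lines:
        if ln.startswith("##contig="):
            m = re.search(r"ID=([^,>]+)", ln)
            name = m.group(1) if m else None
        elif ln.startswith("#") or not ln.strip():
            name = None  # other header lines and blank lines carry no contig
        else:
            name = ln.split("\t", 1)[0]
        if name:
            seen.setdefault(name, None)
    return list(seen)

# Headers as shown in the reference listing above.
reference_headers = [
    ">NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly",
    ">NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold",
]
# Placeholder VCF lines (hypothetical record, for illustration only).
vcf_lines = [
    "##contig=<ID=chr1>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\trs0\tA\tG\t.\t.\t.",
]

ref_names = fasta_contigs(reference_headers)
vcf_names = vcf_contigs(vcf_lines)
shared = set(ref_names) & set(vcf_names)
print("shared contigs:", sorted(shared))
```

For real files, the same functions can be fed an open file handle, for example `gzip.open(path, "rt")` for a compressed VCF. An empty `shared` set means the two files use different naming schemes.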

NCBI dbSNP RefSeq • 1.3k views

updated 9 months ago by onter ▴ 170 • written 13 months ago by avelarbio46 ▴ 30

Duplicate of "dbSNP with RefSeq chromosome notation".

13 months ago by Pierre Lindenbaum 161k

See bcftools annotate --rename-chrs.

13 months ago by Pierre Lindenbaum 161k
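The --rename-chrs option of bcftools annotate expects a plain-text map with one "old_name new_name" pair per line, separated by whitespace. One way to build that map is from the NCBI assembly report for GCF_000001405.40, which lists a RefSeq-Accn and a UCSC-style-name column for each sequence. The following Python sketch assumes that report layout. The function name `build_rename_map` and the file names in the final comment are hypothetical, and the two data rows are included only to illustrate the format.

```python
def build_rename_map(report_lines, old_col="UCSC-style-name", new_col="RefSeq-Accn"):
    """Return (old, new) contig-name pairs from an NCBI assembly report."""
    header = None
    pairs = []
    for line in report_lines:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # The last comment line before the data rows holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if not line.strip() or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        old, new = row.get(old_col, "na"), row.get(new_col, "na")
        if old != "na" and new != "na":  # the report uses "na" for missing names
            pairs.append((old, new))
    return pairs

# Illustrative excerpt in the assembly-report layout (assumed, not the full file).
report = [
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name",
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1",
    "MT\tassembled-molecule\tMT\tMitochondrion\tJ01415.2\t=\t"
    "NC_012920.1\tnon-nuclear\t16569\tchrM",
]

pairs = build_rename_map(report)
map_text = "".join(f"{old} {new}\n" for old, new in pairs)
print(map_text, end="")

# After saving map_text to a file (name is hypothetical), the VCF could be renamed with:
#   bcftools annotate --rename-chrs chr_to_refseq.txt -Oz -o dbsnp.refseq.vcf.gz dbsnp.vcf.gz
```

Swapping `old_col` and `new_col` gives the reverse map (RefSeq to chr names). Renaming only the dbSNP VCF is usually less work than renaming the reference and every BAM, since it touches a single file.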

When I downloaded the file from https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/, I got the contigs present in the NCBI reference. However, this created another problem for me: the dbSNP rsIDs do not seem to be mapped to the main chromosomes. For example, NW_015148968.1 was coming up for rs28371738 instead of the contigs chr22/NC_0000022....

9 months ago by onter ▴ 170
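A quick way to check whether an rsID such as rs28371738 has records on more than one contig is to scan the VCF for every line carrying that ID. The following Python sketch is only an illustration: the function name `contigs_for_rsid` is made up, and the two sample records use placeholder positions and alleles, not real dbSNP coordinates. It relies on the RefSeq convention that NC_ accessions are assembled chromosomes, while NW_ and NT_ accessions are scaffolds.

```python
def contigs_for_rsid(vcf_lines, rsid):
    """Return (contig, position) for every record whose ID column contains rsid."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        # The ID column may hold several IDs separated by ';'.
        if len(fields) > 2 and rsid in fields[2].split(";"):
            hits.append((fields[0], int(fields[1])))
    return hits

# Placeholder records (positions and alleles are NOT real dbSNP values).
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NW_015148968.1\t100\trs28371738\tA\tG\t.\t.\t.",
    "NC_000022.11\t200\trs28371738\tA\tG\t.\t.\t.",
]

hits = contigs_for_rsid(sample, "rs28371738")
primary = [h for h in hits if h[0].startswith("NC_")]
print("all hits:", hits)
print("on assembled chromosomes:", primary)
```

If `primary` is empty for the real file, the rsID is only reported on a scaffold in that release. If it is not empty, filtering the VCF to NC_ contigs before annotation may be enough.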