Question

Querying on non-canonical bacterial IDs

7 days ago

benjamin.kellman • 0

I am trying to automate a query for non-canonical bacterial protein IDs in NCBI.

The IDs are not standard RefSeq accessions (e.g. NP_*) but instead start with HDK (e.g. GenBank: HDK9254199.1).

They exist in NCBI (https://www.ncbi.nlm.nih.gov/protein/HDK9254199) but I have not yet been able to recover them programmatically.

In R,

library(rentrez)
search <- entrez_search(db = "protein", term = "HDK9254199")
summary <- entrez_summary(db = "protein", id = "HDK9254199")
search <- entrez_search(db = "nuccore", term = "HDK9254199")
summary <- entrez_summary(db = "nuccore", id = "HDK9254199")

None of these work: the summary calls error out and the searches return no hits.

In a perfect world my colleague would have used the reference genome, but they have already ordered an expensive library corresponding to these accessions.

Is there any way to map these HDK accessions back to canonical RefSeq accessions?

microbes protein accession • 307 views

updated 7 days ago by Mensur Dlakic ★ 29k • written 7 days ago by benjamin.kellman • 0

score 0 · Answer 1 · 2025-06-13

I don't think there are any other records, including what you call canonical, for this protein.

Maybe this will help you, using Entrez Direct:

efetch -id HDK9254199 -db protein -format fasta

The screen output:

>HDK9254199.1 TPA: RluA family pseudouridine synthase [Staphylococcus aureus USA100-NRS382]
METYEFNITDKEQTGMRVDKLLPELNNDWSRNQIQDWIKAGLVVANDKVVKSNYKVKLNDHIVVTEKEVV
EADILPENLNLDIYYEDDDVAVVYKPKGMVVHPSPGHYTNTLVNGLMYQIKNLSGINGEIRPGIVHRIDM
DTSGLLMVAKNDIAHRGLVEQLMDKSVKRKYIALVHGNIPHDYGTIDAPIGRNKNDRQSMAVVDDGKEAV
THFNVLEHFKDYTLVECQLETGRTHQIRVHMKYIGFPLVGDPKYGPKKTLDIGGQALHAGLIGFEHPVTG
EYIERHAELPQDFEDLLDTIRKRDA

Assuming you have all the IDs of interest in IDs.txt, one per line:

cat IDs.txt | xargs -I {} efetch -id {} -db protein -format fasta >> proteins.fas
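
If the ID list was copied from a spreadsheet or an order form, it may carry labels such as "GenBank: HDK9254199.1", blank lines, or duplicates, and any of those would trip up the loop above. Below is a minimal sketch of cleaning the list before fetching. The function names (normalize_acc, clean_id_list, fetch_fasta) are made up for illustration and are not part of Entrez Direct. The only real Entrez Direct call is the same efetch command shown above, and the sketch assumes one ID per line.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Entrez Direct) for tidying an ID list.

# normalize_acc: strip an optional "GenBank:" label and any surrounding
# whitespace (including a trailing carriage return) from one accession.
normalize_acc() {
    printf '%s\n' "$1" |
        sed -E 's/^[[:space:]]*(GenBank:)?[[:space:]]*//; s/[[:space:]]+$//'
}

# clean_id_list: read raw IDs on stdin, normalize each one, skip empty
# lines, and drop duplicates while keeping the original order.
clean_id_list() {
    while IFS= read -r line || [ -n "$line" ]; do
        acc=$(normalize_acc "$line")
        [ -n "$acc" ] && printf '%s\n' "$acc"
    done | awk '!seen[$0]++'
}

# fetch_fasta: run one efetch per cleaned accession.
# Needs Entrez Direct installed and network access.
#   $1 = input file of raw IDs, $2 = output FASTA file (overwritten)
fetch_fasta() {
    clean_id_list < "$1" |
        xargs -I {} efetch -db protein -id {} -format fasta > "$2"
}
```

After sourcing the functions, something like `fetch_fasta IDs.txt proteins.fas` would replace the one-liner. Note that this version overwrites proteins.fas instead of appending to it, so rerunning it does not duplicate records.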