Querying on non-canonical bacterial IDs
I am trying to automate a query for non-canonical bacterial protein IDs in NCBI.
The IDs are not standard RefSeq accessions (e.g. NP_*) but instead start with HDK (e.g. GenBank: HDK9254199.1).
They exist in NCBI (https://www.ncbi.nlm.nih.gov/protein/HDK9254199) but I have not yet been able to retrieve them programmatically.
In R,
library(rentrez)
search <- entrez_search(db = "protein", term = "HDK9254199")
summary <- entrez_summary(db = "protein", id= "HDK9254199")
search <- entrez_search(db = "nuccore", term = "HDK9254199")
summary <- entrez_summary(db = "nuccore", id= "HDK9254199")
None of these work: the summary calls error out and the searches return no hits.
In a perfect world my colleague would have used the reference genome. But, they have already ordered an expensive library corresponding to these accession terms.
Is there any way to map these HDK accessions back to canonical RefSeq accessions?
microbes
protein
accession
I don't think there are any other records for this protein, including what you call canonical ones.
Maybe this will help you. Using Entrez Direct:
efetch -id HDK9254199 -db protein -format fasta
The screen output:
>HDK9254199.1 TPA: RluA family pseudouridine synthase [Staphylococcus aureus USA100-NRS382]
METYEFNITDKEQTGMRVDKLLPELNNDWSRNQIQDWIKAGLVVANDKVVKSNYKVKLNDHIVVTEKEVV
EADILPENLNLDIYYEDDDVAVVYKPKGMVVHPSPGHYTNTLVNGLMYQIKNLSGINGEIRPGIVHRIDM
DTSGLLMVAKNDIAHRGLVEQLMDKSVKRKYIALVHGNIPHDYGTIDAPIGRNKNDRQSMAVVDDGKEAV
THFNVLEHFKDYTLVECQLETGRTHQIRVHMKYIGFPLVGDPKYGPKKTLDIGGQALHAGLIGFEHPVTG
EYIERHAELPQDFEDLLDTIRKRDA
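Since there may be no RefSeq counterpart to map to, the header line is the main annotation you get back for each accession. As a rough sketch, the function below turns the header lines of a FASTA file into a tab-separated table of accession, description and organism. The function name `fasta_headers_to_tsv` is made up for illustration, and the parsing assumes headers shaped like the one above, with the organism in square brackets at the end of the line. The `TPA:` prefix (Third Party Annotation) is left in the description as is.

```shell
# Print accession, description and organism (tab-separated) for each header
# line in a FASTA file. Hypothetical helper, not part of Entrez Direct.
# Usage: fasta_headers_to_tsv proteins.fas
fasta_headers_to_tsv() {
    awk '/^>/ {
        acc  = substr($1, 2)                  # drop the leading ">"
        desc = substr($0, length($1) + 2)     # everything after the accession
        org  = ""
        # Assumes the organism is the final " [...]" group on the line.
        if (match(desc, / \[[^]]*\]$/)) {
            org  = substr(desc, RSTART + 2, RLENGTH - 3)
            desc = substr(desc, 1, RSTART - 1)
        }
        printf "%s\t%s\t%s\n", acc, desc, org
    }' "$1"
}
```

For the record above, this would give one row with HDK9254199.1, the RluA description, and the Staphylococcus aureus strain name.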
Assuming you have all the IDs of interest in IDs.txt, one per line (-I {} replaces the deprecated GNU-only -i):
xargs -I {} efetch -id {} -db protein -format fasta < IDs.txt >> proteins.fas
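With a long list it can help to know which accessions came back empty. Below is a minimal sketch of the same idea as a loop, assuming Entrez Direct's `efetch` is on your PATH. The function name `fetch_proteins` and the file names are placeholders I made up, and treating "output starts with >" as success is my own heuristic, not something Entrez Direct guarantees.

```shell
# Fetch each accession on its own and record the ones that return no FASTA.
# Hypothetical helper built around the efetch call shown above.
# Usage: fetch_proteins IDs.txt proteins.fas missing.txt
fetch_proteins() {
    ids_file=$1; out_fasta=$2; missing_file=$3
    : > "$out_fasta"        # start with empty output files
    : > "$missing_file"
    while IFS= read -r id || [ -n "$id" ]; do
        # Strip whitespace (including Windows line endings); skip blank lines.
        id=$(printf '%s' "$id" | tr -d '[:space:]')
        [ -z "$id" ] && continue
        # </dev/null stops efetch from reading the rest of the ID list on stdin.
        record=$(efetch -db protein -id "$id" -format fasta < /dev/null) || true
        case $record in
            '>'*) printf '%s\n' "$record" >> "$out_fasta" ;;
            *)    printf '%s\n' "$id" >> "$missing_file" ;;
        esac
    done < "$ids_file"
}
```

Calling `fetch_proteins IDs.txt proteins.fas missing.txt` should leave the sequences in proteins.fas and any accession that returned nothing in missing.txt, so you can check those by hand on the NCBI website.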