I’m relatively new to harvesting data from NCBI databases, and I have been struggling for some time with the following task. I am trying to download gene names based on a list of protein accession IDs (in a text file). For example, I want to download the gene name for “AAR23114.1”. On the NCBI page for this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1), the gene name appears on the second line of the “CDS” feature: /gene="cyp6a2".
I have a list of >1000 accession IDs and I want to download the corresponding gene names for all of them. Of course, I have tried to find the answer myself:
- BioMart does not work for ‘regular’ NCBI gene sequences.
- I have tried to download gene information in bulk using the Batch Entrez facilities, but unfortunately the gene name is not included for every record in the files you can download (e.g. the summary or feature table), even though it is available on the individual pages. Furthermore, the layout of the information is not standardized across records.
I am trying to get this done with efetch, but without any success so far. Is there a way to retrieve these gene names based on (protein) accession IDs?
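For reference, a minimal Python sketch of one possible efetch-based approach. It assumes that calling efetch with db=protein, rettype=gp and retmode=text returns GenPept flat-file text for each accession, and that the first /gene= qualifier in a record is the one wanted (records without one map to None). The function names, the batch size of 200, and the input file path are placeholders chosen for illustration, not anything prescribed by NCBI.

```python
import re
import time
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_genpept(accessions):
    """Fetch GenPept flat-file text for one batch of protein accessions."""
    params = urllib.parse.urlencode({
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "gp",
        "retmode": "text",
    }).encode()
    # Passing data= makes this a POST, which avoids very long URLs.
    with urllib.request.urlopen(EFETCH_URL, data=params) as resp:
        return resp.read().decode("utf-8")


def parse_gene_names(genpept_text):
    """Map accession.version to the first /gene= qualifier of each record."""
    genes = {}
    # Each flat-file record ends with a line containing only "//".
    for record in genpept_text.split("\n//"):
        version = re.search(r"^VERSION\s+(\S+)", record, re.MULTILINE)
        if not version:
            continue  # trailing whitespace or an empty chunk
        gene = re.search(r'/gene="([^"]+)"', record)
        genes[version.group(1)] = gene.group(1) if gene else None
    return genes


def gene_names_for_file(path, batch_size=200):
    """Read one accession per line from `path` and look up gene names in batches."""
    with open(path) as handle:
        accessions = [line.strip() for line in handle if line.strip()]
    result = {}
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        result.update(parse_gene_names(fetch_genpept(batch)))
        time.sleep(0.5)  # stay well under NCBI's request-rate limit
    return result
```

Calling `gene_names_for_file("accessions.txt")` (hypothetical file name) would return a dictionary keyed by accession.version. Accessions that are missing from the result, or that map to None, would still need to be checked by hand.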
Thanks in advance!