Question: Bulk download of gene names NCBI
T_18 wrote (19 months ago):

Dear all,

I'm relatively new to harvesting data from NCBI databases, and I have been struggling for some time with the following task: downloading gene names for a list of protein accession IDs (stored in a text file). For example, I want the gene name for "AAR23114.1". On the NCBI page for this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1), the gene name appears on the second line under "CDS": /gene="cyp6a2".

I have a list of >1000 accession IDs and I want to download the corresponding gene names for all of them. Of course, I have tried to find the answer myself:

  • BioMart does not work for 'regular' NCBI gene sequences.
  • I have tried to download the gene information in bulk using Batch Entrez, but the gene name is not included for every record in the files you can download (e.g. summary or feature table), although it is shown on the individual pages. In addition, the layout of the information is not standardized across records.

I am trying to get this done with efetch, but without any success so far. Is there a way to retrieve these gene names based on (protein) accession IDs?

Thanks in advance!

e-utilities • 1.3k views
modified 19 months ago by Renesh • written 19 months ago by T_18

"although it is available at the individual pages"

Can you give an example?

written 19 months ago by Pierre Lindenbaum

Yes: "For example, I want the gene name for 'AAR23114.1'. On the NCBI page for this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1), the gene name appears on the second line under 'CDS': /gene="cyp6a2"."

written 19 months ago by T_18

Yes, that is your first example. I was looking for one where the gene name is only available in the download: "(e.g. summary or feature table -> although it is available at the individual pages!)".

written 19 months ago by Pierre Lindenbaum

I have not found a case where it is only available in the download; the problem is that it is often missing from the download. The information is shown on the record's own page (see the previous example) but not in the downloaded summary (Send to > File > summary / gene feature or any other format):

  1. cytochrome P450 [Drosophila melanogaster] 506 aa protein AAR23114.1 GI:38505146

Ideally, I would download a list that links every protein accession to its gene name. Could this be done through efetch, for example?

written 19 months ago by T_18

Pierre: you're a (bio)star! Thanks a lot..

written 19 months ago by T_18

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (19 months ago):

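My solution uses XSLT: fetch the records as XML with efetch, 100 accessions per request, and transform each response with the stylesheet biostar273687.xsl.

Example:

$ cat accessions.txt | xargs -n 100 echo | sed 's/ /\&id=/g' | while read S; do wget -O - -q "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=xml" | xsltproc --novalid biostar273687.xsl - ; done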

CAA64262.1  NSP2
CAA46742.1  N/A
CAA68495.1  N/A
CAA46741.1  N/A
CAA88010.1  orf
CAA24511.1  N/A
CAA67568.1  VP7
CAA64568.1  9
CAA64658.1  9
CAA64657.1  9
CAA64659.1  9
CAA46743.1  9
CAA00124.1  N/A
5CB7_B  N/A
5CB7_A  N/A
written 19 months ago by Pierre Lindenbaum

I feel I'm almost there, but bumped into this error: "biostar273687.xsl:73: parser error : Premature end of data in tag stylesheet line 3 cannot parse biostar273687.xsl"

Am I correct that an end tag could be missing? Should there be a closing "</xsl:>" on line 5?

written 19 months ago by T_18

You're right, I copied the code incorrectly. I'm going to fix it!

written 19 months ago by Pierre Lindenbaum

OK, I've updated it: the closing xsl:stylesheet tag was missing at the end.

written 19 months ago by Pierre Lindenbaum
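
For readers who do not have the stylesheet at hand, the same accession-to-gene mapping can be sketched in plain shell. This is not the XSLT from the answer above: it is a minimal sketch that assumes efetch is asked for GenPept flat files (rettype=gp, retmode=text) instead of XML, and that each record carries a VERSION line, an optional /gene="..." qualifier, and a closing "//" line. The function names build_efetch_urls and extract_genes are hypothetical, and records without a gene qualifier are reported as N/A, as in the output above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of any NCBI tool.

# Print one efetch URL per batch of accessions.
# $1: text file with one accession per line; $2: batch size (default 100).
build_efetch_urls() {
  # xargs with no command echoes its arguments, so each output line is one batch.
  xargs -n "${2:-100}" < "$1" | sed 's/ /,/g' | while read -r ids; do
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=%s&rettype=gp&retmode=text\n' "$ids"
  done
}

# Read GenPept flat-file text on stdin and print "accession<TAB>gene" per record.
extract_genes() {
  awk '
    # The VERSION line holds accession.version as its second field.
    $1 == "VERSION" { acc = $2 }
    # Keep only the first /gene qualifier seen in the current record.
    gene == "" && match($0, /\/gene="[^"]*"/) {
      gene = substr($0, RSTART + 7, RLENGTH - 8)
    }
    # A line starting with "//" terminates the record.
    /^\/\// {
      if (acc != "") print acc "\t" (gene == "" ? "N/A" : gene)
      acc = ""; gene = ""
    }
  '
}

# Possible usage (needs network access; the pause between requests is a courtesy):
#   build_efetch_urls accessions.txt | while read -r url; do
#     wget -q -O - "$url"; sleep 1
#   done | extract_genes > accession_to_gene.tsv
```

Parsing the flat file with awk avoids the need for xsltproc, at the cost of relying on the GenPept text layout rather than on the XML structure.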
Renesh (United States) wrote (19 months ago):

You can use Batch Entrez for a large number of records (https://www.ncbi.nlm.nih.gov/sites/batchentrez):

  • Save all IDs in a text file.
  • Browse to the text file and retrieve the proteins.
  • Click on the retrieved records; this takes you to the NCBI summary page.
  • Click the "Send to" button (select File -> Format: Feature Table).
  • The result downloads as a file, in which you can see the accessions and gene names.
written 19 months ago by Renesh