Question: Bulk download of gene names from NCBI
2.3 years ago by
T_1840
T_1840 wrote:

Dear all,

I’m relatively new to harvesting data from NCBI databases, and I have been struggling for some time with the following task. I am trying to download gene names based on a list of protein accession IDs (in a text file). For example, I want to download the gene name/identification of “AAR23114.1”. Going to the NCBI page for this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1), I find the gene name under “CDS”, on the second line: /gene="cyp6a2".

I have a list of >1000 accession IDs and I want to download the corresponding gene names for all of them. Of course I have tried to find the answer myself:

  • BioMart does not work for ‘regular’ gene sequences from NCBI.
  • I have tried to download gene information in bulk using the Batch Entrez facilities, but unfortunately the gene name is not included for every record in the files you can download (e.g. summary or feature table), although it is available on the individual pages. In addition, the layout of the information is not standardized across records.

I am trying to get this done with efetch, but without any success so far. Is there a way to retrieve these gene names based on (protein) accession IDs?

Thanks in advance!

e-utilities • 1.9k views

although it is available at the individual pages

Do you have an example?

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum125k

Yes: "For example: I want to download the gene name/identification of “AAR23114.1”, going to the NCBI page of this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1) I find the gene name below at “CDS” at the second line: “/gene=“cyp6a2”."

ADD REPLYlink written 2.3 years ago by T_1840

Yes, that is your first example. I was looking for one where the gene name is only available in the download: "(e.g. summary or feature table -> although it is available at the individual pages!),"

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum125k

I have not found a case where it is only available in the download; the problem is that it is often missing from the download. The information is available on the individual record page (see the previous example) but not in the downloaded summary (Send to > File > Summary / Gene Features, or any other format):

  1. cytochrome P450 [Drosophila melanogaster] 506 aa protein AAR23114.1 GI:38505146

Ideally I could download a list with all protein accessions linked to their gene names, e.g. through efetch?

ADD REPLYlink written 2.3 years ago by T_1840

Pierre: you're a (bio)star! Thanks a lot..

ADD REPLYlink written 2.3 years ago by T_1840
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum125k wrote:

My solution, using XSLT:

example:

$ cat accessions.txt | xargs -n 100 echo | sed 's/ /\&id=/g' | while read S; do wget -O - -q "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=xml" | xsltproc --novalid biostar273687.xsl - ; done

CAA64262.1  NSP2
CAA46742.1  N/A
CAA68495.1  N/A
CAA46741.1  N/A
CAA88010.1  orf
CAA24511.1  N/A
CAA67568.1  VP7
CAA64568.1  9
CAA64658.1  9
CAA64657.1  9
CAA64659.1  9
CAA46743.1  9
CAA00124.1  N/A
5CB7_B  N/A
5CB7_A  N/A
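
For readers without xsltproc, the same idea (fetch the records in batches of 100 through efetch, then pull the first /gene qualifier out of each record) can be sketched in Python. This is not the stylesheet used above, only a minimal alternative under some assumptions: the request adds rettype=gp so that the response is GBSeq XML, the element names (GBSeq_accession-version, GBQualifier_name, GBQualifier_value) are those of that format, and the helper names batches, efetch_url, gene_names and main are made up for illustration.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(ids, size=100):
    """Yield successive chunks of at most `size` accessions."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def efetch_url(ids):
    """Build one efetch URL for a chunk of protein accessions.

    Assumption: rettype=gp with retmode=xml returns GBSeq XML.
    """
    params = {"db": "protein", "id": ",".join(ids),
              "rettype": "gp", "retmode": "xml"}
    return EFETCH + "?" + urlencode(params)


def gene_names(xml_text):
    """Return (accession, gene) pairs; gene is 'N/A' when no /gene qualifier exists."""
    rows = []
    for seq in ET.fromstring(xml_text).iter("GBSeq"):
        acc = seq.findtext("GBSeq_accession-version", default="N/A")
        gene = "N/A"
        # Take the first qualifier named "gene" anywhere in the record.
        for qual in seq.iter("GBQualifier"):
            if qual.findtext("GBQualifier_name") == "gene":
                gene = qual.findtext("GBQualifier_value", default="N/A")
                break
        rows.append((acc, gene))
    return rows


def main(path="accessions.txt"):
    """Read one accession per line and print accession<TAB>gene."""
    with open(path) as handle:
        ids = [line.strip() for line in handle if line.strip()]
    for chunk in batches(ids):
        with urlopen(efetch_url(chunk)) as response:
            for acc, gene in gene_names(response.read()):
                print(acc, gene, sep="\t")
        time.sleep(0.4)  # stay under NCBI's 3 requests/second without an API key
```

Calling main("accessions.txt") should print one tab-separated line per record, in the same spirit as the output above; records without a /gene qualifier are reported as N/A.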

I feel I'm almost there, but I ran into this error: "biostar273687.xsl:73: parser error : Premature end of data in tag stylesheet line 3 cannot parse biostar273687.xsl"

Am I correct that an end tag could be missing? Should there be a "</xsl:>" on line 5?

ADD REPLYlink written 2.3 years ago by T_1840

You're right, I copied the code badly. I'm going to fix it!

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum125k

OK, I've updated it: the closing xsl:stylesheet tag was missing at the end.
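
A truncated stylesheet like this can be caught before running xsltproc by checking that the file is well-formed XML. A minimal Python sketch, where check_well_formed is a made-up helper name and the check only covers well-formedness, not whether the XSLT itself is valid:

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if `text` parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None
```

Reading the stylesheet file and passing its contents to check_well_formed returns an error message when a closing tag such as the final xsl:stylesheet is missing.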

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum125k
2.3 years ago by
Renesh1.8k
United States
Renesh1.8k wrote:

You can use Batch Entrez for a large number of records (https://www.ncbi.nlm.nih.gov/sites/batchentrez):

  • Save all IDs in a text file.
  • Browse to the text file and retrieve the proteins.
  • Click on the retrieved records; this directs you to the NCBI gene summary page.
  • Click the "Send to" button (select File -> Format: Feature Table).
  • The result downloads as a file, in which you can see the accessions and gene names.