Question: How To Interface With NCBI Nucleotide Using Bioinformatics Tools?
leonardo wrote (5.7 years ago):

I am trying to compile a list of known peptide hormones and neuropeptides from sequences deposited in NCBI's GenBank and Nucleotide databases. I am looking to extract the peptide sequence of the entire coding region, as well as the sequences of the signal peptide and mature peptide(s).

I could do this manually for the few dozen genes I have at the moment; however, I know there are more than 100, possibly as many as ~500, genes of interest in total. If I compile a list of genes, is there an existing program or workflow to batch extract this information from NCBI? Are there resources you can link so I can get started?

Here's an example. GenBank contains the complete coding sequence (mRNA) for the INS gene, which encodes proinsulin, the precursor to insulin (the active hormone). That GenBank page lists a number of "features" of the sequence. The CDS is the full product of the INS gene, composed of the signal peptide and the mature peptide. I would like to extract these three features.

David Westergaard wrote (5.7 years ago):

Could you give an example of such an entry? I am not sure whether NCBI Nucleotide or GenBank stores this kind of information in a format that is easily retrieved programmatically.

On the other hand, if they're known gene products from model organisms, you might have more luck with UniProt.

For instance, you can look up the UniProt entry for Neuropeptide S. Programmatically, you can retrieve the features of that entry and its full peptide sequence from UniProt as well.

I believe some signal peptides are manually annotated, and others are predicted using a combination of tools such as SignalP and TargetP. Predicted ones are marked as potential. You can read more about that on the UniProt help page.
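If you download the features as a GFF-style text file, a minimal sketch for picking out the signal peptide and mature peptide coordinates could look like this. The sample lines, the accession X00000, the feature type names and the toy sequence are all my assumptions for illustration, not taken from a real UniProt entry, so check them against an actual download.

```python
# Illustrative GFF-style lines with a made-up accession and coordinates;
# they are NOT copied from a real UniProt entry.
SAMPLE_GFF = (
    "##gff-version 3\n"
    "X00000\tUniProtKB\tSignal peptide\t1\t4\t.\t.\t.\tNote=hypothetical\n"
    "X00000\tUniProtKB\tPeptide\t5\t10\t.\t.\t.\tNote=hypothetical\n"
)

# Feature type names are assumed; adjust them to what your file contains.
WANTED = ("Signal peptide", "Propeptide", "Peptide", "Chain")


def read_features(gff_text, wanted=WANTED):
    """Return (type, start, end) for each wanted feature line."""
    features = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        if ftype in wanted:
            features.append((ftype, start, end))
    return features


def subsequence(protein, start, end):
    """Cut out a feature; GFF coordinates are 1-based and inclusive."""
    return protein[start - 1:end]


TOY_PROTEIN = "MKTLLVAGHW"  # toy sequence, not a real protein
features = read_features(SAMPLE_GFF)
for ftype, start, end in features:
    print(ftype, subsequence(TOY_PROTEIN, start, end))
```

The same loop can be run over one file per gene once you have a list of accessions.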


Using NCBI E-utilities, you can get the XML for each gene. To get the internal identifier, run esearch with your search term, for example HUMINS01.

That'll give you a list of IDs for your search. In this case, the ID is 186429. Depending on your search term, you'll get between 0 and 20 IDs, unless you set the parameter &retmax=<some number>. Use the IDs with efetch to fetch an XML file for each entry.
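A rough Python sketch of these two steps, using only the standard library: build the esearch URL, pull the IDs out of the returned XML, then build one efetch URL per ID. The base URL, `db=nucleotide`, and the `rettype=native` / `retmode=xml` parameters are my assumptions about the E-utilities interface, so verify them against the NCBI documentation. The sample response is hand-written so the sketch runs offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed base URL for E-utilities; verify before relying on it.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an esearch URL for a search term such as HUMINS01."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def parse_ids(esearch_xml):
    """Return the IDs listed in an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def efetch_url(uid, db="nucleotide"):
    """Build an efetch URL for one ID (rettype/retmode are assumptions)."""
    query = urlencode({"db": db, "id": uid, "rettype": "native", "retmode": "xml"})
    return f"{EUTILS}/efetch.fcgi?{query}"


# Hand-written stand-in for an esearch response.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>1</Count>"
    "<IdList><Id>186429</Id></IdList></eSearchResult>"
)

ids = parse_ids(SAMPLE_RESPONSE)
urls = [efetch_url(uid) for uid in ids]
print(esearch_url("HUMINS01"))
print(urls)
```

For a real run you would download each URL (for example with `urllib.request.urlopen`) and pause between requests when looping over a few hundred genes.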

Use whatever you fancy to parse the XML. What you're most likely interested in are the "Seq-feat" blocks. These contain the "Prot-ref_processed" node, which can take the values "signal-peptide", "mature" and possibly more. I don't know if the vocabulary is controlled. From the same "Seq-feat" block you can also extract the "Seq-interval". Note that the interval is zero-indexed and relative to the sequence itself, not to the chromosomal location.
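As a sketch of that parsing step with Python's built-in ElementTree: the XML fragment below is hand-written and heavily trimmed, and the child element names `Seq-interval_from` and `Seq-interval_to`, the `value` attribute, and the inclusive end coordinate are my assumptions, so compare them with a real downloaded record first. The sequence is a toy string, not insulin.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed fragment for illustration only; a real record is
# much larger and its exact nesting should be checked against a download.
SAMPLE = """
<Seq-annot>
  <Seq-feat>
    <Seq-feat_data><Prot-ref>
      <Prot-ref_processed value="signal-peptide"/>
    </Prot-ref></Seq-feat_data>
    <Seq-feat_location><Seq-interval>
      <Seq-interval_from>0</Seq-interval_from>
      <Seq-interval_to>3</Seq-interval_to>
    </Seq-interval></Seq-feat_location>
  </Seq-feat>
  <Seq-feat>
    <Seq-feat_data><Prot-ref>
      <Prot-ref_processed value="mature"/>
    </Prot-ref></Seq-feat_data>
    <Seq-feat_location><Seq-interval>
      <Seq-interval_from>4</Seq-interval_from>
      <Seq-interval_to>9</Seq-interval_to>
    </Seq-interval></Seq-feat_location>
  </Seq-feat>
</Seq-annot>
"""


def processed_features(xml_text):
    """Return (kind, start, end) for each Seq-feat with a processed tag."""
    found = []
    root = ET.fromstring(xml_text)
    for feat in root.iter("Seq-feat"):
        processed = feat.find(".//Prot-ref_processed")
        interval = feat.find(".//Seq-interval")  # first interval only
        if processed is None or interval is None:
            continue  # not a signal/mature peptide style feature
        kind = processed.get("value")
        start = int(interval.findtext("Seq-interval_from"))
        end = int(interval.findtext("Seq-interval_to"))
        found.append((kind, start, end))
    return found


def slice_feature(sequence, start, end):
    """Zero-indexed start; end assumed inclusive, hence the +1."""
    return sequence[start:end + 1]


TOY_SEQUENCE = "MKTLLVAGHW"  # toy sequence, not a real protein
features = processed_features(SAMPLE)
for kind, start, end in features:
    print(kind, slice_feature(TOY_SEQUENCE, start, end))
```

Features split over several intervals would need all "Seq-interval" nodes joined rather than just the first one.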


leonardo replied (5.7 years ago): I added an example, and your comment was instructive. I did not know much about the UniProt database. I was able to get most of the information I needed from this search.

David Westergaard replied (5.7 years ago): I edited my reply with an NCBI solution as well.

leonardo replied (5.7 years ago): Thanks for the edit. I've accepted your answer. :)