Question

How To Interface With NCBI Nucleotide Using Bioinformatics Tools?

11.9 years ago

leonardo ▴ 40

I am trying to compile a list of known peptide hormones and neuropeptides from sequences deposited in NCBI's GenBank and Nucleotide databases. I am looking to extract the peptide sequence of the entire coding region, as well as the sequences of the signal peptide and mature peptide(s).

I could do this manually for the few dozen genes I have at the moment; however, I know there are more than 100 genes of interest, possibly as many as ~500 in total. If I compile a list of genes, is there already a program or workflow to batch extract this information from NCBI? Are there resources you can link for me so I can get started?

Here's an example. GenBank contains the complete coding sequence (mRNA) for the INS gene, which encodes proinsulin, the precursor to insulin (the active hormone). That GenBank page lists a number of "features" of the sequence. The CDS is the full product of the INS gene, composed of the signal peptide and the mature peptide. I would like to extract these three features.

ncbi nucleotide general • 3.4k views

score 2 · Answer 1 · 2013-08-10

Could you give an example of such an entry? I am not sure whether NCBI Nucleotide or GenBank stores this kind of information in a format that is easily retrieved programmatically.

On the other hand, if they're known gene products from model organisms, you might have more luck with UniProt.

For instance, for Neuropeptide S, you can go to http://www.uniprot.org/uniprot/P0C0P6. Programmatically, you can use http://www.uniprot.org/uniprot/P0C0P6.gff to retrieve features, and http://www.uniprot.org/uniprot/P0C0P6.fasta to get the full peptide sequence.
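As a rough illustration of the UniProt route, here is a minimal Python sketch that builds the two URLs above for any accession, reads feature rows from the GFF text, and slices the matching piece out of the FASTA sequence. The function names are made up for this example, the URL pattern is copied from the links above and may have changed since, and the sketch assumes the GFF follows the usual nine-column, tab-separated layout with 1-based inclusive coordinates.

```python
import urllib.request

# Base URL copied from the links in this answer; it may redirect or change.
UNIPROT_BASE = "http://www.uniprot.org/uniprot"


def uniprot_url(accession, fmt):
    """Build a URL such as .../P0C0P6.gff or .../P0C0P6.fasta."""
    return f"{UNIPROT_BASE}/{accession}.{fmt}"


def fetch_text(url):
    """Download a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_gff_features(gff_text):
    """Return (feature_type, start, end) tuples from GFF text.

    Assumes tab-separated columns: seqid, source, type, start, end, ...
    Rows with non-numeric coordinates are skipped.
    """
    features = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        columns = line.split("\t")
        if len(columns) < 5:
            continue
        try:
            start, end = int(columns[3]), int(columns[4])
        except ValueError:
            continue
        features.append((columns[2], start, end))
    return features


def parse_fasta_sequence(fasta_text):
    """Join all non-header lines of a single-record FASTA file."""
    return "".join(
        line.strip() for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def feature_sequence(sequence, start, end):
    """Slice a feature using 1-based, inclusive GFF coordinates."""
    return sequence[start - 1:end]
```

For a list of accessions, you would call `fetch_text(uniprot_url(acc, "gff"))` and `fetch_text(uniprot_url(acc, "fasta"))` in a loop, then keep the rows whose feature type is the one you want (for example a signal peptide row).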

I believe some signal peptides are manually annotated, and others are predicted using a combination of tools, such as SignalP and TargetP. Predicted ones are marked as potential. You can read more about that on the UniProt help page.

Edit:

Using NCBI E-utilities, you can get the XML for each gene. To get the internal identifier, use esearch:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=HUMINS01

That'll give you a list of IDs for your search. In this case, the ID is 186429. Depending on your search term, you'll get between 0 and 20 IDs, unless you set the parameter &retmax=<some number>. Use the IDs to fetch an XML file for each entry:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=186429&rettype=native&retmode=xml
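The esearch and efetch steps can be scripted for a whole list of search terms. Below is a minimal sketch using only the Python standard library. The helper names are invented for this example, the parameters are the ones shown in the URLs above, and the parser assumes the esearch response lists its hits as `<Id>` elements.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Base URL taken from the example requests in this answer.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an esearch URL; retmax caps the number of IDs returned."""
    query = urllib.parse.urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(uid, db="nucleotide"):
    """Build an efetch URL that requests the native record as XML."""
    query = urllib.parse.urlencode(
        {"db": db, "id": uid, "rettype": "native", "retmode": "xml"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def parse_esearch_ids(xml_text):
    """Return the text of every <Id> element in an esearch response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.iter("Id") if node.text]


def fetch_text(url):
    """Download a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def fetch_records(term):
    """Search for one term and return {id: XML text} for each hit."""
    ids = parse_esearch_ids(fetch_text(esearch_url(term)))
    return {uid: fetch_text(efetch_url(uid)) for uid in ids}
```

Calling `fetch_records("HUMINS01")` should return the XML for ID 186429 if the search behaves as described above. When looping over hundreds of terms, pause between requests (for example with `time.sleep`) so you stay within NCBI's request rate limits.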

Use whatever you fancy to parse the XML. What you're most likely interested in are the "Seq-feat" blocks. These contain the "Prot-ref_processed" node, which can take the values "signal-peptide", "mature", and possibly more. I don't know if the vocabulary is controlled. From the same block you can also extract the "Seq-interval". Note that the interval is zero-indexed and relative to the sequence itself, not the chromosomal location.
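To make the parsing step concrete, here is a small sketch with Python's built-in ElementTree. It rests on several assumptions that you should check against a real record: that the processing type is stored in a `value` attribute of "Prot-ref_processed" (with the element text as a fallback), that each "Seq-interval" holds "Seq-interval_from" and "Seq-interval_to" children, and that the end position is inclusive. Only the first interval of each feature is read, so features spanning several intervals would need extra handling. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET


def extract_processed_features(xml_text):
    """Return (kind, start, end) for each Seq-feat with a processed type.

    Coordinates are returned exactly as stored: zero-indexed, and assumed
    here to have an inclusive end.
    """
    root = ET.fromstring(xml_text)
    features = []
    for feat in root.iter("Seq-feat"):
        processed = next(feat.iter("Prot-ref_processed"), None)
        interval = next(feat.iter("Seq-interval"), None)  # first interval only
        if processed is None or interval is None:
            continue
        # Assumption: the label sits in the "value" attribute; else use text.
        kind = processed.get("value") or (processed.text or "").strip()
        start = interval.findtext("Seq-interval_from")
        end = interval.findtext("Seq-interval_to")
        if start is None or end is None:
            continue
        features.append((kind, int(start), int(end)))
    return features


def slice_feature(sequence, start, end):
    """Slice a zero-indexed interval with an inclusive end position."""
    return sequence[start:end + 1]
```

With the sequence of the record in hand, `slice_feature(sequence, start, end)` gives the signal peptide or mature peptide for each tuple returned by `extract_processed_features`.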