Question

Can I reach NCBI protein sequences with their corresponding NCBI urls?

9.2 years ago

natasha.sernova ★ 4.0k

Dear all,

Sorry, this is still a mystery to me. This has been discussed a few times, but why should it be so complicated? Why do I have to use XML or something similar rather than a simple script like the one below?

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

# my url has to be like:
# http://www.ncbi.nlm.nih.gov/protein/WP_005451061.1?report=fasta&log$=seqview&format=text
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.example.com/');

# Stop with the HTTP status line if the request failed.
unless ($response->is_success) {
        die $response->status_line;
}

# Decode the body, then choose the STDOUT layer to match it.
my $content = $response->decoded_content();
if (utf8::is_utf8($content)) {
        binmode STDOUT,':utf8';
} else {
        binmode STDOUT,':raw';
}

print $content;

ref: http://www.microhowto.info/howto/fetch_the_content_of_a_given_url_in_perl_using_lwp_useragent.html

I have a lot of NCBI IDs like WP_005451061.1, many thousands of them.

I will have to find their respective UniProt IDs, won't I?

http://www.ncbi.nlm.nih.gov/protein/WP_005451061.1?report=fasta&log$=seqview&format=text

Is it correct that there is no way to fetch the FASTA sequence behind the URL above with a script, and that I can only reach it manually? Thank you very much for your advice!

Sincerely yours,
Natalia

NCBI protein • 2.2k views

updated 2.0 years ago by Ram 43k • written 9.2 years ago by natasha.sernova ★ 4.0k

Ram · Accepted Answer · 2015-02-08

9.2 years ago

Siva ★ 1.9k

You can use E-utilities to get data from NCBI.

To get the amino acid sequence in FASTA format for a given ID (e.g. WP_005451061.1), or for several comma-separated IDs, use an EFetch URL such as:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=WP_005451061.1&rettype=fasta&retmode=text
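For anyone who wants to call this URL from a script, here is a minimal sketch in Python (rather than the Perl of the question), using only the standard library. The function names build_efetch_url and fetch_fasta are made up for illustration, and the https scheme is my assumption about how NCBI currently serves E-utilities; the parameters are the ones in the URL above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same endpoint as in the answer, but with https (an assumption about the
# current NCBI setup, not something stated in this thread).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(ids, db="protein", rettype="fasta", retmode="text"):
    """Return an EFetch URL for one or more accessions (hypothetical helper)."""
    params = {
        "db": db,
        "id": ",".join(ids),  # several IDs are joined with commas
        "rettype": rettype,
        "retmode": retmode,
    }
    # safe="," keeps the commas between IDs readable in the final URL.
    return EFETCH + "?" + urlencode(params, safe=",")


def fetch_fasta(ids):
    """Download the FASTA text for the given accessions (needs network access)."""
    with urlopen(build_efetch_url(ids)) as response:
        return response.read().decode("utf-8")
```

Calling print(fetch_fasta(["WP_005451061.1"])) should then print the FASTA record, provided the machine can reach NCBI.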

written 9.2 years ago by Siva ★ 1.9k

So easy?! Many thanks, I had not imagined such a clear solution! May I ask you a couple of other questions?

Is rettype=fasta critical for the format, or can it be a FASTA file with a *.txt extension? I'm afraid that's prohibited...

And is it possible to somehow use files containing these IDs? I have too many of them for a comma-separated list...

Thanks again!

Sincerely yours,
Natasha

updated 2.0 years ago by Ram 43k • written 9.2 years ago by natasha.sernova ★ 4.0k

Yeah, it is easy. This NCBI page describes several ways to download bulk data from NCBI.

Is rettype=fasta critical for the format, or can it be a FASTA file with a *.txt extension? I'm afraid that's prohibited...

You are right. This table lists the valid values for rettype and retmode for EFetch.

And is it possible to somehow use files containing these IDs? I have too many of them for a comma-separated list...

You can use Batch Entrez described in the page I linked above.
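If you would rather stay with a script than use the Batch Entrez web form, one scripted alternative is to read the IDs from a file and send them to EFetch in batches. Below is a rough Python sketch under some assumptions: the file has one accession per line, the helper names (read_ids, batches, fetch_batch, fetch_all) and the file names are hypothetical, the https endpoint is assumed, and the batch size of 200 and the half-second pause are conservative guesses based on NCBI's guidance to use HTTP POST for long ID lists and to keep to a few requests per second.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed https form of the EFetch endpoint used in the answer above.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_ids(path):
    """Read one accession per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def batches(ids, size=200):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_batch(ids):
    """POST one batch of IDs to EFetch and return the FASTA text."""
    data = urlencode({
        "db": "protein",
        "id": ",".join(ids),
        "rettype": "fasta",
        "retmode": "text",
    }).encode("ascii")
    # Passing `data` makes urlopen send a POST instead of a GET.
    with urlopen(EFETCH, data=data) as response:
        return response.read().decode("utf-8")


def fetch_all(id_file, out_file, size=200, pause=0.5):
    """Fetch every ID listed in id_file and write the sequences to out_file."""
    with open(out_file, "w") as out:
        for batch in batches(read_ids(id_file), size):
            out.write(fetch_batch(batch))
            time.sleep(pause)  # pause between requests to respect NCBI rate limits
```

With a file ids.txt (a made-up name) holding the accessions, fetch_all("ids.txt", "proteins.fasta") would collect all sequences into a single FASTA file.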

updated 2.0 years ago by Ram 43k • written 9.2 years ago by Siva ★ 1.9k

Perfect! It's not as terrible as it seemed...

A thousand thanks!

Sincerely yours,
Natasha

updated 2.0 years ago by Ram 43k • written 9.2 years ago by natasha.sernova ★ 4.0k

Sorry, I have to come back to this. If I need the nucleotide sequence for the same protein, is it enough to change 'protein' to 'nucleotide' in the URL, or do I have to do something else? I think the ID should stay the same. Am I correct?

Thank you!

Natasha

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=WP_005451061.1&rettype=fasta&retmode=text

updated 2.0 years ago by Ram 43k • written 9.2 years ago by natasha.sernova ★ 4.0k

I see now that I was wrong. With the same ID, I still get a protein sequence in the output even when I specify 'nucleotide'. What else should be changed? I haven't noticed anything relevant in the E-utilities documentation... What is my mistake?

Thanks in advance.

updated 2.0 years ago by Ram 43k • written 9.2 years ago by natasha.sernova ★ 4.0k