Question

Mapping many GI numbers returned from ESearch to Accessions from nr.gz (non-redundant NCBI protein DB)

4.4 years ago

protein_guru • 0

Question: Which NCBI mapping file, or collection of mapping files, contains a complete GI number -> accession mapping? That is, given a list of GI numbers returned from an ESearch query with database = 'protein', I need the corresponding accessions so that I can extract all of the matching protein sequences from the non-redundant protein database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz).

Background: I'm attempting to gather a large number of protein sequences from the NCBI, and I have begun by gathering the relevant GI numbers (~10^7 of them). I realize I could use EFetch to map these GI numbers to their corresponding accessions, but I am gathering a lot of sequences, and using these e-services has proven to be extremely painful and flaky at this scale. I think it makes more sense to set things up locally so at least that way I'm CPU/IO bound and not at the mercy of HTTP requests. Thus, I have a ton of protein GI numbers, and I want to map them to the contents of nr.gz (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) myself. The issue is that nr.gz does not contain GI numbers, only accessions...

Other ways to achieve this also appreciated!
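Once a set of accessions is in hand, the local extraction step described above could look roughly like the sketch below. It assumes that nr.gz is a gzipped FASTA file in which one record may carry several deflines joined by Ctrl-A (`\x01`), each starting with an accession.version. The function names and the output format are illustrative only, not part of any NCBI tool.

```python
import gzip
from typing import Iterable, Iterator, Set, Tuple


def defline_accessions(header: str) -> Set[str]:
    """Return every accession on one header line (leading '>' already removed)."""
    # Assumption: merged entries are separated by Ctrl-A, and each entry
    # begins with its accession.version followed by a space and the title.
    return {entry.split(None, 1)[0] for entry in header.split("\x01") if entry.strip()}


def filter_fasta(lines: Iterable[str], wanted: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for records naming at least one wanted accession."""
    header, seq, keep = None, [], False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if keep:
                yield header, "".join(seq)
            header, seq = line[1:], []
            keep = not wanted.isdisjoint(defline_accessions(header))
        elif keep:
            seq.append(line)
    if keep:
        yield header, "".join(seq)


def extract_from_nr(path: str, wanted: Set[str], out_path: str) -> int:
    """Stream a gzipped FASTA file once and write the matching records."""
    count = 0
    with gzip.open(path, "rt") as handle, open(out_path, "w") as out:
        for header, seq in filter_fasta(handle, wanted):
            out.write(f">{header}\n{seq}\n")
            count += 1
    return count
```

Because the file is streamed once and only the accession set is held in memory, the run is bound by disk and decompression speed rather than by HTTP requests.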

ncbi esearch • 1.3k views

ADD COMMENT • link updated 4.3 years ago by Biostar 20 • written 4.4 years ago by protein_guru • 0

If you can run your esearch query again, you can pipe it to efetch -format acc to get the accessions. Once you have the accession list, you can use the faa files from the RefSeq FTP release path to retrieve sequences. This will probably be the fastest way.

ADD REPLY • link 4.4 years ago by vkkodali_ncbi ★ 3.7k

Thanks for your reply. It appears that the RefSeq chunked DB won't necessarily contain all the sequences corresponding to my GIs, however, because I believe an esearch "protein" query returns hits from the entire non-redundant database, which includes GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq. Correct me if I'm wrong, though. I agree that piping hits to efetch is convenient, but the command-line utilities do not implement the error checking necessary at this scale (so other, more suitable tools are needed, with some customization). That's why I'm leaning towards doing the mapping offline if possible.
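For the error checking mentioned above, a minimal sketch of batched EFetch calls with retries might look like this. It assumes the public E-utilities `efetch.fcgi` endpoint with `db=protein`, `rettype=acc` and `retmode=text`. The batch size, retry count, pause and function names are illustrative choices, not NCBI recommendations, and NCBI's request-rate limits still apply.

```python
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator, List, Tuple

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(ids: Iterable, size: int) -> Iterator[List]:
    """Group an iterable of IDs into lists of at most `size` items."""
    batch = []
    for item in ids:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def efetch_accessions(gi_batch: List[str]) -> List[str]:
    """POST one batch of GIs to EFetch and return the accession lines."""
    data = urllib.parse.urlencode({
        "db": "protein",
        "id": ",".join(gi_batch),
        "rettype": "acc",
        "retmode": "text",
    }).encode()
    with urllib.request.urlopen(EFETCH_URL, data=data, timeout=60) as resp:
        return resp.read().decode().split()


def fetch_all(ids: Iterable[str],
              fetch: Callable[[List[str]], List[str]] = efetch_accessions,
              size: int = 200, retries: int = 3,
              pause: float = 1.0) -> Tuple[List[str], List[List[str]]]:
    """Fetch every batch, retrying on network errors; return (accessions, failed batches)."""
    accessions, failed = [], []
    for batch in batches(ids, size):
        for attempt in range(1, retries + 1):
            try:
                accessions.extend(fetch(batch))
                break
            except OSError:  # covers urllib's URLError and socket timeouts
                if attempt == retries:
                    failed.append(batch)  # keep for a later re-run
                else:
                    time.sleep(pause * attempt)  # simple linear back-off
    return accessions, failed
```

Batches that still fail after the last retry are returned rather than dropped, so they can be resubmitted later without repeating the whole job.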

ADD REPLY • link 4.4 years ago by protein_guru • 0

You are correct: the RefSeq release will not include the other proteins. I have never tested the Entrez Direct tools at that scale myself, but just for the purpose of fetching accessions I think they will work. In fact, you don't even need Entrez Direct for this. You can execute your query on the NCBI Protein portal; when you see the results, click the 'Send To' link, select 'File', and choose 'Accession' as the format. It will create a file called sequences.seq that contains the accessions.

ADD REPLY • link 4.4 years ago by vkkodali_ncbi ★ 3.7k

Thanks, I'll give that a shot. I've also come across this post (https://ncbiinsights.ncbi.nlm.nih.gov/2016/12/23/converting-lots-of-gi-numbers-to-accession-version/) that may be what I'm looking for. Seems like a clear and cogent set of instructions from the NCBI would be helpful to do this kind of stuff (especially for someone not steeped in their nomenclature yet), but I'm also fairly new to this so I may just not have stumbled on it yet.

ADD REPLY • link 4.4 years ago by protein_guru • 0

vkkodali : Would this method work for 10^7 gi numbers? Those are a lot to deal with irrespective of the method.

ADD REPLY • link 4.4 years ago by GenoMax 141k

I think so. If you search for txid2759[Organism] in the NCBI Protein portal, you get >72M hits. I was able to download the entire list of those accessions using the Send To > File method. Warning: it takes a while to download, and the uncompressed text file with the accessions is big.

ADD REPLY • link 4.4 years ago by vkkodali_ncbi ★ 3.7k

How did you end up using GI numbers? They were deprecated for end users a couple of years ago. As I understand it, some of the newer sequences may not have GI numbers.

ADD REPLY • link 4.4 years ago by GenoMax 141k

Good question. It was a mistake in part. The default behavior of ESearch is to return GIs, and I didn't realize at the time the headache that would result from not specifying accessions as the return type (as I thought I'd use EUtils for the fetching as well). Now that I've submitted a set of time-intensive and complex queries, I figured it would save time to just proceed by mapping the GIs to accessions. I will grab accessions instead moving forward, but the 2016 post with the mapping script seems to be working fine. Thanks for the heads-up on the lack of a 1:1 mapping from GI -> accessions now. I will keep that in mind when I check these results.
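Checking the results for GIs that have no accession could be sketched as follows. The two-column, tab-separated "GI, accession.version" layout is a hypothetical format assumed here for illustration. It is not the documented output of the script in the 2016 post, so the parsing would need adjusting to whatever the mapping step actually writes.

```python
import csv
from typing import Dict, Iterable, List, Tuple


def load_mapping(rows: Iterable[str]) -> Dict[str, str]:
    """Read hypothetical 'GI<TAB>accession.version' lines into a dict."""
    mapping = {}
    for fields in csv.reader(rows, delimiter="\t"):
        # Skip blank lines and rows where the accession column is empty.
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def split_mapped(gis: Iterable[str],
                 mapping: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate the queried GIs into those with an accession and those without."""
    mapped, unmapped = {}, []
    for gi in gis:
        if gi in mapping:
            mapped[gi] = mapping[gi]
        else:
            unmapped.append(gi)
    return mapped, unmapped
```

The unmapped list is the part worth inspecting by hand, since those GIs would otherwise drop out of the downstream sequence extraction without any warning.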

ADD REPLY • link 4.4 years ago by protein_guru • 0