Question: Retrieving all sequences of specific gene from an organism
6 months ago by
marcoooo0 wrote:


I know that similar questions have been asked here, but I still haven't found a fitting answer.

I need to download all the nucleotide sequences of a specific gene of a virus from GenBank. Not only is it difficult to find the published sequences of the gene itself, but I would also like to find the copies annotated within whole genomes.

For instance, let's say I need all nucleotide sequences of the "EBNA-1" gene of Human herpesvirus 4. Is there a way to download a FASTA file of all published EBNA-1 sequences, including the ones annotated in complete genomes? There are far too many sequences to do it manually, but all the searches I tried return sequences from almost random organisms. I have mainly used the NCBI website to test the searches, and eUtils (esearch and efetch) for the downloads.

Thanks a lot in advance.

Best, Marco

virus database genome gene • 262 views
ADD COMMENTlink written 6 months ago by marcoooo0

If you have already tried eUtils, can you tell us how you did the search? Did that method not work?

ADD REPLYlink written 6 months ago by genomax63k

Mainly I tried this:

> esearch -db nucleotide -query "search_terms" | efetch -format fasta

As search terms I tried different combinations, such as "EBNA-1", "EBNA-1 AND human herpesvirus 4", etc. The results are usually a few of the published sequences plus whole genomes.

ADD REPLYlink modified 6 months ago by genomax63k • written 6 months ago by marcoooo0
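
A free-text term like the ones above is matched against all fields, which may explain hits from unrelated organisms. A minimal sketch of a field-restricted query follows, assuming the `[Organism]` and `[Gene Name]` search fields of the nucleotide database and the standard E-utilities `esearch.fcgi` endpoint. The helper names `build_query` and `esearch_url` are made up for illustration, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_query(gene, organism):
    """Combine a gene name and an organism into one fielded Entrez query.

    Hypothetical helper: quoting each value keeps multi-word names together,
    and the field tags restrict where each value is searched.
    """
    return f'"{organism}"[Organism] AND "{gene}"[Gene Name]'


def esearch_url(query, db="nuccore", retmax=500):
    """Return an esearch URL for the query (nothing is sent over the network)."""
    params = {"db": db, "term": query, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    query = build_query("EBNA-1", "Human gammaherpesvirus 4")
    print(query)
    print(esearch_url(query))
```

The same query string should also be usable with the command-line tools, e.g. as the `-query` argument of `esearch -db nucleotide`. Records whose annotation spells the gene differently will still be missed, so it may be worth trying more than one spelling.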


I think you can do this directly in NCBI. There are several options, something like:

1. Search for your gene in NCBI and fetch all published articles related to it. (I tested your gene name and found only ~2500 articles in total.)
2. Use PubMed batch download to get these articles first.
3. Confirm the gene IDs and other information.
4. Download the FASTA sequences directly from NCBI.

In my view it's easy... :)


ADD REPLYlink written 6 months ago by archana.bioinfo87100


I apologize, but I don't think I understand the process. How do I get from the articles to the IDs of the sequences to download from NCBI (not one by one, I mean, as there are thousands, as you say)? I can download all the sequences published with a paper, but I'll get many whole genomes and sequences I'm not interested in, as papers usually don't publish sequences of a single gene only.


ADD REPLYlink written 6 months ago by marcoooo0
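
One possible way around the whole-genome problem is to fetch the coding sequences of each record rather than the full record, and then keep only the gene of interest. The sketch below assumes the records were downloaded in efetch's CDS nucleotide FASTA format (`-format fasta_cds_na`), whose headers, as far as I know, carry bracketed tags such as `[gene=...]`. The helper names are made up for illustration, and the sample records are invented and are not real GenBank data.

```python
import re


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_gene(text, gene):
    """Keep records whose header has a [gene=...] tag equal to `gene`.

    The comparison is case-insensitive; records without a gene tag are dropped.
    """
    wanted = gene.lower()
    kept = []
    for header, seq in read_fasta(text):
        match = re.search(r"\[gene=([^\]]+)\]", header)
        if match and match.group(1).lower() == wanted:
            kept.append((header, seq))
    return kept


# Invented records for illustration only (assumed header layout).
SAMPLE = """\
>lcl|EXAMPLE1_cds_1 [gene=EBNA-1] [protein=nuclear antigen EBNA-1]
ATGTCTGACGAG
GGGCCAGGT
>lcl|EXAMPLE1_cds_2 [gene=LMP-1] [protein=latent membrane protein 1]
ATGGAACACGAC
"""

for header, seq in filter_by_gene(SAMPLE, "EBNA-1"):
    print(f">{header}\n{seq}")
```

This only finds genes that the submitters actually annotated with that exact name, so unannotated genomes or records using another gene symbol would need a different approach (for example a similarity search).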

You should be able to modify my script, here, such that it returns nucleotide sequences instead of protein sequences: A: How to download all sequences of a list of proteins for a particular organism

I tested it for your gene already and it works:

/usr/bin/python2.7 -e -t "EBNA-1"
>YP_001129471.1 EBNA-1 [Human herpesvirus 4 type 2]

>YP_401677.1 nuclear antigen EBNA-1 [Human gammaherpesvirus 4]
ADD REPLYlink written 6 months ago by Kevin Blighe39k

Thanks for the suggestion! I tried your script, but the number of sequences I get (even if I search for the protein sequences instead of the nucleotide ones) is really low. For instance, if I follow your example and search for "EBNA-1", I download 8 sequences (out of the hundreds published for the gene). Am I missing something?

ADD REPLYlink written 6 months ago by marcoooo0

The other sequences may be published, but have they been submitted to Entrez?

ADD REPLYlink written 6 months ago by Kevin Blighe39k