Question: Retrieving all sequences of a specific gene from an organism
I know that similar questions have been asked here, but I still haven't found a fitting answer.

I need to download all the nucleotide sequences of a specific gene of a virus from GenBank. Not only is it difficult to find the published sequences of the gene itself, but I would also like to find the ones within whole genomes (annotated ones, of course).

For instance, let's say I need all nucleotide sequences of the "EBNA-1" gene of Human herpesvirus 4. Is there a way to download a FASTA file of all published EBNA-1 sequences, including the ones annotated in complete genomes? The number of sequences I'm looking at is far too large to handle manually, and every search I have tried returns sequences from almost random organisms. I have mainly been using the NCBI website to test the searches, and eUtils (esearch and efetch) for the downloads.

Thanks a lot in advance.

Best, Marco

If you have already tried eUtils, can you tell us how you ran the search? Did that method not work?

Mainly I tried this:

> esearch -db nucleotide -query "search_terms" | efetch -format fasta

As search terms I tried different combinations, such as "EBNA-1", "EBNA-1 AND human herpesvirus 4", etc. The results I get are usually a few of the published sequences plus whole genomes.
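
One thing that may help with the "almost random organisms" problem is restricting each term to an Entrez search field rather than searching all fields. Below is a minimal Python sketch that builds such a query and the matching esearch URL. The function names (`build_query`, `esearch_url`) are made up for illustration, the organism name is simply the one that appears elsewhere in this thread, and whether `[Gene Name]` catches every annotated genome depends on how each record was indexed, so treat it as a starting point rather than a guaranteed recipe.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_query(organism, gene_names):
    """Combine an organism restriction with alternative gene spellings.

    Hypothetical helper: each term is tied to an Entrez field tag so that
    free-text hits in unrelated records are not returned.
    """
    genes = " OR ".join(f'"{name}"[Gene Name]' for name in gene_names)
    return f'"{organism}"[Organism] AND ({genes})'


def esearch_url(query, db="nuccore", retmax=10000):
    """Return the esearch URL for a query (no request is sent here)."""
    params = {"db": db, "term": query, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


# Assumed spellings; check which ones are actually used in the records.
query = build_query("Human gammaherpesvirus 4", ["EBNA-1", "EBNA1"])
print(query)
print(esearch_url(query))
```

The printed query string can also be pasted into the NCBI website, or passed to `esearch -db nucleotide -query` in place of "search_terms".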

I think you can do this through NCBI itself. There are several options, such as:

1. Search for your gene in NCBI and fetch all published articles related to it. (I quickly tested your gene name and found only ~2500 articles in total.)
2. Use the PubMed batch download to get these articles first.
3. Confirm the gene IDs and other information.
4. Download the FASTA sequences directly from NCBI.

In my view it's easy. :)


I apologize, but I don't think I understand the process exactly. How do I get from the articles to the IDs of the sequences to download from NCBI (without doing it one by one, I mean, as there are thousands of them, as you say)? I can download all the sequences published with a paper, but I'll end up with many whole genomes and sequences I'm not interested in, as papers usually do not publish only the sequence of a single gene.
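
For the whole-genome part of the problem, one possible route is to fetch the annotated coding sequences of each record (efetch offers a `fasta_cds_na` format for this) and then keep only the entries for the gene of interest. The Python sketch below covers the filtering step only. It assumes the headers carry bracketed tags such as `[gene=...]` and `[protein=...]`, which is how I understand that format to look; the function names and the pattern are hypothetical, and the pattern will need adjusting to however the gene is actually labelled in your records.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_gene(records, pattern):
    """Keep records whose gene or protein tag matches a regex pattern.

    Assumes headers contain bracketed key=value tags, e.g. [gene=EBNA-1].
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for header, seq in records:
        tags = dict(re.findall(r"\[(\w+)=([^\]]*)\]", header))
        label = tags.get("gene", "") + " " + tags.get("protein", "")
        if regex.search(label):
            yield header, seq


def to_fasta(records, width=70):
    """Format (header, sequence) pairs back into FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Usage would be along the lines of `to_fasta(keep_gene(parse_fasta(text), r"EBNA-?1"))`, where `text` is the content of the downloaded CDS file. Records whose annotation uses a different name for the gene would be missed, so it is worth inspecting a few headers first.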


You should be able to modify my script, linked here, so that it returns nucleotide sequences instead of protein sequences: A: How to download all sequences of a list of proteins for a particular organism

I tested it for your gene already and it works:

/usr/bin/python2.7 -e -t "EBNA-1"
>YP_001129471.1 EBNA-1 [Human herpesvirus 4 type 2]

>YP_401677.1 nuclear antigen EBNA-1 [Human gammaherpesvirus 4]

Thanks for the suggestion! I tried your script, but the number of sequences I get (even if I search for the protein sequences instead of the nucleotide ones) is really low. For instance, if I follow your example and search for "EBNA-1", I download 8 sequences (out of the hundreds published for the gene). Am I missing something?

The other sequences may be published, but have they been submitted to Entrez?

