Question

Retrieving all sequences of specific gene from an organism

5.6 years ago

marcoooo • 0

Hi,

I know that similar questions have been asked here, but I still haven't found a fitting answer.

I need to download all the nucleotide sequences of a specific viral gene from GenBank. Not only is it difficult to find the published sequences of the gene itself, but I would also like to retrieve the copies found within whole genomes (annotated ones, of course).

For instance, let's say I need all nucleotide sequences of the "EBNA-1" gene of Human herpesvirus 4. Is there a way to download a FASTA file of all published EBNA-1 sequences, including the ones annotated in complete genomes? There are far too many sequences to do this manually, and every search I tried returned sequences from almost random organisms. I have mainly used the NCBI website to test the searches, and eUtils (esearch and efetch) for the downloads.

Thanks a lot in advance.

Best, Marco

gene genome database virus • 1.4k views

ADD COMMENT • link 5.6 years ago by marcoooo • 0

If you have already tried eUtils, can you tell us how you ran the search? Did that method not work?

ADD REPLY • link 5.6 years ago by GenoMax 141k

Mainly I tried this:

> esearch -db nucleotide -query "search_terms" | efetch -format fasta

As search terms I tried different combinations, such as "EBNA-1", "EBNA-1 AND human herpesvirus 4", etc. The results are usually a few of the published sequences plus whole genomes.

ADD REPLY • link updated 5.6 years ago by GenoMax 141k • written 5.6 years ago by marcoooo • 0
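A free-text term such as "EBNA-1" is matched against every indexed field, which may explain the mixed results described above. One possible way to narrow the search is to restrict each term to an Entrez field. The sketch below only builds the query string and the corresponding E-utilities esearch URL; it does not contact NCBI. The function names are illustrative, and the choice of the [Organism], [Gene Name] and [Title] fields, as well as the alternative gene names (EBNA1, and BKRF1, the EBV reading frame that encodes EBNA-1), are assumptions to adapt to your case.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_query(organism, gene_terms):
    """Combine an organism restriction with alternative gene names.

    Each name is searched in the Gene Name field and in the record title
    (an assumed choice of fields; adjust as needed).
    """
    alternatives = " OR ".join(
        f'"{term}"[Gene Name] OR "{term}"[Title]' for term in gene_terms
    )
    return f'"{organism}"[Organism] AND ({alternatives})'


def esearch_url(query, db="nuccore", retmax=10000):
    """Return an esearch URL for the query (nothing is downloaded here)."""
    params = {"db": db, "term": query, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Synonyms are assumptions; add whatever names the annotations use.
    query = build_query("Human gammaherpesvirus 4", ["EBNA-1", "EBNA1", "BKRF1"])
    print(query)
    print(esearch_url(query))
```

The same query string can be pasted into the NCBI Nucleotide search box or passed to `esearch -db nucleotide -query`. Piping the result to `efetch -format fasta_cds_na` instead of `-format fasta` should return the individual annotated coding sequences rather than whole records, which is probably closer to what is wanted for genes embedded in complete genomes.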

Hi,

I think you can do this directly with NCBI. There are several options:

1. Search for your gene in NCBI and fetch all published articles related to it. (I tested your gene name and found only ~2500 articles in total.)
2. Use the PubMed batch download to get these articles first.
3. Confirm the gene IDs and other information.
4. Download the FASTA sequences directly from NCBI.

In my view it's easy :)

Enjoy

ADD REPLY • link 5.6 years ago by archana.bioinfo87 ▴ 210

Hi,

I apologize, but I don't think I fully understand the process. How do I get from the articles to the IDs of the sequences to download from NCBI (without doing it one by one, I mean, as there are thousands, as you say)? I can download all the sequences published with a paper, but then I'll get many whole genomes and sequences I'm not interested in, since authors usually publish more than the sequence of a single gene.

Thanks!

ADD REPLY • link 5.6 years ago by marcoooo • 0

You should be able to modify my script from this answer so that it returns nucleotide sequences instead of protein sequences: A: How to download all sequences of a list of proteins for a particular organism

I tested it for your gene already and it works:

/usr/bin/python2.7 NucFASTASearchByFASTATitle.py -e myemail@email.ie -t "EBNA-1"
>YP_001129471.1 EBNA-1 [Human herpesvirus 4 type 2]
MSDEGPGTGPGNGLGQKEDTSGPDGSSGSGPQRRGGDNHGRGRGRGRGRGGGRPGAPGGSGSGPRHRDGV
RRPQKRPSCIGCKGAHGGTGAGGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGAGAGGA
GAGGGAGAGGAGAGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGA
GAGGAGAGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGA
GAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGAGAGGGGRGRGGSGGRGRGGSGGRGRGGS
GGRRGRGRERARGGSRERARGRGRGRGEKRPRSPSSQSSSSGSPPRRPPPGRRPFFHPVAEADYFEYHQE
GGPDGEPDMPPGAIEQGPADDPGEGPSTGPRGQGDGGRRKKGGWYGKHRGEGGSSQKFENIAEGLRLLLA
RCHVERTTEDGNWVAGVFVYGGSKTSLYNLRRGIGLAIPQCRLTPLSRLPFGMAPGPGPQPGPLRESIVC
YFIVFLQTHIFAEGLKDAIKDLVLPKPAPTCNIKVTVCSFDDGVDLPPWFPPMVEGAAAEGDDGDDGDEG
GDGDEGEEGQE

>YP_401677.1 nuclear antigen EBNA-1 [Human gammaherpesvirus 4]
MSDEGPGTGPGNGLGEKGDTSGPEGSGGSGPQRRGGDNHGRGRGRGRGRGGGRPGAPGGSGSGPRHRDGV
RRPQKRPSCIGCKGTHGGTGAGAGAGGAGAGGAGAGGGAGAGGGAGGAGGAGGAGAGGGAGAGGGAGGAG
GAGAGGGAGAGGGAGGAGAGGGAGGAGGAGAGGGAGAGGGAGGAGAGGGAGGAGGAGAGGGAGAGGAGGA
GGAGAGGAGAGGGAGGAGGAGAGGAGAGGAGAGGAGAGGAGGAGAGGAGGAGAGGAGGAGAGGGAGGAGA
GGGAGGAGAGGAGGAGAGGAGGAGAGGAGGAGAGGGAGAGGAGAGGGGRGRGGSGGRGRGGSGGRGRGGS
GGRRGRGRERARGGSRERARGRGRGRGEKRPRSPSSQSSSSGSPPRRPPPGRRPFFHPVGEADYFEYHQE
GGPDGEPDVPPGAIEQGPADDPGEGPSTGPRGQGDGGRRKKGGWFGKHRGQGGSNPKFENIAEGLRALLA
RSHVERTTDEGTWVAGVFVYGGSKTSLYNLRRGTALAIPQCRLTPLSRLPFGMAPGPGPQPGPLRESIVC
YFMVFLQTHIFAEVLKDAIKDLVMTKPAPTCNIRVTVCSFDDGVDLPPWFPPMVEGAAAEGDDGDDGDEG
GDGDEGEEGQE

ADD REPLY • link 5.6 years ago by Kevin Blighe 87k

Thanks for the suggestion! I tried your script, but the number of sequences I get (even if I search for the protein sequences instead of the nucleotide ones) is really low. For instance, if I follow your example and search for "EBNA-1", I download 8 sequences (out of the hundreds published for the gene). Am I missing something?
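A title-based search can miss copies of the gene that are only annotated as a feature inside a complete genome record. One possible workaround is to download the coding sequences of all matching records (for example with `efetch -format fasta_cds_na`) and then keep only the entries whose header names the gene. The sketch below does that filtering step locally. It assumes headers carry `[gene=...]` and `[protein=...]` qualifiers, as in NCBI's CDS FASTA output; the function names, record IDs and short sequences in `SAMPLE` are made up for illustration and are not real data.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def header_tags(header):
    """Return the [key=value] qualifiers of a header as a dict."""
    return dict(re.findall(r"\[(\w+)=([^\]]*)\]", header))


def select_gene(text, names):
    """Yield records whose gene or protein qualifier contains any name."""
    wanted = [name.lower() for name in names]
    for header, seq in parse_fasta(text):
        tags = header_tags(header)
        labels = [tags.get("gene", "").lower(), tags.get("protein", "").lower()]
        if any(name in label for name in wanted for label in labels):
            yield header, seq


# Hypothetical input: invented IDs and placeholder sequences.
SAMPLE = """\
>lcl|GENOME_A_cds_1 [gene=BKRF1] [protein=EBNA-1] [location=1..12]
ATGTCTGACGAG
>lcl|GENOME_A_cds_2 [gene=BZLF1] [protein=Zta] [location=20..31]
ATGATGGACCCA
>lcl|GENOME_B_cds_1 [protein=nuclear antigen EBNA-1] [location=5..16]
ATGTCTGACGAG
"""

if __name__ == "__main__":
    for header, seq in select_gene(SAMPLE, ["EBNA-1", "BKRF1"]):
        print(f">{header}\n{seq}")
```

Matching is a case-insensitive substring test, so "EBNA-1" also catches "nuclear antigen EBNA-1". Annotations vary between submitters, so it is worth checking which names actually occur in your download before settling on the list.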