Best BLAST search strategy: NCBI nucleotide databases vs. genomes downloaded through the Datasets portal
3 months ago
Mani • 0

Hi,

I am trying to find out whether my query sequence is present in any fish (Teleostei) or Mollusca genomes. I am unsure which of two search strategies to use:

  1. Download the eukaryotic "nt" and "RefSeq" databases, BLAST my query sequence against the whole database, and then select and download the genomes that produce hits.
  2. Search the Datasets portal (https://www.ncbi.nlm.nih.gov/datasets/genome/) for fish (Teleostei, 2,131 genomes) or Mollusca (335 genomes), download all available genomes, and BLAST my query sequence against them. This would require a lot of computing resources. My question is: are these genomes already included in the "nt" and "RefSeq" databases? If so, I would not need the second strategy and could just search against the databases, which is simpler.

Would you please help me with that?

Cheers,
Mani

Blast NCBI • 669 views
nt.gz                nucleotide database from GenBank excluding the
                        batch division htgs, est, gss,sts, pat divisions, 
                        and wgs entries.  Not non-redundant.

So you will not get whole-genome shotgun (WGS) sequences if you use nt.

That said, you should still start with nt and limit your searches by taxID: Gnathostomata (7776) and Mollusca (6447).


Thanks for your comments

3 months ago
Joe 21k

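nt (the nucleotide counterpart of the protein database nr) contains most of the GenBank nucleotide records NCBI has on deposit. However, as the README excerpt quoted earlier in this thread notes, it excludes WGS entries, so it will not necessarily contain all of your fish and mollusc genomes.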

RefSeq might contain all of these, but it is unlikely. RefSeq is a highly curated, non-redundant collection of genomes and their annotations, and it generally includes representative genomes for a given organism rather than every genome for that organism. RefSeq is short for Reference Sequence and is intended to provide the 'gold standard' genome for each species that has a complete genome available.

You may still get what you need using RefSeq depending on your task, but if you want to capture all the diversity in a given gene for example, you will need nr (or nt/your database of preference).
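
One way to see how much of a genome list is covered by RefSeq is to look at the assembly accession prefix: NCBI assembly accessions starting with `GCF_` are RefSeq assemblies, while those starting with `GCA_` are GenBank assemblies. The sketch below assumes you already have a list of assembly accessions (for example exported from the Datasets portal); the function name `assembly_source` and the accessions shown are made-up placeholders.

```python
from collections import Counter


def assembly_source(accession):
    """Classify an NCBI assembly accession by its prefix.

    GCF_ marks a RefSeq assembly and GCA_ a GenBank assembly;
    anything else is reported as "unknown".
    """
    prefix = accession.strip().upper()[:4]
    if prefix == "GCF_":
        return "RefSeq"
    if prefix == "GCA_":
        return "GenBank"
    return "unknown"


# Placeholder accessions for illustration only, not real assemblies.
accessions = ["GCF_000000000.1", "GCA_000000000.1", "GCA_000000001.1"]

# Tally how many assemblies fall into each source category.
counts = Counter(assembly_source(acc) for acc in accessions)
print(dict(counts))
```

Assemblies that only have a `GCA_` accession are not part of RefSeq, so a RefSeq-only search would miss them.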


Thanks for the comments
