2
0
Entering edit mode
5.7 years ago
ziv84 • 0

Hi all,

I'm new with Biostars - so, sorry if anything.

I'm also new with local blast and my question might be a bit "stupid" but I've already spent almost two days trying to solve the problem and totally have know idea how to do that... I've already learned BLAST manual pages, googled the subject and read several relevant topics in here and on Seqanswers as well, but haven't got any success.. I would be very appreciative for any help.

So, the subject is: I'm trying to "restrict" nt BLAST database to perform search of my queries (several millions of short reads) against sequences which belong only to certain taxon(s). My pipeline to do that was based on this topic Vertebrate Subset Nr Database? Build My Own? and included such steps:

1) getting id of taxon of interest using NCBI Taxonomy Browser. E.g. it was taxon Chlorophyta. According to the browser the taxon has id as 3041 ( http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=3041&lvl=3&lin=f&keep=1&srchmode=1&unlock )

2) getting GI list of genes associated with the taxon using NCBI Nucleotide database. I've performed search with the phrase "txid3041[Orgn]" in Nucleotide part of NCBI. It said "Found 1052995 nucleotide sequences.   Nucleotide (427919)   EST (569265)   GSS (55811)". Ok, I've exported the GI list via "Send To" -> "File" -> "GI List" and got text file Chlorophyta_nucleotide.gi.txt with list of numbers (guess gene IDs (gids)):

916534476
916534474
916534472
916534470
916534468
916534466

...

3) Further using blastcmd I've tried to extract from my local nt base sequences matching to the list of gids with the following command:

blastdbcmd -db nt -dbtype 'nucl' -entry_batch ../Chlorophyta_nucleotide.gi.txt -out nt_tst.fa

BUT got a lot of errors like:

Second attempt was with redirecting of errors into separate file error.log (-logfile option)

The result is:

427919 - lines in file with gids (Chlorophyta_nucleotide.gi.txt) - it's in agreement with the report of search in Nucleotide database (see above)

287257 - lines in file error.log (all of them looks like "Error: XXXXXXXXX: OID not found")

140662 - lines starting with ">" in output file nt_tst.fa

140662+287257=427919 - seems that far from all entries in list of gids have been found in nt database

After some time with google I've learn that the problem could be connected with preparing of local nt database: if you are using manually prepared nt database (downloading fasta files with nt base followed by preparing of the base with makeblastdb script) it's important to use -parse_id option to escape the problem "OID not found" in future.

But I downloaded pre-formatted version of nt base from NSBI's ftp. According to their README pre-formatted nt base don't need further preparation before making a search (i.e. it's "ready-to-use"). Anyway to be sure that my nt base in "good shape" I performed simple blastn - it works.

Further check:

blastdbcmd -db nt -info

Database: Nucleotide collection (nt)
30,527,720 sequences; 95,485,076,457 total bases

Date: Jun 17, 2015  3:03 AM    Longest sequence: 774,434,471 bases

Volumes:
/media/RAID/blastdb/nt.00
/media/RAID/blastdb/nt.01
/media/RAID/blastdb/nt.02
/media/RAID/blastdb/nt.03

...

/media/RAID/blastdb/nt.29

cat /media/RAID/blastdb/nt.nal
#
# Alias file created 06/17/2015 03:15:04
#
TITLE Nucleotide collection (nt)
DBLIST "nt.00" "nt.01" "nt.02" "nt.03" "nt.04" "nt.05" "nt.06" "nt.07" "nt.08" "nt.09" "nt.10" "nt.11" "nt.12" "nt.13" "nt.14" "nt.15" "nt.16" "nt.17" "nt.18" "nt.19" "nt.20" "nt.21" "nt.22" "nt.23" "nt.24" "nt.25" "nt.26" "nt.27" "nt.28" "nt.29"
NSEQ 30527720
LENGTH 95485076457

As I'm understanding - everything is ok with the database.

Might be "Nucleotide" database is not equal "nt" database? As I'm understanding (based on the descriptions of the databases) nt base should be equal or even bigger than Nucleotide one:

nt (the description obtained via "?" button in web version of blast)

"Title:Nucleotide collection (nt)
Description:The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.
Molecule Type:mixed DNA
Update date:2015/07/14
Number of sequences:31076527"

Nucleotide (the description copypasted from main page of Nucleotide part of NCBI http://www.ncbi.nlm.nih.gov/nuccore )

"The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery."

As for nonredundancy of nt database there is some discrepancy with description of "downlodable" version of nt database on NCBI's ftp:

"

nt.*tar.gz                    | Partially non-redundant nucleotide sequences from
all traditional divisions of GenBank, EMBL, and DDBJ
excluding GSS,STS, PAT, EST, HTG, and WGS.

"

It seems "my" nt base could be non-redundant and according that should not be smaller than Nucleotide one. Consequently, I suppose that nt base should contain all entries from my gid list.

To be sure that everything is ok with my GI list I performed the same steps with another list of gids (taxonomy id: 2 from NCBI's Taxonomy Browser) and got the same result: most gids haven't been found in nt base (""Error: XXXXXXXXX: OID not found").

So, the questions are as following:

1) Is everything ok with my local nt base? Should I check smth else to be sure..?

2) Was my "restriction" pipeline wrong in some steps? Do you know more effective way how to perform search for big list of queries against nt-database-sequencies which belong to organisms from certain taxon? Restrictions are: queries are some mix of sequences from a number of non-model organisms.

3) Is it actually normal situation with getting the error " OID not found" if the pipeline correct? Should nt database contain all gids from my list(s)?

Would be very appreciative for any help.

Thank you for your attention )

PS:

The overall task is to perform "decontamination" of short-reads array before de-novo genome assembly step. We are currently have no idea about list of "contaminating" organisms, except of consideration that our target organism is eukaryotic and "contaminants" are prokaryotic. The amount of "contaminating" reads is quite big - roughly 30% or even more.. Might be there is some other effective way to perform the "decontamination"?

blast • 3.3k views
1
Entering edit mode
5.7 years ago
5heikki 9.7k

This is related to GI numbers not being for forever, at least not in any ncbi sequence databases. IMO the best way to get the relevant GI numbers is straight from the db itself as in this example. Then you just use that list for blastdb_aliastool. Also, there are much faster methods than blast for decontamination, e.g. DeconSeq.

0
Entering edit mode

Thank you for your answer. I saw previously the cookbook page but haven't tried it yet cause of some misunderstanding: for example, if I run the following command

blastdbcmd -db nt -entry all -outfmt "%g %T" | \
awk ' { if ($2 == 3041) { print$1 } } ' | \
blastdbcmd -db nr -entry_batch - -out chlorophyta_sequences.txt

will I get list of sequencies belonging to all organisms in phylum Chlorophyta (including, e.g. organisms in class Chlorodendrophyceae with Taxonomy ID 1524962) or just some subset? Do you know?

Also thank you for advise with software. I've slightly postponed to try other ways for decontamination than blast-way cause of the latter looks more clear and informative for me (who is my "contamination"). But seems you're right and blast-way much more ineffective in terms of speed. I've found the article http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej2015100a.html and now going to try the software as well as DeconSeq.

2
Entering edit mode

The example linked is not appropriate for your use-case, it will not select subtaxa but only the exact taxon. Therefore, you need to generate a list of all relevant subtaxa from the taxonomy and then filter the gi list for those.

1
Entering edit mode

You're right. I think the list can be fetched from here. Or with Entrez Direct:

esearch -db taxonomy -query txid2[Subtree] | efetch -format uid > txid.list

I would say in this case it would be better to:

blastdbcmd -db nt -entry all -outfmt "%g %T" > temp.file

and then (if enough memory):

join -1 1 -2 2 -o 2.1 -t $'\t' <(sort -k1,1 txid.list) <(sort -k2,2 temp.file) > bac.gi and then proceed with blastdb_aliastool ADD REPLY 0 Entering edit mode Thank you. I've used almost the same way with some modifications. Firstly I've manually updated nt db (Aug 12 version). Further: 1. to extract subtaxa ids I've used gi_taxid_nucl.dmp file (ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid.readme ) which is matching gi with txid 2. to get list of txids of Bacteria taxa I've used the file categories.dmp from taxcat dump (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat_readme.txt ) grep '^B' taxcat_dmp_Aug2015.dmp | cut -f 3 > Bacteria_taxids.txt  3. to get full list of gis from nt db I've used blastdbcmd -db nt -entry all -outfmt "%g" > GI_list_nt  4. Cause of I'm not so cool to use join, small Perl script helped me to extract GIs from gi_taxid_nucl.dmp which have taxid presented in Bacteria_taxids.txt and further to check presence of such GIs in GI_list_nt... Seems should work.. But. The result of (4) was the following: 36539237 - GIs in GI_list_nt 6861060 - GI have been extracted..  Meanwhile using Nucleotide db at NCB web page and "txid2[Orgn]" query one could currently extract 19900899 GIs (http://www.ncbi.nlm.nih.gov/nuccore/?term=txid2%5BOrgn%5D ).. Whta is that? Moreover. My local nt db: blastdbcmd -db nt -info Database: Nucleotide collection (nt) 31,561,398 sequences; 101,581,910,962 total bases  If nt and Nucleotide dbs are almost the same then the dbs consists of more than 50% from bacterial seqs. But guess that can't be true. But even if not, I think that 6 mln of bacterial seqs from >30 mln of all nt's seqs is too small part... Isn't it? ADD REPLY 0 Entering edit mode Try it the way I spelled out and see how many sequences you get. Also, nucleotide and nt are not the same. nt is non-redundant to some extent. Also, if you check on the left from your last link, you'll see that Refseq has ca. 4M bacterial sequences. 7M bacterial seqs in nt might be about right.. Also consider the difference in count of bacteria and eukaryota in nuccore, 20M bacterial seqs, 155M eukaryota seqs: esearch -db nuccore -query txid2[Orgn] .. <Count>19900899</Count> .. esearch -db nuccore -query txid2759[Orgn] .. <Count>154846939</Count> .. The ratios 7/31 20/155 are not that far off and 155M still excludes archaea, viruses and unclassified. Total seq count in nuccore is right now 198,787,644 (einfo -db nuccore) ADD REPLY 0 Entering edit mode I have use this. And at the "blastdbcmd -db nr -entry all -outfmt "%g %T" > temp.file" I got a file. N/A 0 N/A 0 N/A 0  What is the problem? ADD REPLY 0 Entering edit mode 5.7 years ago Run update_blastdb.pl nt in the directory where your blast databases are stored and extract the tar files, then see if the problem persists. A fraction of oids might have been added between june 17 and today. ADD COMMENT 0 Entering edit mode Thank you for the answer. I've tried to use the script to download nt db from NCBI's and failed with the same error "nt not found". After that I downloaded the db manually. Now I tried to run it again: /media/RAID/blastdb$ update_blastdb nt
Connected to NCBI
nt not found, skipping.

but I have nt db in path... Don't know what is it...

And also: do you think that between June 17th and today they have added more than half of current sequences for Chlorophyta? Does it possible?

0
Entering edit mode

I do not know why update_blastdb nt doesn't work, it works fine for me. If you have downloaded the 32 volumes manually, you will still have to extract them, but you don't need update_blastdb

0
Entering edit mode

Unfortunately I can't check the script right now with my updated nt db. But the older version consisted of 29 volumes and worked in blast (seems the db wasn't corrupted). But not with update_blastdb.pl although the db wasn't up-to-date when I tried...

0
Entering edit mode

Have you solved your problem! If not please try to use -parse_seqids throughout the BLAST script, it makes tags for your ids in blast database which helps in parsing data using blastdbcmd. Thanks