I'm new with Biostars - so, sorry if anything.
I'm also new with local blast and my question might be a bit "stupid" but I've already spent almost two days trying to solve the problem and totally have know idea how to do that... I've already learned BLAST manual pages, googled the subject and read several relevant topics in here and on Seqanswers as well, but haven't got any success.. I would be very appreciative for any help.
So, the subject is: I'm trying to "restrict" nt BLAST database to perform search of my queries (several millions of short reads) against sequences which belong only to certain taxon(s). My pipeline to do that was based on this topic Vertebrate Subset Nr Database? Build My Own? and included such steps:
1) getting id of taxon of interest using NCBI Taxonomy Browser. E.g. it was taxon Chlorophyta. According to the browser the taxon has id as 3041 ( http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=3041&lvl=3&lin=f&keep=1&srchmode=1&unlock )
2) getting GI list of genes associated with the taxon using NCBI Nucleotide database. I've performed search with the phrase "txid3041[Orgn]" in Nucleotide part of NCBI. It said "Found 1052995 nucleotide sequences. Nucleotide (427919) EST (569265) GSS (55811)". Ok, I've exported the GI list via "Send To" -> "File" -> "GI List" and got text file Chlorophyta_nucleotide.gi.txt with list of numbers (guess gene IDs (gids)):
916534476 916534474 916534472 916534470 916534468 916534466
3) Further using blastcmd I've tried to extract from my local nt base sequences matching to the list of gids with the following command:
blastdbcmd -db nt -dbtype 'nucl' -entry_batch ../Chlorophyta_nucleotide.gi.txt -out nt_tst.fa
BUT got a lot of errors like:
Error: XXXXXXXXX: OID not found
Second attempt was with redirecting of errors into separate file error.log (-logfile option)
The result is:
427919 - lines in file with gids (Chlorophyta_nucleotide.gi.txt) - it's in agreement with the report of search in Nucleotide database (see above)
287257 - lines in file error.log (all of them looks like "Error: XXXXXXXXX: OID not found")
140662 - lines starting with ">" in output file nt_tst.fa
140662+287257=427919 - seems that far from all entries in list of gids have been found in nt database
After some time with google I've learn that the problem could be connected with preparing of local nt database: if you are using manually prepared nt database (downloading fasta files with nt base followed by preparing of the base with makeblastdb script) it's important to use -parse_id option to escape the problem "OID not found" in future.
But I downloaded pre-formatted version of nt base from NSBI's ftp. According to their README pre-formatted nt base don't need further preparation before making a search (i.e. it's "ready-to-use"). Anyway to be sure that my nt base in "good shape" I performed simple blastn - it works.
blastdbcmd -db nt -info Database: Nucleotide collection (nt) 30,527,720 sequences; 95,485,076,457 total bases Date: Jun 17, 2015 3:03 AM Longest sequence: 774,434,471 bases Volumes: /media/RAID/blastdb/nt.00 /media/RAID/blastdb/nt.01 /media/RAID/blastdb/nt.02 /media/RAID/blastdb/nt.03 ... /media/RAID/blastdb/nt.29
cat /media/RAID/blastdb/nt.nal # # Alias file created 06/17/2015 03:15:04 # TITLE Nucleotide collection (nt) DBLIST "nt.00" "nt.01" "nt.02" "nt.03" "nt.04" "nt.05" "nt.06" "nt.07" "nt.08" "nt.09" "nt.10" "nt.11" "nt.12" "nt.13" "nt.14" "nt.15" "nt.16" "nt.17" "nt.18" "nt.19" "nt.20" "nt.21" "nt.22" "nt.23" "nt.24" "nt.25" "nt.26" "nt.27" "nt.28" "nt.29" NSEQ 30527720 LENGTH 95485076457
As I'm understanding - everything is ok with the database.
Might be "Nucleotide" database is not equal "nt" database? As I'm understanding (based on the descriptions of the databases) nt base should be equal or even bigger than Nucleotide one:
nt (the description obtained via "?" button in web version of blast)
"Title:Nucleotide collection (nt)
Description:The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.
Molecule Type:mixed DNA
Number of sequences:31076527"
Nucleotide (the description copypasted from main page of Nucleotide part of NCBI http://www.ncbi.nlm.nih.gov/nuccore )
"The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery."
As for nonredundancy of nt database there is some discrepancy with description of "downlodable" version of nt database on NCBI's ftp:
nt.*tar.gz | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS,STS, PAT, EST, HTG, and WGS.
It seems "my" nt base could be non-redundant and according that should not be smaller than Nucleotide one. Consequently, I suppose that nt base should contain all entries from my gid list.
To be sure that everything is ok with my GI list I performed the same steps with another list of gids (taxonomy id: 2 from NCBI's Taxonomy Browser) and got the same result: most gids haven't been found in nt base (""Error: XXXXXXXXX: OID not found").
So, the questions are as following:
1) Is everything ok with my local nt base? Should I check smth else to be sure..?
2) Was my "restriction" pipeline wrong in some steps? Do you know more effective way how to perform search for big list of queries against nt-database-sequencies which belong to organisms from certain taxon? Restrictions are: queries are some mix of sequences from a number of non-model organisms.
3) Is it actually normal situation with getting the error " OID not found" if the pipeline correct? Should nt database contain all gids from my list(s)?
Would be very appreciative for any help.
Thank you for your attention )
The overall task is to perform "decontamination" of short-reads array before de-novo genome assembly step. We are currently have no idea about list of "contaminating" organisms, except of consideration that our target organism is eukaryotic and "contaminants" are prokaryotic. The amount of "contaminating" reads is quite big - roughly 30% or even more.. Might be there is some other effective way to perform the "decontamination"?