Insert sequence in nt database
4.5 years ago

Hello, I'm sure this isn't possible, but I want to clear up my doubts. Is there a way to insert some sequences generated by my lab into my downloaded copy of the nt database, without having to submit them to NCBI? Thanks, and sorry for this question.

nt • 3.5k views
4.5 years ago
Michael 55k

Given you have nt downloaded and in the BLASTDB path, and your additional sequences are in mysequences.fna in your working directory, the following should work:

makeblastdb -in mysequences.fna -dbtype nucl -title "some sequences I found" -out mysequences -parse_seqids
blastdb_aliastool -dblist nt mysequences -dbtype nucl -title "nt database + my own sequences" -out ntandmore

After that you can run for example:

 blastn -db ntandmore ...

So I ran this command and got 6 files. Now I should add them to the directory where I have the nt database, right?


I am not sure how many files you get; ls mysequences.* ntandmore.* should list all the files that make up the BLAST database. Those need to be in the BLASTDB directory together with the nt database, except mysequences.fna, the input FASTA file. Make sure to also download and extract the nt database again, because it looks as if your previous actions might have overwritten its files. The best way of doing this is to simply change directory to $BLASTDB and generate everything there.

Otherwise, running cp mysequences.* ntandmore.* $BLASTDB should also do it.
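
For anyone who wants to leave the input FASTA out when copying, here is a minimal sketch. It assumes the file names used in this thread (mysequences.*, ntandmore.*) and that $BLASTDB points to a single directory; the function name copy_db_files is just made up for illustration.

```shell
# Copy database and alias files into a target directory, skipping the FASTA input.
# File names follow the thread; copy_db_files is a hypothetical helper name.
copy_db_files() {
    dest="$1"
    for f in mysequences.* ntandmore.*; do
        [ -e "$f" ] || continue               # the glob matched nothing
        case "$f" in
            *.fna|*.fa|*.fasta) continue ;;   # the FASTA input is not part of the database
        esac
        cp -- "$f" "$dest"/
    done
}

# Only runs when BLASTDB is set and is an existing directory.
if [ -n "${BLASTDB:-}" ] && [ -d "$BLASTDB" ]; then
    copy_db_files "$BLASTDB"
fi
```

Run it from the directory where makeblastdb and blastdb_aliastool wrote their output.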

4.5 years ago

Assuming you are talking about BLAST nt db:

Off the top of my head, I can think of the following:

Edit: These are separate options for doing what you've asked. Also, this is not a complete, foolproof tutorial; you need to check the manuals for the tools.

1) You can unpack nt to FASTA (blastdbcmd), append your sequences, and create a new database (makeblastdb).

2) Create a database from the new sequences and combine the databases using blastdb_aliastool (https://www.ncbi.nlm.nih.gov/books/NBK279693/).


Option two would be the way to go. There is no point in trying to re-make the indexes for a locally modified nt.


So I ran the second command, which generated a .nal file. Is this correct?


Did you try aligning with the new name you created for the aliased DB?


So I ran this command: blastdb_aliastool -dblist mysequence -dbtype nucl -title "test nt" -out nt ; and then I got a file named nt.nal. Should I add this to the directory where I have the whole nt database? I also named it nt.


-dblist needs to include both mysequence and nt. I would not use nt as the -out name, since that would mess up the existing nt.nal file. Please follow the example shown by @Michael below closely. Finally, use ntandmore (or the name you choose) for the actual search.


Yes, but I still have doubts, because when I run the commands like @Michael I get 5 files. So my question is: after that, how can I run a blast command with my sequences and the nt database? If I name it like ntandmore, is this file the compilation of the nt database and my own sequences?


If I name it like ntandmore, is this file the compilation of the nt database and my own sequences?

Correct. @Michael did show you how to search with the new combined database alias with an example in the answer below.

Note: Make sure you did not mess up your original nt.nal file that described all parts of nt database, since you used the name -out nt for the combined database based on a post above.
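
A quick way to check this is to look inside the alias files, which are plain text. The sketch below assumes they contain a line starting with DBLIST that names the databases or volumes the alias points to, as in the BLAST+ versions I have seen; show_dblist is a made-up helper name.

```shell
# Print the DBLIST line(s) of a BLAST alias (.nal) file.
# show_dblist is a hypothetical helper; .nal files are assumed to be plain text.
show_dblist() {
    grep '^DBLIST' "$1"
}

# Inspect whichever alias files exist in the current directory.
for alias_file in nt.nal ntandmore.nal; do
    if [ -f "$alias_file" ]; then
        printf '%s: ' "$alias_file"
        show_dblist "$alias_file"
    fi
done
```

If nt.nal lists only your own database instead of the nt volumes, the original alias was most likely overwritten and nt should be extracted again.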


Thank you for the help, @genomax. I have another question: is there any way to get the IDs of these sequences? In my BLAST runs I narrow the search to include only some IDs.


If you created your own database with the -parse_seqids option, you should get the IDs of the sequences in your search.


That's a good point, I have added the switch to my code too.


Usually I run the search with -taxidlist. I got the list of IDs from this script: get_species_taxids.sh. But since I added 3 new species to my database, I need the IDs generated for them... I can't do it with blastdbcmd because it only takes the ID and not the name...


Hi, it would be good if you kept each comment in the thread it relates to. It certainly matters how you generate your own database. If it is generated with correct taxids, as you are used to, the combined alias db will also have correct taxids. I don't quite understand your problem: are you saying that you are trying to add sequences for species that do not have an entry in the NCBI taxonomy? That seems to be a question that goes beyond what was originally asked.


Yes I want to add sequences for species that do not have an entry in NCBI taxonomy... I'm so sorry if I wasn't clear.


Isn't that what you added using mysequences (or whatever you called your sequences)?


I'm sorry, I didn't understand what you said...


You had sequences you were interested in and you created a separate blast database for those. You then created a common alias for both nt and your sequences to search against both? Isn't that correct?


Yes, that's correct. I now need the sequence ID of the new insertions.


What is IDS?


I corrected the message above


I now need the sequence ID of the new insertions

You did not actually insert the sequences into the nt database. You searched against an alias that includes both nt and your data. So the results you get by searching against this combined alias should contain your IDs (they are different from what exists in nt, correct?).

If you are not seeing them, it is possible that the limit on how many alignments are reported in your results is excluding hits from your data.

I don't know if the order of databases specified in the alias makes a difference (it may) so instead of

blastdb_aliastool -dblist nt mysequences -dbtype nucl -title "nt database + my own sequences" -out ntandmore

you may want to create an alias where your sequences are listed first in the alias.

blastdb_aliastool -dblist mysequences nt -dbtype nucl -title "nt database + my own sequences" -out ntandmore

See if that helps bring results for your IDs up first.
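
To see whether any hits from your own sequences made it into the results at all, something like the sketch below could help. It assumes tabular output (-outfmt 6) with the subject ID in column 2, and that your IDs are the first word of each FASTA header; the function name and file names are only examples. If nothing comes back, raising the blastn option -max_target_seqs may be worth a try.

```shell
# Print rows of a tabular BLAST result whose subject ID (column 2)
# matches an ID from the given FASTA file. A leading "lcl|" is ignored.
# hits_from_my_sequences is a hypothetical helper name.
hits_from_my_sequences() {
    results="$1"
    fasta="$2"
    awk -F'\t' '
        # First file (FASTA): remember the first word of every header line.
        NR == FNR { if (/^>/) { split(substr($0, 2), h, /[ \t]/); ids[h[1]] = 1 }; next }
        # Second file (results): keep rows whose subject ID is one of ours.
        { id = $2; sub(/^lcl\|/, "", id); if (id in ids) print }
    ' "$fasta" "$results"
}

# Example call (file names are placeholders):
# hits_from_my_sequences results.tsv mysequences.fna
```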


So this is what my sequences to add to the nt database look like: >Seq_2 [organism=Pleoticus robustus] COX1 gene, parcial cds. I don't put an ID here... that's why I'm confused about where to get the ID of this sequence...


If the fasta header of your sequence is

>Seq_2 [organism=Pleoticus robustus] COX1 gene, parcial cds

then Seq_2 would likely show up in your results.
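
In other words, the ID is simply the first word after the > sign. A small sketch to list the IDs of every sequence in a FASTA file (list_fasta_ids is a made-up name, and mysequences.fna is the file name from the answer above):

```shell
# List sequence IDs: the header text between '>' and the first whitespace.
# list_fasta_ids is a hypothetical helper name.
list_fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Example call:
# list_fasta_ids mysequences.fna
```

Depending on the BLAST+ version and output format, the same ID may be reported with a prefix such as lcl| in the results.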


So I don't think I did this right... 1) I ran blastdbcmd and got 6 files with extensions similar to those of nt. 2) I ran blastdb_aliastool, which generated a .nal file. Then I copied these files (from steps 1 and 2) to my nt database directory. Previously I pointed BLAST to my nt database by defining an environment variable in Windows. Should I do the same for this new update?


Looks like you did not do this right.

blastdbcmd is used to pull sequences out of a database. If they are already in nt then there is no point. If you are getting them from some other BLAST database then fine. Otherwise, please follow the two commands detailed by @Michael after you prepare/obtain a file containing your sequences of interest in multi-FASTA format. Make sure they don't have identifiers that are already present in nt.
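
A rough sketch for checking the identifiers before building the database. The first function only looks for IDs repeated within your own file; the second asks blastdbcmd whether an ID already resolves in a local nt, so it assumes BLAST+ is installed and nt can be found. Both function names are made up.

```shell
# Print any ID that occurs more than once in the FASTA file.
duplicate_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort | uniq -d
}

# Print IDs from the FASTA file that blastdbcmd can already find in nt.
# Requires blastdbcmd on PATH and a local nt database.
ids_already_in_nt() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" |
    while read -r id; do
        if blastdbcmd -db nt -entry "$id" -outfmt '%a' >/dev/null 2>&1; then
            echo "$id"
        fi
    done
}

# Example calls:
# duplicate_ids mysequences.fna
# ids_already_in_nt mysequences.fna
```

Both should print nothing if the identifiers are safe to use.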


I just managed to add these sequences to my database, but now the BLAST output gives me NA instead of the species... Can you help me?


Gives you NA for taxonomy?

Make sure you create a custom taxonomy file as described on this page and then use -taxid_map option with that file name.
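
As a sketch of what such a file can look like: as far as I know, -taxid_map expects a plain text file with one line per sequence, the sequence ID followed by a taxid, and it needs -parse_seqids. The helper below writes the same taxid for every sequence in a FASTA file; the name make_taxid_map and the taxid 12345 are placeholders, not real values.

```shell
# Write "<sequence ID> <taxid>" for every sequence in a FASTA file.
# make_taxid_map is a hypothetical helper; the taxid must be supplied by you.
make_taxid_map() {
    fasta="$1"
    taxid="$2"
    awk -v t="$taxid" '/^>/ { sub(/^>/, "", $1); print $1, t }' "$fasta"
}

# Example (12345 is a placeholder taxid, file names follow the thread):
# make_taxid_map mysequences.fna 12345 > taxid_map.txt
# makeblastdb -in mysequences.fna -dbtype nucl -parse_seqids \
#     -taxid_map taxid_map.txt -out mysequences
```

Note that scientific names in the output are looked up from the taxid, so a species without an NCBI taxonomy entry may still show up without a name.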


I added that option; however, I still get NA... I created a custom taxonomy file as described on the page, and then passed the file name to the -taxid_map option in the makeblastdb command... Could it be the header of the FASTA file?


Very likely. They may need to look like this.

>db|UniqueIdentifier|EntryName

I'm creating a new question about this issue.


I don't know if there is a new issue. We can't see your data or what you are doing to help further.

Please refer to this part of the thread so people don't ask you to do the same thing we have gone over again in your new question.


OK, then I will continue here... I still get NA for the taxonomy... I have now changed my FASTA headers to look like this: ref|NC_2345|Pleoticus robustu cytochrome c oxidase subunit I (COI) gene, parcial cds; mitochondrial. The BLAST results give NC_2345 as the ID; however, the tax name still comes back as NA...


Which version of blast+ are you using? I may test this out if I can find some time.


I'm using BLAST version 2.10.0. Thanks for the help...


I am able to get the taxID output in results for a custom blast database with blast v.2.10.0 using -outfmt '6 qseqid sseqid evalue bitscore sacc staxid'


Can you tell me the header of the sequences that you added?


I posted a detailed example in other thread: A: Create a costum taxonomy file
