Insert sequence in nt database
2.7 years ago

Hello, I'm sure this isn't possible, but I want to clear up my doubts. Is there a way to add sequences generated by my lab to a downloaded copy of the nt database, without having to submit them to NCBI? Thanks, and sorry for this question.

2.7 years ago

makeblastdb -in mysequences.fna -dbtype nucl -title "some sequences I found" -out mysequences -parse_seqids
blastdb_aliastool -dblist nt mysequences -dbtype nucl -title "nt database + my own sequences" -out ntandmore


After that you can run for example:

 blastn -db ntandmore ...


So I ran this command and got 6 files. Should I now add them to the directory where I have the nt database?


I am not sure how many files you should get; ls mysequences.* ntandmore.* should list all the files that make up the BLAST database. Those need to be in the BLASTDB directory together with the nt database, except mysequences.fna, the input FASTA file. Make sure to also download and extract the nt database again, because it looks as if your previous actions might have overwritten its files. The best way of doing this is simply to change directory to $BLASTDB and generate everything there. Otherwise, running `cp mysequences.* ntandmore.* $BLASTDB` should also do it.
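The copy step described above can be sketched as a small shell function. The function name copy_db_files and the list of FASTA extensions it skips are assumptions made for illustration; only the mysequences/ntandmore names and the BLASTDB directory come from this thread.

```shell
# Hypothetical helper (not part of BLAST+): copy database files into a
# target directory, skipping the input FASTA, which BLAST does not need.
copy_db_files() {
    dest=$1
    shift
    for f in "$@"; do
        case "$f" in
            *.fna|*.fa|*.fasta) continue ;;  # input FASTA, not a database file
        esac
        if [ -f "$f" ]; then
            cp -- "$f" "$dest"/
        fi
    done
}

# Example use, assuming BLASTDB points at the directory that holds nt:
# copy_db_files "$BLASTDB" mysequences.* ntandmore.*
```

Copying rather than moving keeps the originals in place until a test search against the alias has worked.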

2.7 years ago

Assuming you are talking about BLAST nt db:

Off the top of my head, I can think of the following:

Edit: These are separate options for doing what you've asked. Also, this is not a complete, foolproof tutorial; you need to check the manuals for the tools.

1) Unpack nt to FASTA (blastdbcmd), append your sequences, and create a new database (makeblastdb).

2) Create a database from the new sequences and combine the two databases using blastdb_aliastool (https://www.ncbi.nlm.nih.gov/books/NBK279693/).


Option 2 would be the way to go. There is no point in trying to re-make the indexes for a locally modified nt.


So I ran the second command, which generated a .nal file. Is this correct?


Did you try aligning with the new name you created for the aliased DB?


So I ran this command: blastdb_aliastool -dblist mysequence -dbtype nucl -title "test nt" -out nt ; and then I got a file named nt.nal. Should I add this to the directory where I have the whole nt database? I also named it nt.


-dblist needs to include both mysequence and nt. I would not use nt as the -out name, since that would mess up the existing nt.nal file. Please closely follow the example shown by @Michael below. Finally, use ntandmore (or the name you choose) for the actual search.
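One way to confirm that an alias really covers both databases is to inspect the .nal file itself. The sketch below assumes the alias file is plain text with a line starting with DBLIST whose entries are separated by spaces and possibly quoted; that layout, and the function name alias_lists, are assumptions to verify against your own file.

```shell
# Hypothetical check (not part of BLAST+): succeed only if every named
# database appears on the DBLIST line of the given alias file.
alias_lists() {
    nal=$1
    shift
    # Take the DBLIST line, drop the keyword and any double quotes.
    entries=$(grep '^DBLIST' "$nal" | sed 's/^DBLIST//' | tr -d '"')
    for db in "$@"; do
        found=no
        for e in $entries; do
            if [ "$e" = "$db" ]; then
                found=yes
            fi
        done
        if [ "$found" != yes ]; then
            return 1
        fi
    done
    return 0
}

# Example use:
# alias_lists ntandmore.nal nt mysequences && echo "alias covers both"
```

An alias created with only -dblist mysequence would fail this check for nt, which matches the problem described above.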


Yes, but I still have my doubts, because when I run the commands like @Michael I get 5 files. So my question is: after that, how can I run a blast command with my sequences and the nt database? If I name it like ntandmore, is this file the compilation of the nt database and my own sequences?


If I name it like ntandmore, is this file the compilation of the nt database and my own sequences?

Correct. @Michael did show you how to search with the new combined database alias with an example in the answer below.

Note: Make sure you did not mess up your original nt.nal file, which describes all parts of the nt database, since you used -out nt as the name for the combined database based on a post above.


Thank you for the help @genomax. I have another question: is there any way to get the IDs of these sequences? In my BLAST searches I narrow the search to include only some IDs.


If you created your own database with the -parse_seqids option, you should get the IDs of the sequences in your search.


That's a good point, I have added the switch to my code too.


Usually I run the search with -taxidlist. I got the list of IDs from this script: get_species_taxids.sh. But since I added 3 new species to my database, I need the IDs generated for them... I can't do it with blastdbcmd because it only takes the ID and not the name...


Hi, it would be good if you kept each comment in the thread it relates to. It certainly matters how you generate your own database. If it is generated with correct taxids, as you were used to, the combined alias db will also have correct taxids. I don't quite understand your problem: are you saying that you are trying to add sequences for species that do not have an entry in the NCBI taxonomy? That seems to be a question that goes beyond what was originally asked.


Yes, I want to add sequences for species that do not have an entry in the NCBI taxonomy... I'm so sorry if I wasn't clear.


Isn't that what you added using mysequences (or whatever you called your sequences)?


I'm sorry, I didn't understand what you said...


You had sequences you were interested in, and you created a separate BLAST database for those. You then created a common alias for both nt and your sequences so you could search against both. Isn't that correct?


Yes, that's correct. I now need the sequence ID of the new insertions.


What is IDS?


I corrected the message above


I now need the sequence ID of the new insertions

You did not actually insert the sequences into the nt database. You searched against an alias that includes both nt and your data. So the results you get by searching against this combined alias should contain your IDs (they were different from what exists in nt, correct?).

If you are not seeing them, it is possible that the limit on how many alignments are reported in your results is excluding hits from your data.

I don't know if the order of databases specified in the alias makes a difference (it may) so instead of

blastdb_aliastool -dblist nt mysequences -dbtype nucl -title "nt database + my own sequences" -out ntandmore


you may want to create an alias where your sequences are listed first in the alias.

blastdb_aliastool -dblist mysequences nt -dbtype nucl -title "nt database + my own sequences" -out ntandmore


See if that helps bring results from your IDs up first.


So this is what the sequences I want to add to the nt database look like: >Seq_2 [organism=Pleoticus robustus] COX1 gene, parcial cds. Here I don't put an ID... that's why I'm confused about where to get the ID of this sequence...


>Seq_2 [organism=Pleoticus robustus] COX1 gene, parcial cds


then Seq_2 would likely show up in your results.
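To see whether any of your own sequences appear among the hits, you could filter tabular BLAST output by the identifiers in your FASTA file. This is a minimal sketch that assumes tab-separated output with the subject ID in column 2 (as with -outfmt 6) and that a local ID may be reported either bare or with an lcl| prefix; the function name own_hits is hypothetical.

```shell
# Hypothetical helper: print only the hits whose subject ID (column 2 of
# tab-separated BLAST output) matches an identifier in the FASTA file.
own_hits() {
    fasta=$1
    results=$2
    awk -F'\t' '
        NR == FNR {                       # first file: the FASTA
            if ($0 ~ /^>/) {
                id = substr($0, 2)        # drop the leading ">"
                sub(/[ \t].*/, "", id)    # keep the first word only
                ids[id] = 1
            }
            next
        }
        {                                 # second file: BLAST results
            s = $2
            sub(/^lcl\|/, "", s)          # tolerate an lcl| prefix
            if (s in ids) print
        }
    ' "$fasta" "$results"
}

# Example use:
# own_hits mysequences.fna results.tsv
```

If this prints nothing, either there were no hits to your sequences or they fell outside the number of alignments reported.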


So I don't think I did this right... 1) I ran blastdbcmd and got 6 files with extensions similar to those of nt. 2) I ran blastdb_aliastool, which generated a .nal file. I then copied these files (from steps 1 and 2) to my nt database directory. Previously I pointed BLAST to my nt database by defining a variable in the Windows system; should I do the same with this new update?


Looks like you did not do this right.

blastdbcmd is used to pull sequences out of a database. If they are already in nt, then there is no point. If you are getting them from some other BLAST database, then fine. Otherwise, please follow the two commands detailed by @Michael after you prepare or obtain a file containing your sequences of interest in multi-FASTA format. Make sure they don't have identifiers that are already present in nt.
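Before building the database, a quick sanity check on the identifiers in the multi-FASTA file may help. The sketch below only detects identifiers repeated within your own file; the function name duplicate_ids is hypothetical, and the commented blastdbcmd call is one possible way to test a single ID against nt, assuming nt is reachable through BLASTDB.

```shell
# Hypothetical helper: print any sequence identifier that occurs more than
# once in a multi-FASTA file (the identifier is the first word after ">").
duplicate_ids() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | sort | uniq -d
}

# Example use:
# duplicate_ids mysequences.fna
#
# To check one identifier against nt, something like the following could be
# tried; it should report that the entry was not found if the ID is unused:
# blastdbcmd -db nt -entry Seq_2 -outfmt '%a'
```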


I just managed to add these sequences to my database, but now the BLAST output gives me NA instead of the species... Can you help me?


Gives you NA for taxonomy?

Make sure you create a custom taxonomy file as described on this page, and then use the -taxid_map option with that file name.
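As a rough sketch of what such a mapping file can look like: makeblastdb's -taxid_map option takes a text file with one sequence ID and one taxid per line. The helper below writes such a file from the FASTA headers, assigning the same taxid to every record. The function name make_taxid_map and the taxid 12345 are placeholders, not values from this thread; you would need to supply a taxid that actually exists in the NCBI taxonomy.

```shell
# Hypothetical helper: write "sequence_id taxid" for every record in a
# FASTA file, giving all records the same taxid.
make_taxid_map() {
    fasta=$1
    taxid=$2
    awk -v taxid="$taxid" '/^>/ { print substr($1, 2), taxid }' "$fasta"
}

# Example use (12345 is a placeholder taxid, replace it with a real one):
# make_taxid_map mysequences.fna 12345 > taxid_map.txt
# makeblastdb -in mysequences.fna -dbtype nucl -parse_seqids \
#     -taxid_map taxid_map.txt -out mysequences
```

The IDs in the first column have to match the identifiers that makeblastdb parses from the FASTA headers, which is why -parse_seqids is used alongside -taxid_map here.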


I added that option, but I still get NA... I created a custom taxonomy file as described on the page, and then passed that file name to the -taxid_map option of makeblastdb... Could it be the header of the FASTA file?


Very likely. They may need to look like this.

>db|UniqueIdentifier|EntryName


I don't know if there is a new issue. We can't see your data or what you are doing to help further.

Please refer to this part of the thread so people don't ask you to do the same thing we have gone over again in your new question.


OK, then I will continue here... I still get NA as the result for the taxonomy... I have now changed my FASTA headers to look like this: ref|NC_2345|Pleoticus robustu cytochrome c oxidase subunit I (COI) gene, parcial cds; mitochondrial. The BLAST results give NC_2345 as the ID; however, the tax name still comes back as NA...


Which version of blast+ are you using? I may test this out if I can find some time.


I'm using BLAST version 2.10.0. Thanks for the help...


I am able to get the taxID output in results for a custom blast database with blast v.2.10.0 using -outfmt '6 qseqid sseqid evalue bitscore sacc staxid'


Can you tell me the header of the sequences that you added?


I posted a detailed example in other thread: A: Create a costum taxonomy file