Blast a sequence with N's
1
0
Entering edit mode
10 weeks ago

Dear all

I'm blasting some sequences on nt database, on my local server. My sequences are short (<100 bp) with this caracteristic:

seq 40N seq

Where seq ~28 bp

my command is the following:

blastn -task blastn-short -evalue 0.1 -qcov_hsp_perc 60 -word_size 7 -dust no -reward 2 -penalty -3 -gapopen 5 -gapextend 2 -query FASTA -db nt  Blast don't found any match in the databses. But if i run the same command with the "-remote" option, i get back som site of interest. Someone know why this dfference? I revise my databse and it contains all of the founded sequences. If the seq ~38 bp, and the size of the all sequence is more large than 100bp, blastn local found the same as remote running. Someone can explian me what's the problem? Regards! sequence blastn Short • 706 views ADD COMMENT 0 Entering edit mode are you using exactly the same version of the nt DB ? for the matches you get back from the remote search, do they comply with the filters you have in the cmdline? (eg; is the evalue of the remote hits not smaller than for the local one?) ADD REPLY 0 Entering edit mode Dear Lieven My local database is from January 27th 2021, the remote is May 14th 2021. I don't think that could be the explanation. I revise that some hits founded in remote are present in my local DB. The evalue filter for the 2 commands are the same: 0.1 Should i put a smaller one? ADD REPLY 0 Entering edit mode I'm not saying it is but I would also not so rapidly conclude it is not related to the database. They change quickly and with adding/removing sequences the e-values of the hits changes as well. So best to double-check anyway. take one of the hits of the remote blast and look it up in the local blast result (might need to re-run it with a less stringent Evalue) and see what e-value it gets in the local search ADD REPLY 0 Entering edit mode What is the significance of N? Are there actual N's or are they representing variable sequence? ADD REPLY 0 Entering edit mode Dear GenoMax, There are N as we don't care exactly about with nuc are inside que 2 sequences: For example: ATCGTAGCTNNNNNNNNNNNNNNNNNNNNNNNNATCGTAGCT Regards ADD REPLY 0 Entering edit mode Is the spacing important? Otherwise you could simply use the two sequences at the end and do searches that way. ADD REPLY 0 Entering edit mode Hi Geno. Yes, the space between each sequence matters. As the sequence are the same, i could do like this right. And after search the space beetwen each result to check if it's a hit. More job than with -remote option. The problem here is that we got a lot of sequences like this. And the reote otion limit to hundred search / day. Thank's ADD REPLY 1 Entering edit mode well, I'm not fully convinced of your statement here. Since you already indicate that you are limited to 100 searches / day using -remote , you will in the end be quicker and more efficient checking distance after doing search. Search only for one of the short sequences (ends are identical you said?) Get the tabular output, parse/filter that one, I can see this going fairly straightforward tbh. Moreover, even with the approach you try to follow here you can never be sure that the hit that blasts returns will have exactly 40nt spacer between them, blast might gap the alignment or report two HSPs without the spacer. Bottom line if the spacer is so crucial you need to double check that anyway. ADD REPLY 1 Entering edit mode Dear Lieven You're right. The tabular output have to be used in this case (Really, i'm heping someone else that ask me about this issue :-). I will propose to him this better method: i don't realiza yet the possibility of some gaps in the N's. But anyway, i will send some request to NCBI just to understand what's wrong between this 2 method. I'm just waiting about get the actualized nt to run on local. In any case i'll send you their explanation if there is one. Thank's a lot! ADD REPLY 0 Entering edit mode While I still stand by my idea above, I accidentally came across this paragraph from the NCBI blast guides: A common use of this page is to check the specificity of PCR or hybridization primers. A useful way to check a pair of PCRprimers is to first concatenate them by inserting string of 20 or more N's in between the two primers, and then search theconcatenated pair as one sequence. Since BLAST looks for local alignments and automatically searches both strands, there is no need to reverse complement the reverse primer before doing the concatenation or the search just to say your approach was not that bad :) , though the case described above is not exactly your use case. ADD REPLY 0 Entering edit mode you might also consider to run the normal blastn (task -blastn) , the short version is actually for sequences below 50nt, which yours are not it seems. If you do then set the word-size to less than the default 11 . ADD REPLY 0 Entering edit mode Dear lieven. No hits found with -task blastn command. That should be normal, as the sequence "seq" is less than 50 bp, no? regards ADD REPLY 0 Entering edit mode dunno, you have 2x 28 + 40 , sounds like more than 50 ;) but ok, no hits thus. (also not with the -remote) ? ADD REPLY 0 Entering edit mode Ok, sorry. You're right. That's more than 50. As the N's mean at least one nucleotide.. The -remote give me back hits... Thank's! ADD REPLY 0 Entering edit mode are you requesting default alignment output from blast? if so can you check and post here for both the remote and local blast the last couples of line in it (all at the bottom), it should say things like 'effective database size' and such . ADD REPLY 0 Entering edit mode My BD is 3 months old. Not sure that could explain the problem: For the local search: Lambda K H 0.634 0.408 0.912 Gapped Lambda K H 0.625 0.410 0.780 Effective search space used: 1816805494865 Database: Nucleotide collection (nt) Posted date: Jan 27, 2021 4:44 PM Number of letters in database: 365,495,364,989 Number of sequences in database: 66,695,813 Matrix: blastn matrix 2 -3 Gap Penalties: Existence: 5, Extension: 2  For the remote search : Lambda K H 0.634 0.408 0.912 Gapped Lambda K H 0.625 0.410 0.780 Effective search space used: 0 Database: Nucleotide collection (nt) Posted date: May 14, 2021 10:43 AM Number of letters in database: 440,843,297,299 Number of sequences in database: 70,303,946 Matrix: blastn matrix 2 -3 Gap Penalties: Existence: 5, Extension: 2  ADD REPLY 0 Entering edit mode not showing any discrepancies what I hoped for but you can still see that the database itself has changed significantly over those past months. That itself should not be the cause of the issue here (I think) but it will have an effect on the evalues being reported so currently my best guess is on that. Did you had a look already at the check I proposed here above somewhere? (check one hit in detail?) ADD REPLY 0 Entering edit mode Hi Lieven. I'm waiting for the ultimate chunck of the nt databases to download to run the blastn with the more recent dabatase. And yes i will do some sutdy about searching each of sequence, with tabular output, and do some "simplistic" mathematics to detect the real hits we are searching. Thank's a lot about all of your coments. ADD REPLY 0 Entering edit mode fair enough. (though you could also do that check with the old local DB, but I can understand you'll wait for the update nt ) if you are only interested in identical matches with fixed spacer you could consider doing some simple reg-ex matching (I just thought of) one of the top hits doing a google search gives me this: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2637886/ perhaps worth considering? ADD REPLY 1 Entering edit mode 10 weeks ago I've got the explanation from NCBI. The problem was the qcov_hsp_perc parameter, as follow: It appears that you do not fully understand on of the parameter setting's meaning, "-qcov_hsp_perc 60" to be more specific. You are trying to force a search parameter setting between web and standalone. For many detailed aspects, such as detailed search settings, there is NO direct comparison. For web blast, submitted through -remote, my understanding is that for -remote submitted searches, NCBI blast server will ignore parameters it does not use. In this case if returned result since " " option is NOT used by the web. Doing your remote search, but using a customized tabular output, you can see that NONE of the hits (HSPs) covers the query more thant 40% - as indicated by last column:  blastn -task blastn -evalue 0.1 -qcov_hsp_perc 60 -word_size 7 -dust no -reward 2 -penalty -3 -gapopen 5 -gapextend 2 -query Q -db nt  -outfmt '7 std qcovhsp'  -out output_7.txt  -remote&

# BLASTN 2.11.0+
# Query: seq1
# RID: ACJKM90V013
# Database: nt
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, % query coverage per hsp
# 1724 hits found
seq1    CP074120.1      100.000 28      0       0       61      88      1337192 1337219 0.006   51.8    32
seq1    CP065024.1      100.000 28      0       0       61      88      3189147 3189174 0.006   51.8    32
seq1    CP034658.1      100.000 28      0       0       61      88      892034  892007  0.006   51.8    32
seq1    CP047010.1      100.000 28      0       0       61      88      1472454 1472427 0.006   51.8    32
seq1    CP047002.1      100.000 28      0       0       61      88      4804955 4804928 0.006   51.8    32


In standalone, this is being used and all hits are filtered out as they should. If you do want to find hits, drop this parameter.

Regards,

Tao Tao, PhD NCBI User Services

Thank's a lot for all of you that answer me! Regards

0
Entering edit mode

jerome.verleyen : It is not clear if everything in the post above after as follow is a direct copy/paste of response from NCBI. If you added your own comments can you separate them out with quote formatting tool. It is the button to the left of 101010 button in the edit bar. Your command had no " so I am not sure how that relates to the response above.