Question: Unexpected BLAST results
1
gravatar for shirani.s
3.1 years ago by
shirani.s10
shirani.s10 wrote:

I'm using this blastn command to find identity between a number of sequences. This is the command:

blastn -query IAM_O_DNA_0001_0001.txt -subject IAM_O_DNA_0001_0001.txt -outfmt 10 -max_target_seqs 100000 1>BLAST_O_0001_0001.txt

I'm basically blasting the same file against itself. The file contains 10 sequences each with 5842792 nucleotides. I'm expecting the output to be something like this:

1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007 
1,2,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007

and since there are 10 sequences, it should be 100 lines (10 X 10). But the actual output is like this (first 6 lines):

1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
1,1,100.00,5629,0,0,2608960,2614588,2401317,2395689,0.0,10395
1,1,100.00,5629,0,0,2395689,2401317,2614588,2608960,0.0,10395
1,1,99.69,5214,14,2,3593811,3599023,1890756,1885544,0.0,9539
1,1,99.69,5214,14,2,1885544,1890756,3599023,3593811,0.0,9539
1,1,99.83,4594,8,0,5108830,5113423,1876943,1881536,0.0,8440

with thousands of lines. I'm not sure what's happening. The program works for smaller sequences (same number of sequence, less nucleotides). Any suggestions as to why this is happening and how to fix it?

output blastn nucleotide • 1.1k views
ADD COMMENTlink modified 3.1 years ago by h.mon24k • written 3.1 years ago by shirani.s10

If you are only interested in the "top" hit (blast will find other "local" matches and that is why you have all those additional entries) then have you considered limiting output by e-value?
If these are sequences that are similar (in composition/length) and you would like to get global alignments/similarities then consider using an alternate program like lastz or Needleman-Wunsch.

ADD REPLYlink written 3.1 years ago by genomax65k
  • Aren't are the e-value s 0.00 in the above output? Could you explain a bit more?
  • If this is the case, how come it works on other shorter alignments but not this one?
ADD REPLYlink written 3.1 years ago by shirani.s10
1

Since I did not see an -e-value limit in your command line (and I don't remember the BLASTn output tabular columns by heart) I had suggested limiting with -e-value. You are correct that the e-values for those smaller hits are also 0.0 (column 11) so that would not work.
Options here may then be limit by alignment length, increase the gap open penalties. Are these internal repeats?

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by genomax65k
3
gravatar for h.mon
3.1 years ago by
h.mon24k
Brazil
h.mon24k wrote:

Your input is chromosome sized, I guess you are seeing duplications within / between your sequences. You may want to play with -max_target_seqs, -culling_limit, -qcov_hsp_perc, -perc_identity to get the output closer to what you need, in addition to -evalue as genomax2 suggested.

P.S.: but see this bug / feature of -max_target_seqs

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by h.mon24k

Thanks, this helped a lot. I used -qcov_hsp_perc100 here and worked fine for mine. I'm just going to explain further below for future reference. The problem in my output was that it is aligning genes in the middle of the alignment as well as the whole query. I needed a command to filter BLASTn outputs to whole alignment (i.e qstart 1, qend 5842729, sstart 1, send 5842729). Using -qcov_hsp_perc 100, I can tell BLAST to print the output covering the whole alignment (not the duplicate genes in between).

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by shirani.s10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1526 users visited in the last hour