Question

Unexpected BLAST results

1

Entering edit mode

8.1 years ago

shirani.s ▴ 10

I'm using this blastn command to find identity between a number of sequences. This is the command:

blastn -query IAM_O_DNA_0001_0001.txt -subject IAM_O_DNA_0001_0001.txt -outfmt 10 -max_target_seqs 100000 1>BLAST_O_0001_0001.txt

I'm basically blasting the same file against itself. The file contains 10 sequences each with 5842792 nucleotides. I'm expecting the output to be something like this:

1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007 
1,2,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007

and since there are 10 sequences, it should be 100 lines (10 X 10). But the actual output is like this (first 6 lines):

1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
1,1,100.00,5629,0,0,2608960,2614588,2401317,2395689,0.0,10395
1,1,100.00,5629,0,0,2395689,2401317,2614588,2608960,0.0,10395
1,1,99.69,5214,14,2,3593811,3599023,1890756,1885544,0.0,9539
1,1,99.69,5214,14,2,1885544,1890756,3599023,3593811,0.0,9539
1,1,99.83,4594,8,0,5108830,5113423,1876943,1881536,0.0,8440

with thousands of lines. I'm not sure what's happening. The program works for smaller sequences (same number of sequence, less nucleotides). Any suggestions as to why this is happening and how to fix it?

blastn nucleotide output • 2.4k views

ADD COMMENT • link updated 8.1 years ago by h.mon 35k • written 8.1 years ago by shirani.s ▴ 10

0

Entering edit mode

If you are only interested in the "top" hit (blast will find other "local" matches and that is why you have all those additional entries) then have you considered limiting output by e-value?
If these are sequences that are similar (in composition/length) and you would like to get global alignments/similarities then consider using an alternate program like lastz or Needleman-Wunsch.

ADD REPLY • link 8.1 years ago by GenoMax 141k

0

Entering edit mode

Aren't are the e-value s 0.00 in the above output? Could you explain a bit more?
If this is the case, how come it works on other shorter alignments but not this one?

ADD REPLY • link 8.1 years ago by shirani.s ▴ 10

1

Entering edit mode

Since I did not see an -e-value limit in your command line (and I don't remember the BLASTn output tabular columns by heart) I had suggested limiting with -e-value. You are correct that the e-values for those smaller hits are also 0.0 (column 11) so that would not work.
Options here may then be limit by alignment length, increase the gap open penalties. Are these internal repeats?

ADD REPLY • link 8.1 years ago by GenoMax 141k

score 3 · Answer 1 · 2016-04-05

3

Entering edit mode

8.1 years ago

h.mon 35k

Your input is chromosome sized, I guess you are seeing duplications within / between your sequences. You may want to play with -max_target_seqs, -culling_limit, -qcov_hsp_perc, -perc_identity to get the output closer to what you need, in addition to -evalue as genomax2 suggested.

P.S.: but see this bug / feature of -max_target_seqs

ADD COMMENT • link 8.1 years ago by h.mon 35k

0

Entering edit mode

Thanks, this helped a lot. I used -qcov_hsp_perc100 here and worked fine for mine. I'm just going to explain further below for future reference. The problem in my output was that it is aligning genes in the middle of the alignment as well as the whole query. I needed a command to filter BLASTn outputs to whole alignment (i.e qstart 1, qend 5842729, sstart 1, send 5842729). Using -qcov_hsp_perc 100, I can tell BLAST to print the output covering the whole alignment (not the duplicate genes in between).

ADD REPLY • link 8.1 years ago by shirani.s ▴ 10