I'm using the following blastn command to compute the identity between a number of sequences:
blastn -query IAM_O_DNA_0001_0001.txt -subject IAM_O_DNA_0001_0001.txt -outfmt 10 -max_target_seqs 100000 1>BLAST_O_0001_0001.txt
I'm basically blasting the same file against itself. The file contains 10 sequences, each 5842792 nucleotides long. I expected the output to look something like this:
1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
1,2,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
and since there are 10 sequences, there should be 100 lines (10 x 10). But the actual output looks like this (first 6 lines):
1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
1,1,100.00,5629,0,0,2608960,2614588,2401317,2395689,0.0,10395
1,1,100.00,5629,0,0,2395689,2401317,2614588,2608960,0.0,10395
1,1,99.69,5214,14,2,3593811,3599023,1890756,1885544,0.0,9539
1,1,99.69,5214,14,2,1885544,1890756,3599023,3593811,0.0,9539
1,1,99.83,4594,8,0,5108830,5113423,1876943,1881536,0.0,8440
with thousands of lines. I'm not sure what's happening. The command works as expected for smaller sequences (same number of sequences, fewer nucleotides). Any suggestions as to why this is happening and how to fix it?
If you are only interested in the "top" hit (BLAST also reports other "local" matches, which is why you see all those additional entries), have you considered limiting the output by e-value?
If these sequences are similar (in composition/length) and you would like global alignments/similarities, consider using an alternative program such as lastz, or a global aligner based on Needleman-Wunsch.
Since I did not see an -evalue limit in your command line (and I don't remember the BLASTn tabular output columns by heart), I had suggested limiting with -evalue. You are correct that the e-values for those smaller hits are also 0.0 (column 11), so that would not work. The options here may then be to limit by alignment length or to increase the gap-open penalty. Are these internal repeats?
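If limiting by alignment length is the route taken, one way to do it is to post-filter the CSV output rather than change the BLAST run. The following is only a rough sketch, assuming the default -outfmt 10 column order (alignment length in column 4, bit score in column 12). The function name best_hsp_per_pair and the min_length parameter are made up for illustration, and the sample rows are copied from the question (the 1,2 row is from the expected output, not a real result). It keeps the single highest-scoring HSP for each query/subject pair:

```python
import csv

def best_hsp_per_pair(lines, min_length=0):
    """Keep the highest bit-score row for each (query, subject) pair.

    Assumes the default 12-column -outfmt 10 layout:
    qseqid,sseqid,pident,length,...,evalue,bitscore
    """
    best = {}
    for row in csv.reader(lines):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        pair = (row[0], row[1])
        length = int(row[3])
        bitscore = float(row[11])
        if length < min_length:
            continue  # drop short local matches (e.g. repeats)
        if pair not in best or bitscore > float(best[pair][11]):
            best[pair] = row
    return list(best.values())

# Rows taken from the question; the 1,2 row is the asker's expected output.
sample = """\
1,1,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
1,1,100.00,5629,0,0,2608960,2614588,2401317,2395689,0.0,10395
1,1,99.69,5214,14,2,3593811,3599023,1890756,1885544,0.0,9539
1,2,100.00,5842792,0,0,1,5842792,1,5842792,0.0,1.079e+007
"""

top_hits = best_hsp_per_pair(sample.splitlines())
for row in top_hits:
    print(",".join(row))
```

For a real run, the same function could be fed an open file handle for BLAST_O_0001_0001.txt instead of the sample string. With 10 sequences this should leave at most 100 rows, one per pair.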