I am trying to use BLAST to map short sequences (15 to 30 nucleotides long) to the human reference genome.
I have BLAST installed locally.
These are the options that I used:
blastn -dust no -db ucsc.h19.fasta -outfmt 7 -word_size 7 -evalue 1000 -perc_identity 100 -ungapped -query .. -out ...
I still get far too many matches for my input, although there is in fact only one match in the whole genome. The correct match appears as the first line of the output, which is of course good.
However, around 1000 more matches follow, all with 100% identity, but none of them covers my full query sequence. Let's say my query sequence is 20 bp long: I only want the match where the alignment length is 20, q.start = 1, and q.end = 20. This is only the case for the first line of my output; the remaining lines never have an alignment length of 20. Is there a way to output only the matches where the alignment length equals the length of my query?
I can always filter my output afterwards, but I have a lot of sequences to blast, so I would like to get it as clean as possible from the beginning.
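For reference, the post-filtering I have in mind would look roughly like the sketch below. It assumes the search is rerun with -outfmt "7 std qlen" instead of -outfmt 7, so that the query length is appended as column 13 after the twelve standard columns. The function name filter_full_length is just a placeholder of mine.

```shell
# Keep only hits that cover the whole query at 100% identity.
# Assumption: the input is BLAST tabular output made with -outfmt "7 std qlen",
# i.e. col 3 = % identity, col 4 = alignment length, col 7 = q. start,
# col 8 = q. end, col 13 = query length.
filter_full_length() {
  awk -F'\t' '
    /^#/ { next }                                   # skip the comment lines of outfmt 7
    $3 == 100 && $4 == $13 && $7 == 1 && $8 == $13  # full-length, perfect hits only
  ' "$@"
}
```

It would be called as filter_full_length on the BLAST output file (or with the output piped in), and it prints only the lines where the alignment runs from the first to the last base of the query.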