Question

blastn for short sequences

4

Entering edit mode

10.6 years ago

User6891 ▴ 340

Hi everyone,

I was trying to use blast to map short sequences (between 15 - 30 nucleotides long) to the human reference genome

I have blast locally installed:

these are the options that I used:

blastn -dust no -db ucsc.h19.fasta -outfmt 7 -word_size 7 -evalue 1000 -perc_identity 100 -ungapped -query .. -out ...

I still get way to many matches for my input, while there is in fact only one match in the whole genome. What I see in the output is the good match as the first line, which is off course already good.

However then there are like 1000 matches following, where the percentage identity is 100, but they are never exactly the same match as my query sequence. Lets say my query sequence is 20 bp long, I only want to get the match where the alignment length is 20 and the q.start = 1 & q.end is 20. And this is only the case for the first line in my output. The rest of the lines never has an alignment length of 20. Is there a way to only output the matches where the alignment length is the same as the length of my query?

I can always filter my output afterwards, but I have a lot of sequences to blast, so I would like to get it as clean as possible from the beginning.

blast • 13k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User6891 ▴ 340

Ram · Answer 1 · 2014-11-20

3

Entering edit mode

10.6 years ago

Siva ★ 1.9k

You can try the -task option and set it to blastn-short which is optimized for sequences less than 30 nucleotides.

It uses slightly different default values for some options:

-word_size = 7 (which you used)
-reward = 1
-penalty = -3

For controlling the BLAST hits, can you try the option -qcov_hsp_perc and set it to 100?

From blastn -help

-qcov_hsp_perc <Real, 0..100>
   Percent query coverage per hsp

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Siva ★ 1.9k

0

Entering edit mode

Thanks this -qcov_hsp_perc <Real, 0..100> might do the trick, but it's only present starting from version 2.2.30 released in October 2014. So I will have to ask for an update on our servers

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User6891 ▴ 340

score 1 · Answer 2 · 2014-11-20

A good question that I would also like to know the answer for :)

I usually use the -max_target_seqs switch, with a low value (1-3), to reduce the number of hits and parse those out myself. Often the first hit is also the best alignment, though I double check the interesting hits for true positives.

Use with caution :)