Following up on this question: http://biostar.stackexchange.com/questions/8010/searching-200-400bp-matches-against-mammalian-genome-human-mouse-distance
I would like to know what is the most sensitive method to search sequences ranging from 100bp to 1000bp against a whole genome assembly of another species, for genetic distances similar to human-mouse.
As an example, if I have human sequences ranging from 100 to 1000 bp, which tool and parameter options are the most sensitive for finding hits in the mouse genome?
I've been recommended to look at blast and blast+, but it would be great to hear about specific parameter settings or different options.
First, I must say that I don't quite understand the difference between this question and the previous one (for which you provided some decent blast parameters yourself). You are asking for 'the most sensitive' method, not for the fastest one. Thus, I would probably discount blat, ssaha and their ilk. You might consider discontiguous megablast, but I would probably stick to blast+ for a general-purpose comparison. I am sure that at human-mouse distance, you will find >95% of all relevant similarities by discontiguous megablast or by a simple blastn with default parameters.
If you want to get the other 5%, the strategy depends on why they were missed in the first place.
If the input sequence was rather short (<200bp), it might be advisable to lower the word size from 11 to 9 or even 7.
If there are repeats involved (ERVs, Alus, transposons, whatever), it is possible that dust and/or RepeatMasker will have killed everything alignable. These cases are inherently difficult. It might be necessary to switch off the repeat masking, but this could mean sifting through thousands of
If the sequences are just too divergent, this could be another tough problem. This can happen even at human-mouse distance. If there is a chance that your query sequence has some coding bits, it might be useful to also try tblastx.
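The three cases above map roughly onto BLAST+ command lines like the ones built below. This is only a sketch: `query.fa` and `mouse_genome` are placeholder names, the function merely assembles argument lists and runs nothing, and `-dust no` only disables the low-complexity filter applied to the query (any RepeatMasker masking already present in your input files is a separate matter).

```python
def blast_rescue_commands(query="query.fa", db="mouse_genome"):
    """Build BLAST+ argument lists for the three rescue strategies.

    `query` and `db` are placeholder names; nothing is executed here.
    """
    common = ["-query", query, "-db", db, "-outfmt", "6"]
    return {
        # Short queries: use a smaller seed word than the blastn default of 11.
        "short_word": ["blastn", "-task", "blastn", "-word_size", "7"] + common,
        # Repeat-rich queries: switch off DUST filtering of the query.
        "no_dust": ["blastn", "-task", "blastn", "-dust", "no"] + common,
        # Possibly coding queries: translated search of query and database.
        "translated": ["tblastx"] + common,
    }
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the BLAST+ binaries and a formatted database are in place.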
In your situation I would probably set up a pipeline, first looking for matches with a fast and not-so-sensitive method. If nothing is found, I would try different things in parallel: a shorter or longer word size, and also a different filter database (e.g. one that contains only repetitive elements found in both species). Repeats that are only present in one of the species will not generate spurious matches but might mask valuable sequence material in the query or target.
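Such a pipeline could be organised along these lines. This is a minimal sketch, not a finished tool: `fast_search` and the entries of `fallback_searches` are hypothetical callables (for instance thin wrappers around whichever aligners you choose) that take a query and return a list of hits.

```python
from concurrent.futures import ThreadPoolExecutor


def tiered_search(query, fast_search, fallback_searches):
    """Try a fast search first; if it finds nothing, run the fallbacks in parallel.

    `fast_search` is a callable and `fallback_searches` a dict mapping a
    strategy name to a callable. All of them are hypothetical: each takes
    the query and returns a list of hits.
    """
    hits = fast_search(query)
    if hits:
        return {"fast": hits}
    # Nothing found: launch every fallback strategy at the same time.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query)
                   for name, fn in fallback_searches.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    # Report only the strategies that actually found something.
    return {name: found for name, found in results.items() if found}
```

Threads are enough here because the real work would happen in external aligner processes; the Python side only waits for them.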
Albert, you did not mention what speed you intend to achieve. If you only have a couple of sequences, you can just use Smith-Waterman. BWT-SW is also a good choice for mammalian genomes. It has a similar speed to blast for short sequences while giving alignments identical to Smith-Waterman. It is frequently overlooked. BWA-SW can be a reasonable choice when speed is critical, but you need to increase "-z" (e.g. to 10 or even 100) to achieve high sensitivity. SSAHA2 is also a decent choice, although the default settings would not work; you need to ask Zemin about the options.
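For reference, the Smith-Waterman recurrence itself is short. The sketch below computes only the best local alignment score with a linear gap penalty; the scoring values are arbitrary assumptions, and a pure-Python loop like this is far too slow for a whole genome, so treat it as an illustration of the algorithm rather than something to run at scale.

```python
def smith_waterman_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best local alignment score of strings a and b (linear gap penalty).

    The default scores are illustrative assumptions, not tuned values.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these defaults, `smith_waterman_score("ACGT", "ACGT")` returns 8 (four matches at 2 each).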
For mammalian genomes, blast is too slow. Algorithms that index the genome are usually much faster. I do not know whether blast+ gives a significant boost. Probably not.
EDIT: People should read the BWT-SW paper [PMID:18227115]. It performs the exact SW algorithm while being faster than blast in the 100-1000bp range. The BWA-SW paper [PMID:20080505] shows that SSAHA2 performs well for 10% divergence. I do not see why it cannot be tuned to work for sequences with the human-mouse divergence. BWA-SW is using the default settings. It may work well with a more sensitive configuration. I am not sure about this bit, though, as I have not tried. Also, modern SSE2/CUDA powered SW is approaching the speed of blast (though it is not there yet). PatternHunter is known to work better than blast for cross-species alignment. A few groups (I cannot remember which) have improved blast, though these efforts are less known. A lot has happened in the past 10 years in the area of sequence analysis, but blast has not changed much so far as I know. These efforts are not in vain.
I guess Albert is mapping transcriptome contigs. I think he needs a fast aligner.
I would use the classical concepts of synteny or orthology for such cross-genome comparisons. As you have already been advised, BLAST+ or the older version of BLAST (BLAST 2.2.) is a good choice for both.
Depending upon the length of your sequence, you may have to use the right matrices and word length. If you are using the latest version of BLAST+, most of these features are already built into the algorithm (see the FAQ here; I hope this is applicable to the CLI as well).
You can also consult specialized cross-genome comparison tools such as the Vista tools, Cinteny, etc.
I would suggest the MOSAIK aligner. It can achieve high sensitivity. It uses the Smith-Waterman algorithm for alignment. It indexes both reads and references and creates a jump database to speed up the alignment process. It has multiple options (for gap opening/extension, allowed mismatches, hash size...) to optimise the alignment process, similar to those of blast. It also has an option to use the IUPAC ambiguity codes during alignment.