Forum: Is there really a difference between blastn -task blastn and -task megablast when dealing with unmapped reads?
12 months ago
DNAngel ▴ 210

Hi all,

I've been running a variety of blastn options on my metagenomic dataset, with the goal of finding the best hits for my contigs (produced with metaSPAdes). A question has come up that I wanted to discuss with you all. When dealing with metagenomic data (in my case, I am working with all the unmapped reads of a dataset, in the hope of figuring out what those reads might map to), does it REALLY make a difference whether the blastn -task option is set to 'megablast' or 'blastn'?

Megablast is defined as the faster BLAST algorithm, intended for comparing sequences within the same species. It uses a larger default word size of 28 and should produce better-quality hits between highly similar sequences. blastn is better when you are comparing sequences from different species (it uses a smaller default word size of 11). Megablast would be best when you want to, for example, confirm that your sequences really are from your species of interest, or when you are comparing different strains of a bacterium and need more precise BLAST results within a family of bacteria.
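To make the word-size point concrete, here is a toy Python sketch. It is not how BLAST is implemented; it only counts exact shared words, which is roughly what a seed has to be before any extension happens. The sequences are synthetic and the 95% identity pattern (one substitution every 20 bases) is an assumption chosen for illustration.

```python
import random


def shared_words(query, subject, word_size):
    """Count query words of length word_size that occur exactly in subject."""
    subject_words = {
        subject[i:i + word_size]
        for i in range(len(subject) - word_size + 1)
    }
    return sum(
        1
        for i in range(len(query) - word_size + 1)
        if query[i:i + word_size] in subject_words
    )


# Deterministic substitution so a "mutated" base always differs from the original.
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}

rng = random.Random(0)
subject = "".join(rng.choice("ACGT") for _ in range(300))

# Substitute every 20th base: 95% identity, longest exact run is 19 bases.
query = "".join(
    SWAP[base] if i % 20 == 19 else base
    for i, base in enumerate(subject)
)

for word_size in (11, 28):
    print(word_size, shared_words(query, subject, word_size))
```

With evenly spread mismatches like this, no exact 28-base word survives, so a word size of 28 has nothing to seed from, while a word size of 11 still finds plenty of seeds. Real divergence is not evenly spread, so this is a worst case rather than a prediction for any given dataset.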

But I guess my final question is this: using unmapped reads/metagenomic/environmental data already implies that I am not trying to validate my sequences against a KNOWN species; I am simply trying to determine what this mixture of reads could be. Would it still make sense to use megablast, or is there really no difference between megablast and blastn in this case?

Has anyone really found a major difference between the two when dealing with metagenomic data? Just curious. I will be running both regardless to see what I have but it is taking a while (a few days to finish) so I figured I would start this discussion in the meantime.

blastn • 1.1k views

Have you tried clustering your unmapped reads with e.g. cd-hit or vsearch at 99% identity prior to blast? This has huge potential to reduce the number of blast queries, i.e. you would just blast the cluster representative sequences. As to whether there's a difference between the blast strategies: for sure, -task blastn will result in many more hits.
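As a rough illustration of why this helps, here is a minimal Python sketch that collapses exact duplicates only. It is a simplification, not a replacement for cd-hit or vsearch: it does no 99% identity clustering and ignores reverse complements. The read names and sequences are made up.

```python
def dereplicate(records):
    """Group (id, sequence) pairs by identical sequence, ignoring case.

    Returns a dict mapping each unique sequence to the list of ids that
    carry it, in first-seen order.
    """
    clusters = {}
    for seq_id, seq in records:
        clusters.setdefault(seq.upper(), []).append(seq_id)
    return clusters


# Hypothetical input: four reads, three of which are the same sequence.
reads = [
    ("read1", "ACGTACGT"),
    ("read2", "acgtacgt"),
    ("read3", "TTGACCA"),
    ("read4", "ACGTACGT"),
]

clusters = dereplicate(reads)

# One representative per cluster: the first id seen for that sequence.
representatives = [(ids[0], seq) for seq, ids in clusters.items()]

print(len(reads), "reads ->", len(representatives), "queries")
for seq, ids in clusters.items():
    print(ids[0], "represents", ids)
```

Only the representatives need to go to BLAST; the hits can be copied back to the other members of each cluster afterwards.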


No, I haven't tried that, because I need to blast each contig individually. Say I changed the word size of megablast to 11: would it then perform similarly to blastn in that regard?


Each contig? You wrote about blasting unmapped reads, not contigs. What would be the point of blasting the exact same sequences n times (the redundancy in your read set), when you could just blast the cluster representative sequences? The end result would be the same, except that it would probably take far less time. As to the blast specifics, I'm pretty sure blastn and megablast differ in more ways than just word size (e.g. the gap open/extension and mismatch penalties).
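A small Python sketch of that last point. The default values in the table are what I recall from the BLAST+ command-line documentation (megablast's 0/0 gap costs select its linear gap scheme), so treat them as an assumption and check them against `blastn -help` for your installed version. The file names and database name are placeholders, and the script only builds a command line; it does not run BLAST.

```python
# Assumed per-task defaults; verify with `blastn -help` on your BLAST+ version.
TASK_DEFAULTS = {
    "megablast": {"word_size": 28, "reward": 1, "penalty": -2,
                  "gapopen": 0, "gapextend": 0},
    "blastn": {"word_size": 11, "reward": 2, "penalty": -3,
               "gapopen": 5, "gapextend": 2},
}


def build_blastn_command(task, query, db, out, **overrides):
    """Return a blastn argument list; overrides become -name value pairs."""
    cmd = ["blastn", "-task", task, "-query", query,
           "-db", db, "-out", out, "-outfmt", "6"]
    for name, value in overrides.items():
        cmd += [f"-{name}", str(value)]
    return cmd


def differing_defaults(task_a, task_b, overrides_a=None):
    """Parameters where task_a (after overrides) still differs from task_b."""
    effective_a = {**TASK_DEFAULTS[task_a], **(overrides_a or {})}
    return {
        name: (value, TASK_DEFAULTS[task_b][name])
        for name, value in effective_a.items()
        if value != TASK_DEFAULTS[task_b][name]
    }


# megablast with the word size lowered to 11, as asked above.
cmd = build_blastn_command("megablast", "contigs.fasta", "nt",
                           "megablast_w11.tsv", word_size=11)
print(" ".join(cmd))

# What still separates it from -task blastn under the assumed defaults.
print(differing_defaults("megablast", "blastn", {"word_size": 11}))
```

Under those assumed defaults, setting the megablast word size to 11 removes the seeding difference but leaves the match/mismatch and gap scoring different, so the two runs would not be expected to give identical results.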


In my post I said I am blasting contigs from metaSPAdes. I only mentioned that the contigs come from unmapped reads to indicate the type of data I am working with. Here, the unmapped reads are treated much like metagenomic or even environmental DNA: essentially a slurry of DNA sequences.