Question: Blast Alignment Bug
7.5 years ago by
Maria K60
Maria K60 wrote:

I tried to find a short oligonucleotide sequence (probe) in a transcript that I knew for certain contained the probe. But the latest version of stand-alone BLAST (2.2.25+) found no match for it. Surprisingly, when I split the probe sequence into two parts, both were found in the transcript, one directly after the other. Moreover, when I deleted the first two nucleotides from the probe sequence, BLAST found the correct match. Could anyone explain what kind of problem I am facing? I have about 20 such probe sequences that BLAST did not find even though there was a perfect match.

I ran the search with the following parameters:

blastn -query probe.fa -db target -task blastn-short -word_size 7 -evalue 100 -out res.out

UPDATE: Changing the -word_size parameter to 5 was really helpful: BLASTN found the correct match. But there are still several probes for which it fails to find the correct match, although the transcript definitely contains the probe sequence. Stand-alone BLAST allows -word_size values >= 4, but even with -word_size 4 the match could not be found. The online BLAST finds the match. What should I do in this case?

The new problem data is:

>probe_seq

CCCCCCCCTCGGAGAGAGAGAGA

>transcript_seq

tccctctcccccccttctctctctctccgaggggggggggtcccagggagggaggggggg tcccccgatcagcatgtggctcctggcgctgtgtctggtggggctggcgggggctcaacg cgggggagggggtcccggcggcggcgccccgggcggccccggcctgggcctcggcagcct cggcgaggagcgcttcccggtggtgaacacggcctacgggcgagtgcgcggtgtgcggcg cgagctcaacaacgagatcctgggccccgtcgtgcagttcttgggcgtgccctacgccac gccgcccctgggcgcccgccgcttccagccgcctgaggcgcccgcctcgtggcccggcgt gcgcaacgccaccaccctgccgcccgcctgcccgcagaacctgcacggggcgctgcccgc catcatgctgcctgtgtggttcaccgacaacttggaggcggccgccacctacgtgcagaa ccagagcgaggactgcctgtacctcaacctctacgtgcccaccgaggacggtccgctcac aaaaaaacgtgacgaggcgacgctcaatccgccagacacagatatccgtgaccctgggaa gaagcctgtgatgctgtttctccatggcggctcctacatggaggggaccggaaacatgtt cgatggctcagtcctggctgcctatggcaacgtcattgtagccacgctcaactaccgtct tggggtgctcggttttctcagcaccggggaccaggctgcaaaaggcaactatgggctcct ggaccagatccaggccctgcgctggctcagtgaaaacatcgcccactttgggggcgaccc cgagcgtatcaccatctttggttccggggcaggggcctcctgcgtcaaccttctgatcct ctcccaccattcagaagggctgttccagaaggccatcgcccagagtggcaccgccatttc cagctggtctgtcaactaccagccgctcaagtacacgcggctgctggcagccaaggtggg ctgtgaccgagaggacagcgctgaagctgtggagtgtctgcgccggaagccctcccggga gctggtggaccaggacgtgcagcctgcccgctaccacatcgcctttgggcccgtggtgga tggcgacgtggtccccgatgaccctgagatcctcatgcagcagggagaattcctcaacta cgacatgctcatcggcgtcaaccagggagagggcctcaagttcgtggaggactctgcaga gagcgaggacggtgtgtctgccagcgcctttgacttcactgtctccaactttgtggacaa cctgtatggctacccggaaggcaaggatgtgcttcgggagaccatcaagtttatgtacac agactgggccgaccgggacaatggcgaaatgcgccgcaaaaccctgctggcgctctttac tgaccaccaatgggtggcaccagctgtggccactgccaagctgcacgccgactaccagtc tcccgtctacttttacaccttctaccaccactgccaggcggagggccggcctgagtgggc agatgcggcgcacggggatgaactgccctatgtctttggcgtgcccatggtgggtgccac cgacctcttcccctgtaacttctccaagaatgacgtcatgctcagtgccgtggtcatgac ctactggaccaacttcgccaagactggggaccccaaccagccggtgccgcaggataccaa gttcatccacaccaagcccaatcgcttcgaggaggtggtgtggagcaaattcaacagcaa ggagaagcagtatctgcacataggcctgaagccacgcgtgcgtgacaactaccgcgccaa caaggtggccttctggctggagctcgtgccccacctgcacaacctgcacacggagctctt 
caccaccaccacgcgcctgcctccctacgccacgcgctggccgcctcgtccccccgctgg cgccccgggcacacgccggcccccgccgcctgccaccctgcctcccgagcccgagcccga gcccggcccaagggcctatgaccgcttccccggggactcacgggactactccacggagct gagcgtcaccgtggccgtgggtgcctccctcctcttcctcaacatcctggcctttgctgc cctctactacaagcgggaccggcggcaggagctgcggtgcaggcggcttagcccacctgg cggctcaggctctggcgtgcctggtgggggccccctgctccccgccgcgggccgtgagct gccaccagaggaggagctggtgtcactgcagctgaagcggggtggtggcgtcggggcgga ccctgccgaggctctgcgccctgcctgcccgcccgactacaccctggccctgcgccgggc accggacgatgtgcctctcttggcccccggggccctgaccctgctgcccagtggcctggg gccaccgccacccccaccgcccccctcccttcatcccttcgggcccttccccccgccccc tcccaccgccaccagccacaacaacacgctaccccacccccactccaccactcgggtata gggggtgggtggggaggccctcctccccggccctccctggcccggccactccgaaggcag ggaggaggacttggcaactggcttttctcctgtggagtcgtcacacgccatccagcagcg ctaaggtggacatgggattcctccctgcgatgcgtgtctttcccacgcagagaagcccag tctcttctctggatctgggcctttgaacaactggggggcgttttctcccccccattggga caccagtcttcggtgtgtggaatgtggtattttcccgcgtggaggtgtgctttctcacaa cggggtgtgttttcccatgtgcagggtgaggtttttttttgccaccctggacacatgttg gccccctcaaagaatttctgtggggatttgtaccccagaatcctgttcccccatcccttc tcccacctcctcccctctccctccccctggagaccctggaagtggtgtgttcacatacag tgacccttggccaccagaccacagaggatggagcctgggaagcagcgaggaaatcacagc cccctcgcccctgcctcccttgcccctaccccggcgaagcatgttccccccgacgccccc cttggcacaagtcagatgaagcacgttctgccggggaggccctcaccttccagagaggac agacacagatttcctgctgggggagggaggagtccacgcatcctgatgctgcctggaagc ttattttcccgtggccaggacgcatttctctgagtggaaacaggttcttgcatgtggatg tgtgtttccccaggcagacggcccctctcttcccagcacttccctgcctcccccaggcct caggcccagcacccagttcctcctcacatggcaggtgagcacagacttctagttggcagg agctgaggagggtgaacaaaccccgagggaggcccggcccttgctcccgagttgggggga gggggtgtggcaacgtgccccccgcagaggccacgcatgtttgaccaaagccctcattgt ggtccgaggacagccttttccccaggcctcagagcattgctcatccgtgccaaactgggt aggtggatttgagcggaaagactcccaaaatgtgccaagaatttcccagtcccaggcagg gcaggggaaactaagggcaagcaggatacagggcgagggatgtggcaggtgagggggctc ccgcctgtgccccttctcctcaccatgtctcccccaccctgcctcagttctccgttcccc 
ttcatctccgtccccctctttgaagctgtccccatctcagtgtcagaccagccttctcct cagctgaccaccctcctctgacccacgccccctccttgtctgaaagaaaggagccttgaa tggtggagggaggcagtggggagaaaggtctcaccggacaggttgggagaatgaggtcag cggtgctggggaacagatggagggggcagtggggacagggcttgggcagacaccagcagg aataatttgaaatgtgtgaggtgactccccggagggccttgggcttgggcatttgggaaa agaatgatgtctggaagggcttaagggacacagtggacgaggggagagtcctcatctgct ggcattttgtggggtgttagtgccaaacttgaataggggctggggtgctgtcttccactg acacccaaatccagaatccctggtcttgagtccccagaactttgcctcttgactgtccct tctcttcctacctccatccatggaaaattagttattttctgatcctttcccctgcctggt ctagctcctctccaaacagccatgccctccaaatgctagagacctgggccctgaaccctg tagacagatgccctcagaattggggcatgggaggggggctgggggaccccatgattcagc cacggactccaatgcccagctcctctccccaaaacaatcccgacaatcccttatccctac cccaaccctttgcggctctgtacacatttttaaacctggcaaaagatgaagagaatattg taaatataaaagtttaactgtt

The correct match is at positions 38-14.
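
Before suspecting the aligner, it can help to confirm the perfect match independently of BLAST. What follows is a minimal Python sketch, not part of the original post: the function names are made up for illustration, and it only looks for exact matches of the probe or of its reverse complement. The descending coordinates above suggest a minus-strand hit, which is why both strands are checked.

```python
# Sanity check outside BLAST: look for a perfect match of the probe,
# or of its reverse complement, in the transcript.
# Function names are illustrative, not from any library.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_exact(probe, transcript):
    """Return (strand, start, end) with 1-based inclusive coordinates on the
    transcript's plus strand, or None if there is no perfect match."""
    # Drop whitespace left over from FASTA line wrapping; ignore case.
    target = "".join(transcript.split()).upper()
    for strand, query in (("+", probe.upper()),
                          ("-", reverse_complement(probe).upper())):
        pos = target.find(query)
        if pos != -1:
            return strand, pos + 1, pos + len(query)
    return None


if __name__ == "__main__":
    probe = "CCCCCCCCTCGGAGAGAGAGAGA"
    # First 60 bases of transcript_seq from the post above.
    transcript_start = (
        "tccctctcccccccttctctctctctccgaggggggggggtcccagggagggaggggggg"
    )
    print(find_exact(probe, transcript_start))
```

For the probe as posted, this sketch reports a minus-strand match within the first 60 bases of the transcript, so the sequence really is present and the missing hit has to come from the search settings.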


Could you provide the command you used to run the local copy of BLAST? It may be that the parameter settings you used are different from the ones used by the web tool.

written 7.5 years ago by SimonCB765150

I used:

blastn -query probe.fa -db target -task blastn-short -word_size 7 -evalue 100 -out blastprobe.out

It turned out that with -word_size 5 BLASTN finds the correct match.

written 7.5 years ago by Maria K60
7.5 years ago by
Jake150
Oxford, UK
Jake150 wrote:

You need to use a smaller word size. The online version works that out automatically. This is nicely explained here. Like you, when I run BLAST with the standard parameters I get no hits.

Running with:

blastn -db data.fasta -query target.fasta -word_size 5

I get the hit.
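
To compare word sizes systematically, the command line can be generated for each value. This is a minimal Python sketch, assuming BLAST+ is installed; the file names (probe.fa, target, and the res_w*.out outputs) are placeholders, and the helper name is made up for illustration. It only prints the commands, which could then be run by hand or passed to subprocess.run.

```python
import shlex


def blastn_short_cmd(query, db, word_size, evalue=100, out=None):
    """Build a blastn command (as an argument list) for short queries."""
    cmd = [
        "blastn",
        "-query", query,
        "-db", db,
        "-task", "blastn-short",
        "-word_size", str(word_size),
        "-evalue", str(evalue),
    ]
    if out is not None:
        cmd += ["-out", out]
    return cmd


if __name__ == "__main__":
    # 4 is the smallest word size stand-alone blastn accepts, per this thread.
    for w in (7, 5, 4):
        cmd = blastn_short_cmd("probe.fa", "target", w, out=f"res_w{w}.out")
        print(shlex.join(cmd))
```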


Best answer +1. For oligos, you can even back the word size down a bit more.

written 7.5 years ago by Larry_Parnell16k

Thanks, Jake! That was really helpful. But I still don't understand why BLASTN worked correctly for about 800,000 short 25-nucleotide sequences, and only for 20 sequences do I have to use a smaller -word_size. Perhaps it's an implementation problem.

written 7.5 years ago by Maria K60

UPDATE: With this probe sequence the algorithm works, but I found several more sequences for which BLASTN fails to find the correct match even with -word_size 4. And 4 is the smallest allowed value.

written 7.5 years ago by Maria K60

Also disable masking. For old blastall, the option is "-F F".

written 7.5 years ago by lh331k

@Maria: to me, it is inappropriate to claim a program has a bug before you thoroughly understand how it works.

written 7.5 years ago by lh331k
7.5 years ago by
Maria K60
Maria K60 wrote:

Actually, the answer can be found in the NCBI BLAST FAQs: http://www.genebee.msu.su/blast/blast_faqs.html

The advice there is to turn off filtering of the query sequence. BLAST filters low-complexity sequence, so the effective query may become even shorter than 25 nucleotides and the match will have low statistical significance. To disable filtering, set the -dust parameter to 'no'. With this setting BLASTN finds the match:

blastn -query probe.fa -db target -task blastn-short -word_size 7 -evalue 1000 -dust no -out res.out
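
Why masking hurts short queries can be illustrated with a toy model: a seed is only possible where the query has at least word_size consecutive unmasked bases. The sketch below is not BLAST's actual seeding code. It assumes a soft-masked query in which lowercase letters mark masked bases, the function names are made up for illustration, and the mask in the example is hypothetical (the real DUST mask for this probe may differ).

```python
def unmasked_runs(soft_masked_query):
    """Lengths of maximal runs of unmasked (uppercase) bases."""
    runs, current = [], 0
    for base in soft_masked_query:
        if base.isupper():
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def seedable_words(soft_masked_query, word_size):
    """Number of query words of length word_size containing no masked base."""
    return sum(max(0, run - word_size + 1)
               for run in unmasked_runs(soft_masked_query))


if __name__ == "__main__":
    # Hypothetical mask: only "TCGG" is left unmasked.
    masked = "ccccccccTCGGagagagagaga"
    unmasked = masked.upper()
    for w in (7, 5, 4):
        print(w, seedable_words(masked, w), seedable_words(unmasked, w))
```

In this model, a query with no unmasked run of at least word_size bases has nothing to seed from, however well the full probe matches. Turning the filter off makes every word of the probe available again.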


Option -F F works the same way.

written 7.5 years ago by boczniak767640