Question: tblastn doesn't find the whole sequence
gravatar for jsvedber
6 months ago by
jsvedber40 wrote:

Hi all,

I'm having some problems with running TBLASTN locally that I was hoping someone here might have some ideas on what to do with.

I have a smallish fungal genome, and in that genome I have an open reading frame in a single exon forming a contigous 405 bp nucleotide sequence from start codon to stop codon. If I take this 405 bp nucleotide sequence and use blastn on the genome, I find the complete sequence. But if I convert the nucleotide sequence into an 134 amino acids long protein sequence (excluding the stop codon) and use tblastn on the same genome, the hit I get is only 113 amino acids long. I've tried several versions of BLAST from 2.2.31 to 2.7.1, and I've tweaked a number of parameters, for instance window_size, threshold and word_size, but nothing helps and I always get the same 113aa sequence.

Does anyone have any idea on what could cause this, and if is there anything I could try to improve the tblastn results? Alternatively, is there some other piece of software that I can try instead?



alignment software error • 359 views
ADD COMMENTlink modified 6 months ago by Mensur Dlakic4.0k • written 6 months ago by jsvedber40

apart from Joe 's insights and suggestions you could also play with the gap-open and gap-extensions cost parameters (set them both quite low and see what you get).

Out of curiosity , could you post the alignment of both blasts so we can inspect them?

ADD REPLYlink written 6 months ago by lieven.sterck7.2k

Changing penalties for starting and opening gaps does not help either.

Here is both the nucleotide and the amino acid alignment:

ADD REPLYlink modified 6 months ago • written 6 months ago by jsvedber40

Are you using an appropriate translation table? It’s possible that during the translation step, you’re getting a sequence which is less like the subject sequences due to incorrect amino acids?

It’s worth remembering too that BLAST (all variants) are local aligners, so there’s no guarantee you’d necessarily get back a full length sequence.

You could try relaxing your parameters for similarity, and then increase the culling limit to remove redundancy.

Alternatively, give something like DIAMOND or BLAT a go.

ADD REPLYlink written 6 months ago by Joe16k

Hi, thanks for your suggestions.

I'm just using the standard amino acid translation table. I've now translated it again using a different program and I get the same sequence. Relaxing similarity parameters or changing culling does not help either.

I realize that BLAST is a local aligner and some edge effects are expected, but I think it's a bit disconcerting that it doesn't work here, in a situation that seem pretty close to a best case scenario. I can also mention that I've used TBLASTN to look for the same amino acid sequence in ~20 closely related strains and species, and while I get hits in most of them, in no case is the hit longer than 113aa.

I tried BLAT as well (DIAMOND is not set up to easily do protein-to-DNA searches), and BLAT actually finds the whole 134aa sequence with 100 identity. Maybe I will try using BLAT going forward with the project, but I still want to understand what is causing this issue. :)

ADD REPLYlink written 6 months ago by jsvedber40

Can you post the sequence of missing 23 AA?

ADD REPLYlink written 6 months ago by genomax80k

There is not a unique reverse sequence translation from a protein. The same protein can be derived from a few (or many) dna sequences. Not sure how BLAST manages that... If it only converts it to one random sequence then it can happen that it doesn't find your original sequence. enter image description here

ADD REPLYlink written 6 months ago by juanjo75es70

As far as I understand, tblastn first convert the target sequence into amino acids in all six reading frames, and then uses searches these sequences.

ADD REPLYlink written 6 months ago by jsvedber40
gravatar for Mensur Dlakic
6 months ago by
Mensur Dlakic4.0k
Mensur Dlakic4.0k wrote:

The stretch of prolines at the end of your protein sequence is compositionally biased. Have you tried the -filter none switch?

Here is your protein (translated from DNA):


Here is what the filtering program (SEG) says about your protein:


                                  1-113  MNNTTLPVPVPISCDGNPSVQWYVWLIASI
         qpaeepqaappppydvavaaa  114-134

In other words, residues 114-134 are of low complexity and will not be visible to TBLASTN unless you turn off the filtering.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Mensur Dlakic4.0k

Yes, you are correct! When turning off filtering with the -seg no option, the whole 134aa sequence is returned. I was wondering if there was some low complexity filtering going on, but I didn't know blast would do that based on the amino acid sequence.

This was not super well documented, I have to say, and another thing to think about when using BLAST.

Anyway, thanks for the help! This was really bugging me. :)

ADD REPLYlink modified 6 months ago • written 6 months ago by jsvedber40

jsvedber : Low complexity filtering is documented in BLAST FAQ.

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax80k

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted (green check mark on left). You can accept more than one if they work.

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax80k

I don't seem to be able to either upvote Mensur's reply or mark it as accepted. Am I missing something?

ADD REPLYlink written 6 months ago by jsvedber40

Hmm. It is clickable links at the left of the answer. You can't click on either? You may need to have scripting turned on (if you use no script or some such extension).


ADD REPLYlink written 6 months ago by genomax80k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1211 users visited in the last hour