tblastn doesn't find the whole sequence

0

Entering edit mode

5.8 years ago

jsvedber ▴ 40

Hi all,

I'm having some problems with running TBLASTN locally that I was hoping someone here might have some ideas on what to do with.

I have a smallish fungal genome, and in that genome I have an open reading frame in a single exon forming a contigous 405 bp nucleotide sequence from start codon to stop codon. If I take this 405 bp nucleotide sequence and use blastn on the genome, I find the complete sequence. But if I convert the nucleotide sequence into an 134 amino acids long protein sequence (excluding the stop codon) and use tblastn on the same genome, the hit I get is only 113 amino acids long. I've tried several versions of BLAST from 2.2.31 to 2.7.1, and I've tweaked a number of parameters, for instance window_size, threshold and word_size, but nothing helps and I always get the same 113aa sequence.

Does anyone have any idea on what could cause this, and if is there anything I could try to improve the tblastn results? Alternatively, is there some other piece of software that I can try instead?

Cheers,

Jesper

alignment software error • 3.3k views

ADD COMMENT • link updated 5.8 years ago by Mensur Dlakic ★ 29k • written 5.8 years ago by jsvedber ▴ 40

1

Entering edit mode

apart from Joe 's insights and suggestions you could also play with the gap-open and gap-extensions cost parameters (set them both quite low and see what you get).

Out of curiosity , could you post the alignment of both blasts so we can inspect them?

ADD REPLY • link 5.8 years ago by lieven.sterck 15k

0

Entering edit mode

Changing penalties for starting and opening gaps does not help either.

Here is both the nucleotide and the amino acid alignment:

	########################################################################################
	## First the complete nucleotide alignment

	Score = 749 bits (405), Expect = 0.0
	Identities = 405/405 (100%), Gaps = 0/405 (0%)
	Strand=Plus/Minus

	Query 1 ATGAACAACACCACCTTGCCCGTACCAGTCCCCATTTCGTGCGATGGCAACCCCAGCGTG 60
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736612 ATGAACAACACCACCTTGCCCGTACCAGTCCCCATTTCGTGCGATGGCAACCCCAGCGTG 1736553

	Query 61 CAATGGTATGTTTGGCTCATCGCAAGCATTGCATGGTGTGCAGTGGGCTGCATATGGTTT 120
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736552 CAATGGTATGTTTGGCTCATCGCAAGCATTGCATGGTGTGCAGTGGGCTGCATATGGTTT 1736493

	Query 121 TGGCATCGGTCCGAAACCGTTTGGACCTGGCTGAATGGTCCCATCGAGACATTCCACACC 180
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736492 TGGCATCGGTCCGAAACCGTTTGGACCTGGCTGAATGGTCCCATCGAGACATTCCACACC 1736433

	Query 181 GAGCAGCCGATGACCAGACGCTCGCGAATCACCACAGCCTGTTTCGTCTCAGTCTTGGTC 240
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736432 GAGCAGCCGATGACCAGACGCTCGCGAATCACCACAGCCTGTTTCGTCTCAGTCTTGGTC 1736373

	Query 241 GTTGCCCTGTGGCCATTGGAGATTTTAATCTCCAACATTTGGTCAAGGGTCACGCACGCG 300
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736372 GTTGCCCTGTGGCCATTGGAGATTTTAATCTCCAACATTTGGTCAAGGGTCACGCACGCG 1736313

	Query 301 ACAAACAACCAAGGAACACCAAAAGACCAACCGAGAACGCAACCAGCAGAAGAACCCCAA 360
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736312 ACAAACAACCAAGGAACACCAAAAGACCAACCGAGAACGCAACCAGCAGAAGAACCCCAA 1736253

	Query 361 GCGGCACCCCCGCCGCCATATGATGTCGCAGTGGCCGCCGCATAG 405
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	Sbjct 1736252 GCGGCACCCCCGCCGCCATATGATGTCGCAGTGGCCGCCGCATAG 1736208



	########################################################################################
	## Then a truncated amino acid alignment


	Score = 229 bits (585), Expect = 1e-70, Method: Compositional matrix adjust.
	Identities = 113/113 (100%), Positives = 113/113 (100%), Gaps = 0/113 (0%)
	Frame = -1

	Query 1 MNNTTLPVPVPISCDGNPSVQWYVWLIASIAWCAVGCIWFWHRSETVWTWLNGPIETFHT 60
	MNNTTLPVPVPISCDGNPSVQWYVWLIASIAWCAVGCIWFWHRSETVWTWLNGPIETFHT
	Sbjct 1736612 MNNTTLPVPVPISCDGNPSVQWYVWLIASIAWCAVGCIWFWHRSETVWTWLNGPIETFHT 1736433

	Query 61 EQPMTRRSRITTACFVSVLVVALWPLEILISNIWSRVTHATNNQGTPKDQPRT 113
	EQPMTRRSRITTACFVSVLVVALWPLEILISNIWSRVTHATNNQGTPKDQPRT
	Sbjct 1736432 EQPMTRRSRITTACFVSVLVVALWPLEILISNIWSRVTHATNNQGTPKDQPRT 1736274

view raw gistfile1.txt hosted with ❤ by GitHub

ADD REPLY • link 5.8 years ago by jsvedber ▴ 40

0

Entering edit mode

Are you using an appropriate translation table? It’s possible that during the translation step, you’re getting a sequence which is less like the subject sequences due to incorrect amino acids?

It’s worth remembering too that BLAST (all variants) are local aligners, so there’s no guarantee you’d necessarily get back a full length sequence.

You could try relaxing your parameters for similarity, and then increase the culling limit to remove redundancy.

Alternatively, give something like DIAMOND or BLAT a go.

ADD REPLY • link 5.8 years ago by Joe 22k

0

Entering edit mode

Hi, thanks for your suggestions.

I'm just using the standard amino acid translation table. I've now translated it again using a different program and I get the same sequence. Relaxing similarity parameters or changing culling does not help either.

I realize that BLAST is a local aligner and some edge effects are expected, but I think it's a bit disconcerting that it doesn't work here, in a situation that seem pretty close to a best case scenario. I can also mention that I've used TBLASTN to look for the same amino acid sequence in ~20 closely related strains and species, and while I get hits in most of them, in no case is the hit longer than 113aa.

I tried BLAT as well (DIAMOND is not set up to easily do protein-to-DNA searches), and BLAT actually finds the whole 134aa sequence with 100 identity. Maybe I will try using BLAT going forward with the project, but I still want to understand what is causing this issue. :)

ADD REPLY • link 5.8 years ago by jsvedber ▴ 40

0

Entering edit mode

Can you post the sequence of missing 23 AA?

ADD REPLY • link 5.8 years ago by GenoMax 152k

0

Entering edit mode

There is not a unique reverse sequence translation from a protein. The same protein can be derived from a few (or many) dna sequences. Not sure how BLAST manages that... If it only converts it to one random sequence then it can happen that it doesn't find your original sequence.

ADD REPLY • link 5.8 years ago by juanjo75es ▴ 130

4

Entering edit mode

As far as I understand, tblastn first convert the target sequence into amino acids in all six reading frames, and then uses searches these sequences.

ADD REPLY • link 5.8 years ago by jsvedber ▴ 40

3

Entering edit mode

5.8 years ago

Mensur Dlakic ★ 29k

The stretch of prolines at the end of your protein sequence is compositionally biased. Have you tried the -filter none switch?

Here is your protein (translated from DNA):

>prot
MNNTTLPVPVPISCDGNPSVQWYVWLIASIAWCAVGCIWFWHRSETVWTW
LNGPIETFHTEQPMTRRSRITTACFVSVLVVALWPLEILISNIWSRVTHA
TNNQGTPKDQPRTQPAEEPQAAPPPPYDVAVAAA

Here is what the filtering program (SEG) says about your protein:

>prot

                                  1-113  MNNTTLPVPVPISCDGNPSVQWYVWLIASI
                                         AWCAVGCIWFWHRSETVWTWLNGPIETFHT
                                         EQPMTRRSRITTACFVSVLVVALWPLEILI
                                         SNIWSRVTHATNNQGTPKDQPRT
         qpaeepqaappppydvavaaa  114-134

In other words, residues 114-134 are of low complexity and will not be visible to TBLASTN unless you turn off the filtering.

ADD COMMENT • link 5.8 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

Yes, you are correct! When turning off filtering with the -seg no option, the whole 134aa sequence is returned. I was wondering if there was some low complexity filtering going on, but I didn't know blast would do that based on the amino acid sequence.

This was not super well documented, I have to say, and another thing to think about when using BLAST.

Anyway, thanks for the help! This was really bugging me. :)

ADD REPLY • link 5.8 years ago by jsvedber ▴ 40

1

Entering edit mode

jsvedber : Low complexity filtering is documented in BLAST FAQ.

ADD REPLY • link 5.8 years ago by GenoMax 152k

0

Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted (green check mark on left). You can accept more than one if they work.

ADD REPLY • link 5.8 years ago by GenoMax 152k

0

Entering edit mode

I don't seem to be able to either upvote Mensur's reply or mark it as accepted. Am I missing something?

ADD REPLY • link 5.8 years ago by jsvedber ▴ 40

0

Entering edit mode

Hmm. It is clickable links at the left of the answer. You can't click on either? You may need to have scripting turned on (if you use no script or some such extension).

ADD REPLY • link 5.8 years ago by GenoMax 152k

Login before adding your answer.