Two similarly annotated sequences have no alignment similarity. Why?
Dear Biostars, hi (I am not a native English speaker, so be ready for some language flaws).

I have two sequences (from a de novo RNA-seq assembly). After blastn (and also blastx), they show similar results and annotations (vasotocin related),

but when I use the NCBI online "Align two or more sequences" tool, the answer is: "No significant similarity found".

Why is that? My assumption is that, since they show the same annotation and the same protein products, there should be some similarity (I even hoped for an exact 100% match!). Am I wrong?

Thanks

NOTE: my 2 SEQs:

>seq1

TCTGGGGAGCCCACGTAGCAGCCCATCCCTTCCCCACAGCAGATACTGGGGCCAAAGCAG
AGGCCCCGGTTTCCGGGGCCACATGACATGCACGGTCTTTGCAGATCAGGAAAAGAGCGC
TTTCCGCCTCGCGGACAGTTCTGGATGTAGCATGCAGATGAGAGCGCGAGGAGCCCCAGG
ACGCACAGAAGTGGAATAGTAGAATCTGGCATCTTGTTCCGTCTAAAGTGCTTCGTCCAA
TTTGTACTGTAGGCTACGGCTACAACTGAACTCCTTCAAATGGCTCGT

>seq2

GTGGTTGGTTACTGAGGTCTCCCTCTGCTGGTGGCATGTAGGATCCGCAGCAGGTCTCCT
GCCAAACCACCCATTAAGGCAGCGTTCTGTTCGCTGGGTGACTGACGTTTACTGTCCTCT
AGGCAGTCTGGGTCTAGCACACAACTCTCTGAGTCACAGCAGACTCCGGATGCAGCACAG
CTTCCCTCAGAGCCACACACTCTTCCTCCAGCCTCGCAGGGGGAGGGCAGGTAGTTCTCC

Are these full-length sequences? If not, you may be looking at two different parts of two sequences that code for the protein you refer to.

Hi @genomax, maybe I did not understand your answer clearly. These are the Trinity transcripts that had a BLAST hit with vasotocin. They are complete as Trinity assembly output and they have completely different IDs (meaning that they are different genes or loci).

I am not sure that having two different IDs in Trinity can be considered evidence that they are complete and are different genes. Here is a Clustal Omega alignment of the two. Ideally we should do an alignment of the translations.

CLUSTAL O(1.2.4) multiple sequence alignment

seq1      TCTGGGGAGCCCACGTAGCAGCCCATCCCTTCCCCACAGCAGATACTGGGGCCAAAGCAG
seq2      -GTG-GTTGGTTACTGAGGTCTCC----------CTC---TGCTGGTGGCATGTAGGATC
            ** *  *   **  **    **          * *    * *  ***     * *

seq1      AGGCCCCGGTTTCCGGGGCCACATGACATGCACGGTCTTTGCAGAT--CAGGAAAAGAGC
seq2      CGCAGCAGGTCTCCTGCCAAACCACCCATTAAGGCAGCGTTCTGTTCGCTGGGTGACTGA
           *   * *** *** *    **    ***  * *     * * * *  * **   *  *

seq1      GCTTTCCGCCTCGCGGACAGTTCTGGATGTAGCATGCAGATGAG--AGCGCGAGGAGCCC
seq2      CGTTTACTGTCCTCTAGGCAGTCTGGGTCTAGCACACAACTCTCTGAGTCACAGCAGACT
            *** *    * *       ***** * *****  **  *     **    ** ** *

seq1      CAGGACGCACAGAAGTGGAATAGTAGAATCTGGCATCTTGTTCCGTCTAAAGTGCTTCGT
seq2      CCGGATGCAGCACA------------------GCTTCCCTCAGAGCCACACACTCTTCCT
          * *** ***    *                  ** **       * *  *    **** *

seq1      CCAATTTGTACTGTAGGCTACGGCTACAACTGAACTCCTTCAAATGGCTCGT
seq2      CCAGCCTCGCAGGGGGAGGGCAGGTAGTTCTCC-------------------
          ***   *     *  *    * * **   **
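A note on reading an alignment like this: Clustal Omega will produce an alignment for any pair of sequences you give it, related or not, so the stars on their own do not prove homology. One rough way to put a number on it is the percent identity over gap-free columns. The snippet below is only a minimal sketch in plain Python; the function name `percent_identity` is made up for illustration, and the two strings are the first block of the alignment above. For unrelated DNA you would already expect roughly a quarter of the columns to match by chance, and gap placement pushes that figure higher.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns where both residues are identical.

    Both inputs must come from the same alignment, so they have equal length
    and use '-' for gaps. (Hypothetical helper, not part of any library.)
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# First block of the seq1/seq2 alignment shown above.
block_seq1 = "TCTGGGGAGCCCACGTAGCAGCCCATCCCTTCCCCACAGCAGATACTGGGGCCAAAGCAG"
block_seq2 = "-GTG-GTTGGTTACTGAGGTCTCC----------CTC---TGCTGGTGGCATGTAGGATC"
print(f"{percent_identity(block_seq1, block_seq2):.1f}% identity over gap-free columns")
```

To score the whole alignment, concatenate the blocks of each sequence before calling the function.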

Oh, thanks for your efforts and fast support!

Does it tell us that NCBI is correct in showing "no similarity"? Or does it show that there is some similarity?

I took one of the common hits (from individual BLAST searches with those two sequences) and aligned it (Oncorhynchus kisutch vasotocin-neurophysin VT 1). One would need to spend some time on this. Ideally you should do a similar exercise with translations and a common protein BLAST hit.

CLUSTAL O(1.2.4) multiple sequence alignment

seq1                ------------------------------------------------------------
seq2                ------------------------------------------------------------
XM_020465836.1      AATACCGGAAAGTTCCTAGCAGACATTCGAAAAGAAAAACCGAGCCCTTTGAAAGAGTTC

seq1                ------------------------------------------------------------
seq2                ------------------------------------------------------------
XM_020465836.1      AGTTGTAGCCGACAGTATCAATTGGACGAAGCACTTCAGACTGAACAAGATGCCATATTC

seq1                --------------------TCTGGGGAGCCCACGTAGCAGCCCATCCCTTCCCCACAGC
seq2                ------------------------------------------------------------
XM_020465836.1      TACGTTTCCACTGCTGTGGGTCCTGGGGCTCCTCGCGCTATCCT--CCGCGTGCTACATC

seq1                AGATACTGGGGCCAAAGCAG-AGGCCCCGGTTTCCGGGG-CCACATGACATGCACGGTCT
seq2                ------------------------------------------------------------
XM_020465836.1      CAGAACTGTCCGCGAGGCGGGAAGCGCTCTTTTCCTGATCTTCCACGACAGTGCATGTCG

seq1                TT-----------------------------------------G--CAGATCAGGAAAAG
seq2                ---------------------------------------------------------GT-
XM_020465836.1      TGTGGCCCCGGGGACAGGGGCCGCTGCTTTGGCCCCAATATCTGCTGTGGGGAGGGAATG

seq1                AGCGCTTTCCGCCTCGCGGACAGTTCTGGATGTAGCATGCAGATGAGAGCG-CGAGGAGC
seq2                GGTTGGTTACTGAGGTCTCCCTCTGCTGGTGGCATGTAGGATCCGCAGCAGGTCTCCTGC
XM_020465836.1      GGCTGTTACATGGGCTCCCCAGAGGCAGCTGGTTGTGTGGAGGAGAACTACCTGCCCTCC
                     *    *         *        * *   *      * *   *              *

seq1                CCCAGGACGCACAGAAGTGGAATAGTAGAATCTGGCATCTTGTTCCGTCTAAAGTGCTTC
seq2                CAAACCACCCATTAAGGCAGCGTTCTGTTCGCTG----------GGTGA-----------
XM_020465836.1      CCCTGCGAGGCTGGAGGAAGAGTGTGTGGCTCTG----------AGGGAAGCTGTGCTGC
                    *             * *  *  *        ***

seq1                GTCCAATTTGTACTGTAGGCTAC---------GGCTACAACTGAACTCCT----------
seq2                --CTGACGTTTACTGTCCTCTAGGCAGTCTGGGTCTAGCACACAACTCTCTGAGTCACAG
XM_020465836.1      ATCCGGAGTCTGCTGTGACTCAGAGAGTTGTGCGCTAGACCCAGACTGCCTAGAGGACAG
                      *     * * ****     *            ***   *   ***

seq1                ------------------------------------------------------------
seq2                ------------------------------------------------------------
XM_020465836.1      TAAACGTCAGTCACCCAGCGAACAGAACGCTGCCTTAATGGGTGGTTTGGCAGGAGACCT

seq1                ---TCAAATGGCTCGT--------------------------------------------
seq2                ----------------------------------------------CAGACTCCGGAT--
XM_020465836.1      GCTGCAGATCCTACATGCCACCAGCAGAGGGAGACCTCAGTAACCAACCACTGCCCATCC

seq1                ------------------------------------------------------------
seq2                ----------------------GCAGCACAGCTTCCCTCA--------------GAGCCA
XM_020465836.1      CTCACCTGAACACACCCAGAATAGAGCTTAAATTCACCATTTCACATGCACTACTACAAA

seq1                ------------------------------------------------------------
seq2                CACACTCTTCCTCCAGCCTCGCAGGGGGAGG-------------GCAGGT----------
XM_020465836.1      AACAAACCTCACACAGATTCACAGACACACAGCAGAAGTAGAGAGCAGGCTTGCTACATA

seq1                ------------------------------------------
seq2                ---------------AGTTCTCC-------------------
XM_020465836.1      AGGGGGAAATTTATCAGCTCTACATGAATGTTTACTGTGTGC

Oh cool, we did the same >.<

You are telling me that the "tail" of one transcript codes for vasotocin and the "head" of another transcript codes for vasotocin, too. And I am looking at that "tail" and "head", AND this "tail" and "head" that code for the same thing have no sequence similarity?

Am I getting your point correctly?

See my (and @Wouter's) new answers. It could be as simple as that, but you would need to look at this carefully.

I just did a standard BLAST for both sequences and found this:

So both sequences indeed have a hit on the same gene.

Running clustal omega for the identified gene and your two sequences looks like this:

Looks like they both belong to the same gene, but to different parts (partially overlapping?).
I'm not sure what's the best conclusion for this.
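One thing that may be worth ruling out before drawing a conclusion: contigs from a non-strand-specific Trinity assembly can come out in either orientation, and Clustal Omega aligns the sequences only as given, without trying the reverse complement. A quick way to see where (and on which strand) a contig sits on a reference transcript is to look for shared exact k-mers in both orientations. The following is just a sketch in plain Python under those assumptions; `kmer_hits`, `locate` and the choice of `k` are hypothetical names and values, and the toy sequences are invented for the demo, not taken from this thread.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string; unknown characters become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def kmer_hits(query: str, reference: str, k: int = 15) -> list:
    """0-based reference start positions of exact k-mers also present in query."""
    query, reference = query.upper(), reference.upper()
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return [i for i in range(len(reference) - k + 1)
            if reference[i:i + k] in query_kmers]


def locate(query: str, reference: str, k: int = 15) -> dict:
    """For each orientation of query, return (number of shared k-mers,
    first covered reference position, end of last shared k-mer)."""
    result = {}
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        hits = kmer_hits(seq, reference, k)
        if hits:
            result[strand] = (len(hits), min(hits), max(hits) + k)
        else:
            result[strand] = (0, None, None)
    return result


# Toy demo: a fragment cut from the reference and then reverse-complemented.
toy_reference = "ATGCCATATTCTACGTTTCCACTGCTGTGGGTCCTGGGGCTCCTCGCGCTATCC"
toy_fragment = reverse_complement(toy_reference[10:40])
print(locate(toy_fragment, toy_reference, k=12))
```

If the two contigs map to non-overlapping coordinate ranges of the same reference, that would fit the "different parts of the same gene" reading; if one of them only gives hits on the "-" strand, reverse-complement it before aligning.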

We did a similar exercise but with two different hits :)

There is some kind of shared domain/site, but @Farbod would need to spend time looking at it more closely.

Thank you @WouterDeCoster, but how? By translating the nucleotides in ExPASy and aligning the proteins, for example?

That would be a start. Translate into all 6 frames. You may need to try all of them to see which works best in alignments to common protein hits. Q07662.1 and P16041.1 look like good candidates. They are from Swiss-Prot.
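If you would rather script the six-frame translation than paste into a web tool, here is a minimal sketch in plain Python. It assumes the standard genetic code (NCBI translation table 1); the function names and the frame labels are my own, chosen to resemble the "5'3' Frame N" / "3'5' Frame N" naming used by web translators. Stop codons are written as `*` and any codon with an ambiguous base as `X`.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string; unknown characters become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def translate(seq: str) -> str:
    """Translate complete codons from position 0; trailing bases are ignored."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict:
    """Return the three forward and three reverse-complement translations."""
    frames = {}
    for strand, oriented in (("5'3'", seq), ("3'5'", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand} Frame {offset + 1}"] = translate(oriented[offset:])
    return frames


# Example with the first line of seq1 from the question.
for name, protein in six_frames(
        "TCTGGGGAGCCCACGTAGCAGCCCATCCCTTCCCCACAGCAGATACTGGGGCCAAAGCAG").items():
    print(name, protein)
```

The frame with the longest stretch free of `*` that also aligns to the Swiss-Prot hit is usually the one to keep for the protein-level comparison.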

The final _1/_2 refer to seq1/seq2 (I had to split this into two posts).

CLUSTAL O(1.2.4) multiple sequence alignment

5'3'_Frame_3_1      XWGAHVAAHPFPTADTGAKAEAPVSGAT-HARS---------------------------
3'5'_Frame_3_1      --------EPFEGVQL-P-PTVQIGRSTLDGTRC---------------QILLFHFCASW
3'5'_Frame_3_2      ------------------------------RTTCPPPARL--EEEC--------------
3'5'_Frame_1_1      -------------------------TSHLKEFSCSRSLQYKLDEAL-TEQDARFYYSTSV
5'3'_Frame_2_2      ----------------------------------------------------------XW
5'3'_Frame_1_1      ------------------------------------------------------------
5'3'_Frame_3_2      ------------------------------------------------------------
5'3'_Frame_2_1      ------------------------------------------------------------
5'3'_Frame_1_2      ------------------------------------------------------------
3'5'_Frame_1_2      -------------------------------------------------GELPALPLRGW
P16041.1            ------------------------------------------------------------
3'5'_Frame_2_2      ------------------------------------------------------------
*

5'3'_Frame_3_1      ------LQ--IRKRALSASRTVLD------------------------------VACR--
3'5'_Frame_3_1      GSSRSHLH--ATSRTVREAESALF------------------------------LICKD-
3'5'_Frame_3_2      ----VALREAVLHPESAVTQRVVC-TQTA-RTVNVSHPANRTLP-WV----VWQETCCGS
3'5'_Frame_1_1      RPGAPRALICMLHPELSARRKALFS---SAKTVHVMWPRKPGPLLWPQYLL-WGRDGLLR
5'3'_Frame_2_2      ------------------------L--VT-----------EVSLCWWHVG---SAAGLLP
5'3'_Frame_1_1      ------------------------------------------------------------
5'3'_Frame_3_2      ------------------------------------------------------------
5'3'_Frame_2_1      ------------------------------------------------------------
5'3'_Frame_1_2      ----------------------------------XVVGY-GLPL---------LVACRIR
3'5'_Frame_1_2      ------------------------R--KSV--------WL-GKLCCIRSLL-LRELCARP
3'5'_Frame_2_1      ------------------------T--KHF--RRNKMPDSTIPLLCVLGLLALSSACYIQ
P16041.1            ------------------------------------MPYSTFPLLWVLGLLALSSACYIQ
3'5'_Frame_2_2      ------------------------------------------------------------

5'3'_Frame_3_1      ----------------------------------------------EREEPQDAQKWNS-
3'5'_Frame_3_1      ------------------------------------------------------------
3'5'_Frame_3_2      YMPPAEGDLSNQPXX---------------------------------------------
3'5'_Frame_1_1      GLPRX-------------------------------------------------------
5'3'_Frame_2_2      NHPL--------RQRSVRWVTDVYCPLGSLGLAHNSLSHSRL--RMQHSFPQSHTL----
5'3'_Frame_1_1      --XSGEPT-QPIPSPQQI------LGPKQRPRFPGPHDMHGLCRSGKERFPPRGQFWM-H
5'3'_Frame_3_2      -------------------------------------------XGWLLRSPSAGGM-D--
5'3'_Frame_1_2      SR------SPAKPPIKAA-----FCSLGD-RLLSSRQSGSSTQLSESQQTPDAAQLPS--
3'5'_Frame_1_2      RLPRGQ-TSVTQRTERCL-----NGWFGRRP---AA------------------------
3'5'_Frame_2_1      NCPRGGKRSFPDLQRPCM-----SCGPGNRGLCFGPSICCGEGMGCYVGSPXX-------
P16041.1            NCPRGGKRSFPDLPRQCM-----SCGPGDRGRCFGPNICCGEGMGCYMGSPEAAGCV---
3'5'_Frame_2_2      ------------------------------------------------------------

5'3'_Frame_3_1      -RIWHLVP----------------------------------SKVLRPICTVGYGYN-TP
3'5'_Frame_3_1      -RACHVAPETG-------------------------------ASALAPVSAVGKGWAATW
3'5'_Frame_3_2      ------------------------------------------------------------
3'5'_Frame_1_1      ------------------------------------------------------------
5'3'_Frame_2_2      ----FLQPRRGRAGSS--------------------------------------------
5'3'_Frame_2_1      -QMRAR----GAPGRTEVE--NLASCSV-SASSNLYCR--LRLQL---------------
5'3'_Frame_1_2      -EPHTLPPASQGEGR-F---S---------------------------------------
3'5'_Frame_1_2      ----------------------------------------------DPTCHQQRETS-VT
3'5'_Frame_2_1      ------------------------------------------------------------
P16041.1            -EENYLPSPCEAGGRVC---GSEGSCA----ASGVCCD--SESCVLDPDCLEDSKRQ-SP
3'5'_Frame_2_2      --ENYLPSPCEAGGRVC---GSEGSCA----ASGVCCD--SESCVLDPDCLEDSKRQ-SP
*

5'3'_Frame_3_1      SNGS-----------------------------
3'5'_Frame_3_1      APQXX----------------------------
3'5'_Frame_3_2      ---------------------------------
3'5'_Frame_1_1      ---------------------------------
5'3'_Frame_2_2      ---------------------------------
5'3'_Frame_1_1      --------ATATTELLQMAR-------------
5'3'_Frame_3_2      S--SSSLAGGGQ----VVL--------------
5'3'_Frame_2_1      -------------NSFKWL--------------
5'3'_Frame_1_2      ---------------------------------
3'5'_Frame_1_2      NHX------------------------------
3'5'_Frame_2_1      ---------------------------------
P16041.1            SEQNAALMGGLAGDLLRILHATSRGRPQ-----
3'5'_Frame_2_2      SEQNAALMGGLAGDLLRILHATSRGRPQ-PTXX
*

One other thing,

Teleosts have experienced a whole-genome duplication (WGD) event (in salmonids even more than one). Does that have any effect on this situation? Are we encountering paralogous or duplicated genes, or some other evolutionary phenomenon, instead of looking at different parts of one LONG gene?

Oh, I guess that's certainly possible. It is also possible that one of the copies starts acquiring mutations and functionally no longer does the same as the other copy.

This is part2 right? After upvoting your order is gone :)

I really appreciate that.

Thank you. It is not the same thing, and each of your answers has its own unique flavor for me.

So, I would ask the same thing that I asked @genomax:

You are telling me that the "tail" of one transcript codes for vasotocin and the "head" of another transcript codes for vasotocin, too. And I am looking at that "tail" and "head", AND this "tail" and "head" that code for the same thing have no sequence similarity?

Am I getting your point correctly?

To be honest, I have no idea. I just did a blast and an alignment to see what came up.


