Question

Same transcript appears on different scaffolds. Why?

7.6 years ago

Kenny ▴ 30

Hi all,

I did a reverse BLAST where my transcriptome is the database and my scaffold is the query. I did this to see where the transcripts fall in the genome and whether there are a lot of intron-like gaps.

When I looked at the output, I found that some transcripts appear on more than one scaffold. What could be the reasons?

grep -A 2 "Sequences producing significant alignments" my_blast_output.txt | grep comp | awk '{print $1}' | sort | uniq -c | sort -rn
      5 comp65253_c1_seq4
      5 comp48571_c0_seq4
      4 comp65721_c3_seq3
      4 comp63218_c0_seq2
      3 comp65722_c0_seq1
      3 comp64106_c2_seq22
      3 comp54658_c0_seq1
      3 comp45777_c0_seq2
      3 comp23829_c0_seq1
      2 comp85529_c0_seq1
      2 comp63346_c0_seq6
      2 comp57872_c0_seq1
      2 comp25489_c0_seq1
      2 comp100860_c0_seq1
      1 comp66091_c0_seq10
      1 comp65186_c0_seq6
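The tally above counts the top hit listed for each query. As a rough alternative sketch, the same question (how many distinct scaffolds does each transcript hit?) can be asked of tabular BLAST output. This assumes the search was re-run with `-outfmt 6`, so that column 1 is the scaffold (query) and column 2 is the transcript (subject); the function name `scaffolds_per_transcript` and the file name in the usage comment are made up for illustration.

```shell
# Count, for each transcript, the number of distinct scaffolds it hits.
# Input (stdin): BLAST tabular output (-outfmt 6), assumed columns:
#   1 = query (scaffold), 2 = subject (transcript)
# Output: "<n_scaffolds><TAB><transcript>", highest counts first.
scaffolds_per_transcript() {
    awk -F'\t' '
        # count each (transcript, scaffold) pair only once
        !seen[$2, $1]++ { n[$2]++ }
        END { for (t in n) printf "%d\t%s\n", n[t], t }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Hypothetical usage (file name is an assumption):
#   scaffolds_per_transcript < my_blast_output.outfmt6.tsv
```

Unlike the `grep -A 2` approach, this counts every reported hit rather than only the first one per query, so the numbers can be higher.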

Also, the alignments did not cover the entire transcript. What could be the reasons?

If you want to see the blast output, here it is: https://www.dropbox.com/s/xv5fnb54kwqi6ii/reverse_blastout_outfmt4_121317_simplified.txt?dl=0

I am fairly new to data analysis, which is why I need more guidance at this stage. What other conclusions can I draw from this BLAST output?

alignment • 1.6k views

updated 7.6 years ago by Matteo Schiavinato ★ 3.7k • written 7.6 years ago by Kenny ▴ 30

score 2 · Answer 1 · 2017-12-19

There are many reasons for this behavior!

First, keep in mind that many genes have a second copy within the same genome. Sometimes it is an additional copy that is activated under stress conditions, sometimes a second copy with a variation that slightly changes the function, and sometimes just a plain second copy. Mapping expression data onto the different copies usually tells you a lot about which copy is actually expressed.

Second, genome assemblies are not perfect. If the genome was assembled from Illumina reads or any other short-read sequencing technology, it can happen that the two scaffolds where you have hits are the same region from the two homologous chromosomes, assembled as separate scaffolds. Try aligning them against each other in a dot plot and see if you get a continuous diagonal line (that would mean they are essentially the same sequence).

Third, you might actually have the same region twice in the genome! This is the main outcome of a segmental duplication: a region is duplicated, and everything in it is present twice.

Fourth: pseudogenes! These are quite common, or at least more common than people think. They are degenerated gene sequences that no longer code for anything and are sometimes not even transcribed. However, the genomic sequence of a pseudogene still resembles that of the original gene (or a fraction of it) to some extent. Blasting with, let's say, a 95% minimum sequence identity might well return pseudogenic hits.

All in all: try mapping your sequences against the genome with a very stringent minimum sequence identity (say 99%). I would expect many of the secondary hits to disappear and some to remain (those might be genuine second copies).
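To make the stringency suggestion concrete, here is a minimal sketch of filtering existing hits by identity and then listing the transcripts that still hit more than one scaffold. It assumes tabular BLAST output (`-outfmt 6`) with the scaffold in column 1, the transcript in column 2 and percent identity in column 3; the function names and the file name `hits.tsv` are hypothetical.

```shell
# Keep only BLAST tabular hits (-outfmt 6) at or above a minimum percent identity.
# Assumed columns: 1 = scaffold (query), 2 = transcript (subject), 3 = percent identity.
# The threshold defaults to 99 if no argument is given.
filter_by_identity() {
    min_identity="${1:-99}"
    awk -F'\t' -v min="$min_identity" '($3 + 0) >= (min + 0)'
}

# List transcripts that still hit more than one distinct scaffold
# after the identity filter (candidates for genuine second copies).
multi_scaffold_transcripts() {
    filter_by_identity "$1" |
        awk -F'\t' '
            # count each (transcript, scaffold) pair only once
            !seen[$2, $1]++ { n[$2]++ }
            END { for (t in n) if (n[t] > 1) print t }
        ' | LC_ALL=C sort
}

# Hypothetical usage (hits.tsv is an assumed file name):
#   multi_scaffold_transcripts 99 < hits.tsv
```

If the search is re-run anyway, `blastn` also has a `-perc_identity` option that applies the cutoff during the search rather than afterwards. Filtering on identity alone ignores alignment length, so short but perfect matches will survive this filter.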