Question

How to deal with single transcripts assigned to multiple genes?

5.6 years ago

antoinefelden ▴ 60

After running HISAT2 followed by StringTie on a non-model RNA-seq data set (Argentine ant), I realised that some StringTie transcripts were assigned to multiple genes (see example below), which is causing many problems down the line.

24887   MSTRG.1473 LOC105670921
24920   MSTRG.1473 LOC105670793
25000   MSTRG.1473 LOC105670784
...
27182   MSTRG.1603 LOC105671758
27194   MSTRG.1603 LOC105671753

Because I could not find any score that would help me to select the best match between a StringTie transcript and assigned genes, my first approach (although knowing it was wrong) was to select the first pair, and discard the others. But I realised that it is a widespread issue in my dataset, so I don't feel comfortable at all doing this.
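To gauge how widespread the issue is, here is a minimal sketch that lists every StringTie ID paired with more than one reference gene. It assumes a whitespace-separated table with three columns (row index, StringTie ID, reference gene name), laid out like the excerpt above; the function name `multi_gene_ids` is just illustrative.

```python
from collections import defaultdict


def multi_gene_ids(lines):
    """Return {StringTie ID: sorted reference genes} for IDs paired with >1 gene."""
    genes_by_id = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and placeholder rows such as "..."
        _, mstrg_id, ref_gene = fields
        genes_by_id[mstrg_id].add(ref_gene)
    # Keep only the ambiguous IDs, i.e. those linked to several reference genes
    return {k: sorted(v) for k, v in genes_by_id.items() if len(v) > 1}


# The rows from the excerpt above; in practice, read them from your own table file
excerpt = """\
24887   MSTRG.1473 LOC105670921
24920   MSTRG.1473 LOC105670793
25000   MSTRG.1473 LOC105670784
...
27182   MSTRG.1603 LOC105671758
27194   MSTRG.1603 LOC105671753
"""

ambiguous = multi_gene_ids(excerpt.splitlines())
for mstrg_id, genes in ambiguous.items():
    print(mstrg_id, len(genes), ",".join(genes))
```

Comparing `len(ambiguous)` with the total number of StringTie IDs gives a rough idea of how much of the data set is affected.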

Another detail that may be useful: when I looked at what these genes were, they did not seem to be homologous, but they always seem to be located in the same genomic region. See for yourself in the example below with MSTRG.1473.

Is there any proper way to deal with that?

MSTRG.1473 is simultaneously assigned to these three genes circled in green:

RNA-Seq StringTie hisat2 • 1.5k views


score 3 · Accepted Answer · 2018-09-21

Okay, I solved that problem, which was not really a problem but in fact a feature of StringTie.

See https://github.com/gpertea/stringtie/issues/170

In a nutshell, there is an alternative, simpler StringTie pipeline that skips the assembly step. It simply estimates expression for the transcripts already in the reference annotation from the mapped reads, without looking for novel transcripts (which were the ones matching several loci in the case discussed above). It's quick and dirty because it discards a lot of potentially interesting data, but that's what I wanted.
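For anyone wanting to try the assembly-free route, here is a rough sketch of the per-sample command, built in Python so it can be looped over samples. The flags `-e`, `-B`, `-G`, `-o` and `-p` are StringTie options; the file names and the helper `stringtie_quant_cmd` are placeholders of mine, not taken from the linked issue, so adapt them to your own layout.

```python
import shlex


def stringtie_quant_cmd(bam, annotation, out_gtf, threads=4):
    """Build a StringTie command that quantifies reference transcripts only."""
    return [
        "stringtie",
        "-e",                 # estimate abundance only for transcripts given with -G
        "-B",                 # also write Ballgown table files next to the output GTF
        "-p", str(threads),   # number of processing threads
        "-G", annotation,     # reference annotation (GTF/GFF)
        "-o", out_gtf,        # per-sample output GTF
        bam,                  # coordinate-sorted HISAT2 alignments
    ]


# Hypothetical file names, for illustration only
cmd = stringtie_quant_cmd(
    "sample1.sorted.bam", "reference.gff", "quant/sample1/sample1.gtf"
)
print(shlex.join(cmd))

# To actually run it (requires StringTie on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Because no assembly or merge step is run, the output should only carry the gene IDs from the annotation, so no MSTRG IDs need to be matched back to reference genes.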