Question

Stringtie2: how to interpret conflicts between results

4.2 years ago

nlehmann ▴ 140

Hi,

I am using StringTie2 to improve the annotation of the chicken genome (galGal6) with long-read data (Oxford Nanopore, ONT). I inspected some genes of interest in IGV and found conflicts between the tracks that I am struggling to interpret.

First example:

In the first example, we have the following tracks (from top to bottom):

1- the overall coverage of the ONT reads;
2- the ONT reads;
3- Gene (NCBI reference): the original reference GTF file;
4- GTF obtained with stringtie --merge (NCBI reference merged with the stringtie results from track 5). Command: stringtie --merge -p 20 -G $GFF -o merged_stringtie.out.gtf lr_guided_allbam.out.gtf
5- GTF obtained with stringtie (ONT data + NCBI reference). Command: stringtie -L -p 20 -G $GFF -o lr_guided_allbam.out.gtf $INPUT
6- GTF obtained with stringtie (ONT data only). Command: stringtie -L -p 20 -o lr_allbam.out.gtf $INPUT

My first question: why is there such a conflict between tracks 4 and 5? I could not find anything in the StringTie documentation that explains this situation. It looks like StringTie gives priority to the original reference when there is a conflict. Is that right? Is there a way to modify or quantify these situations with StringTie or any other tool?

Second example:

In this second example, the tracks are ordered the same way as above (with the same stringtie command for each track). Here, the result of the merge (track #4) is very satisfying: the gene has been extended towards the 3' end. However, comparing tracks #5 and #6, why are the signals detected in the two cases so different? In track #5, we would have expected stringtie to add an annotation in the 5' part of the gene, because some signal is detected there, as shown in track #6.

Thanks a lot for your help in understanding these results.

RNA-Seq long-reads de novo annotation stringtie • 2.1k views

updated 4.1 years ago by Kevin Blighe 87k • written 4.2 years ago by nlehmann ▴ 140

score 0 · Answer 1 · 2020-03-01

4.1 years ago

Kevin Blighe 87k

It is difficult for anybody here to answer these questions for you: they are very specific to your dataset, to which we do not have access. Please explore what happens when --merge is used. How StringTie behaves with --merge depends on numerous things, such as the number of samples in your study, coverage over the regions, etc.
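One way to explore what --merge changes is to compare the two GTF files directly. The following is a minimal sketch, not part of StringTie: it assumes standard tab-separated GTF with `exon` rows and a `transcript_id` attribute, and the function names `intron_chains` and `lost_after_merge` are hypothetical helpers introduced here for illustration. It lists multi-exon transcripts from the guided assembly whose exact intron chain no longer appears in the merged file.

```python
import re
from collections import defaultdict


def intron_chains(gtf_text):
    """Return {transcript_id: (chrom, strand, introns)} for multi-exon transcripts.

    Assumes tab-separated GTF text with 'exon' rows carrying a transcript_id.
    """
    exons = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        key = (match.group(1), fields[0], fields[6])
        exons[key].append((int(fields[3]), int(fields[4])))

    chains = {}
    for (tid, chrom, strand), coords in exons.items():
        coords.sort()
        # An intron spans the gap between two consecutive exons.
        introns = tuple(
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(coords, coords[1:])
        )
        if introns:  # single-exon transcripts have no intron chain
            chains[tid] = (chrom, strand, introns)
    return chains


def lost_after_merge(assembled_gtf, merged_gtf):
    """IDs of assembled transcripts whose intron chain is absent after merging."""
    merged = set(intron_chains(merged_gtf).values())
    return sorted(
        tid
        for tid, chain in intron_chains(assembled_gtf).items()
        if chain not in merged
    )
```

With the files from the question, this could be called as `lost_after_merge(open("lr_guided_allbam.out.gtf").read(), open("merged_stringtie.out.gtf").read())`. Note that the sketch ignores single-exon transcripts and does not compare transcript start or end positions, so a gene that was only extended at one end would not be reported.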

Kevin

Thanks for your reply. That's what I am investigating now. I wondered whether this was expected behavior for StringTie, but it seems that it is not so obvious.

Reply by nlehmann ▴ 140, 4.1 years ago

I would label this as a 'quirk' of your dataset and of how StringTie functions. These things are expected in bioinformatics: no single program or algorithm can account for the intricacies of every dataset. Looking at your screenshot, the coverage over that region is not high, so that may be the key factor in this case. Are those reads primary or secondary alignments? What is their MAPQ? If you hover the mouse cursor over them in IGV, you'll see more info.
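The same check can be done in bulk rather than read by read. Below is a minimal sketch that classifies alignments from SAM-formatted text lines (for example, the text output of `samtools view` for the region of interest) using the standard SAM FLAG bits and the MAPQ column. The function names and the `min_mapq=20` cutoff are assumptions chosen for illustration, not values recommended by StringTie.

```python
from collections import Counter

# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_alignment(sam_line):
    """Return (kind, mapq) for one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & UNMAPPED:
        kind = "unmapped"
    elif flag & SECONDARY:
        kind = "secondary"
    elif flag & SUPPLEMENTARY:
        kind = "supplementary"
    else:
        kind = "primary"
    return kind, mapq


def summarize_alignments(sam_lines, min_mapq=20):
    """Count alignment kinds and primary alignments below min_mapq.

    min_mapq is an arbitrary illustrative threshold.
    """
    counts = Counter()
    low_mapq_primary = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip header lines
            continue
        kind, mapq = classify_alignment(line)
        counts[kind] += 1
        if kind == "primary" and mapq < min_mapq:
            low_mapq_primary += 1
    return counts, low_mapq_primary
```

A region dominated by secondary alignments or by primary alignments with low MAPQ would suggest that the reads are ambiguously placed, which could contribute to the differences between tracks.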

Reply by Kevin Blighe 87k, 4.1 years ago