Question

StringTie: To use a reference gene-set or not?

1

7.7 years ago

i.sudbery 19k

The advice was always that Cufflinks should be run in RABT (reference annotation based transcript assembly) mode, using a reference gene set to guide the assembly of new transcripts.

I've been exploring StringTie as an alternative to Cufflinks, and I have to say I'm very impressed with its speed: a job that took Cufflinks 6.5 hours on 8 cores took StringTie 10 minutes on 4. This is quicker even than the StringTie authors' benchmarks. I wonder if part of the reason for the speed is that I was using a reference gene set (the Ensembl 85 GTF). I also wonder if this is why StringTie seems to be failing to assemble some transcripts that Cufflinks does assemble (and that look real to me).

Does anyone else have any experience of the effect of including a reference transcriptome in a StringTie assembly? What about in a StringTie merge?

RNA-Seq stringtie • 6.4k views

ADD COMMENT • link updated 7.6 years ago by Biostar 20 • written 7.7 years ago by i.sudbery 19k

0

I'm looking for the same answers as well.

Thanks again for recommending StringTie. I was stuck with Cufflinks for more than a week, and now I finally have some results in about 10 minutes with StringTie (without a reference, by the way)!

But when I looked in IGV, loading the tracks for my TopHat2 BAM output for the different samples along with the GTF that StringTie computed, I think it sometimes misses genes for some reason: I have a lot of aligned reads at the same location in all conditions (all samples), with enough coverage, but StringTie doesn't predict anything there...

So I was wondering whether I should add a reference so that StringTie can detect these, and maybe also change some parameters?

Is my question the same as yours, or have I misunderstood?

ADD REPLY • link 7.7 years ago by Rox ★ 1.4k

0

I've more or less come to the conclusion that the reference doesn't make much difference either way (I'd still love to hear from people with other experiences). Rather, the filtering of isoforms was being caused by the stringtie parameter -f, which sets the minimum abundance of a transcript as a fraction of the most abundant transcript at its locus, together with the stringtie --merge parameters -F and -T, which set the minimum FPKM and TPM of an input transcript. Filtering on both FPKM and TPM seems a bit unnecessary to me, so I left the TPM filter at 1 but set the FPKM filter to 0. I also relaxed the isoform fraction filter a little.
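For anyone wanting to try the same settings, here is a minimal sketch of how the two commands could be put together. It only builds the argument lists and does not run anything. The file names, the thread count, and the isoform fraction value of 0.05 are placeholders chosen for illustration, not the values used above (the exact relaxed -f value was not stated); the helper function names are hypothetical.

```python
import shlex


def stringtie_assemble_cmd(bam, out_gtf, ref_gtf=None, min_iso_frac=None, threads=4):
    """Build a per-sample `stringtie` assembly command as an argument list."""
    cmd = ["stringtie", bam, "-o", out_gtf, "-p", str(threads)]
    if ref_gtf is not None:
        cmd += ["-G", ref_gtf]  # guide the assembly with a reference annotation
    if min_iso_frac is not None:
        cmd += ["-f", str(min_iso_frac)]  # minimum isoform fraction at a locus
    return cmd


def stringtie_merge_cmd(gtf_list, out_gtf, ref_gtf=None,
                        min_fpkm=0.0, min_tpm=1.0, min_iso_frac=None):
    """Build a `stringtie --merge` command as an argument list.

    gtf_list is a text file naming one per-sample GTF per line.
    """
    cmd = ["stringtie", "--merge", "-o", out_gtf,
           "-F", str(min_fpkm),  # minimum input transcript FPKM
           "-T", str(min_tpm)]   # minimum input transcript TPM
    if ref_gtf is not None:
        cmd += ["-G", ref_gtf]
    if min_iso_frac is not None:
        cmd += ["-f", str(min_iso_frac)]
    cmd.append(gtf_list)
    return cmd


# Placeholder file names and an assumed -f value, for illustration only.
assemble = stringtie_assemble_cmd("sample1.bam", "sample1.gtf",
                                  ref_gtf="reference.gtf", min_iso_frac=0.05)
merge = stringtie_merge_cmd("gtf_list.txt", "merged.gtf",
                            ref_gtf="reference.gtf", min_fpkm=0.0, min_tpm=1.0)
print(shlex.join(assemble))
print(shlex.join(merge))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Leaving `ref_gtf` as `None` gives the reference-free variant, which makes it easy to compare assemblies with and without the annotation.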

There is still one particular isoform I can't get StringTie to call, but there are also others that it calls and Cufflinks doesn't, so I suppose I'll just have to live with that.

ADD REPLY • link 7.7 years ago by i.sudbery 19k

0

That's why we should use StringTie + Trinity (or something else) to get accurate results, right?

Okay, so I guess I'm going to play with these parameters and see what happens! I also asked a question about this on GitHub; I'll let you know what answers I get, if you want!

ADD REPLY • link 7.7 years ago by Rox ★ 1.4k

1

I took the liberty of having a look at your StringTie GitHub issue. I think I can see your problem: the reads you show where no transcript is assembled are drawn as hollow by IGV. I think this means they have a mapping quality of 0; possibly they are multi-mappers and this is some sort of repeat sequence?
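One way to check this outside IGV is to count how many reads in the region have MAPQ 0 or report more than one hit. The sketch below is only an illustration under a few assumptions: it reads SAM-format text lines (for example, the output of `samtools view` on the region of interest), it takes MAPQ from the fifth column, and it assumes the aligner wrote the optional `NH:i:` tag for the number of reported alignments. The function name is hypothetical.

```python
def summarize_mapq(sam_lines):
    """Count mapped reads, MAPQ-0 reads and multi-mapped reads in SAM text lines."""
    mapped = mapq0 = multi = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 set: read is unmapped
        mapped += 1
        if int(fields[4]) == 0:
            mapq0 += 1
        # Optional NH tag: number of reported alignments for this read.
        nh = next((int(f[5:]) for f in fields[11:] if f.startswith("NH:i:")), None)
        if nh is not None and nh > 1:
            multi += 1
    return {"mapped": mapped, "mapq0": mapq0, "multi_mapped": multi}


# Tiny made-up example records, for illustration only.
example = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:7",
    "r3\t16\tchr1\t120\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(summarize_mapq(example))
```

If most reads in the pile turn out to have MAPQ 0 or an NH value above 1, that would be consistent with the multi-mapper explanation; it does not by itself prove the region is a repeat.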

Back when I worked with people who were trying to annotate lncRNAs, they would get a lot of these very short piles of reads in the middle of nowhere. It turned out that these were mis-mapped reads that should have been mapped across splice junctions, so that's another possibility (and again one that would be solved by a Trinity approach).