Question: StringTie: To use a reference gene-set or not?
Asked 4.4 years ago by i.sudbery (Sheffield, UK):

It was always the advice that Cufflinks should be run in RABT mode using a reference gene set to guide the assembly of new transcripts.

I've been exploring StringTie as an alternative to Cufflinks, and I have to say I'm very impressed with its speed: a job that took Cufflinks 6.5 hours on 8 cores took StringTie 10 minutes on 4. This is quicker even than the StringTie authors' benchmarks. I wonder if part of the reason for the speed is that I was using a reference gene set (the Ensembl 85 GTF). I also wonder if this is why StringTie seems to be failing to assemble some transcripts that Cufflinks does assemble (and that look real to me).

Does anyone else have any experience of the effect of including a reference transcriptome in a StringTie assembly? What about in a StringTie merge?

rna-seq stringtie • 4.7k views
modified 4.3 years ago by Biostar • written 4.4 years ago by i.sudbery

I'm looking for the same answers as well.

Thanks again for recommending StringTie. I was stuck with Cufflinks for more than a week, and now I finally have some results after about 10 minutes with StringTie (without a reference, by the way)!

But when I looked in IGV, with tracks for the TopHat2 BAM files of my different samples plus the GTF that StringTie computed, I think it sometimes misses genes for some reason: I have a lot of aligned reads at the same location in all conditions (all samples), with enough coverage, but StringTie doesn't predict anything there...

So I was wondering whether I should add a reference so that StringTie can detect these, and maybe also change some parameters?

Is your question the same as mine, or have I misunderstood?

written 4.4 years ago by Rox

I've more or less come to the conclusion that the reference doesn't make much difference either way (I'd still love to hear from people with other experiences). Rather, what was filtering out isoforms was the stringtie parameter -f, which sets the minimum fraction of a locus's output that must come from a particular transcript, together with the stringtie --merge parameters -F and -T, which set the minimum FPKM and TPM of a transcript. Filtering on both FPKM and TPM seems a bit unnecessary to me, so I left the TPM filter at 1 but set the FPKM filter to 0. I also relaxed the isoform fraction filter a little.
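To make the flags above concrete, here is a minimal Python sketch that only builds the command lines (it does not run StringTie). The helper names, file names and the -f value of 0.05 are hypothetical placeholders, not the settings actually used in this thread; only the -F 0 and -T 1 values come from the comment above. Defaults differ between StringTie versions, so check `stringtie -h` before copying anything.

```python
import shlex


def assembly_cmd(bam, out_gtf, reference_gtf=None, min_isoform_fraction=None, threads=4):
    """Build a per-sample `stringtie` assembly command as an argument list."""
    cmd = ["stringtie", bam, "-o", out_gtf, "-p", str(threads)]
    if reference_gtf is not None:
        cmd += ["-G", reference_gtf]  # guide the assembly with a reference annotation
    if min_isoform_fraction is not None:
        cmd += ["-f", str(min_isoform_fraction)]  # minimum isoform fraction per locus
    return cmd


def merge_cmd(sample_gtfs, out_gtf, reference_gtf=None, min_fpkm=None, min_tpm=None):
    """Build a `stringtie --merge` command as an argument list."""
    cmd = ["stringtie", "--merge", "-o", out_gtf]
    if reference_gtf is not None:
        cmd += ["-G", reference_gtf]
    if min_fpkm is not None:
        cmd += ["-F", str(min_fpkm)]  # minimum FPKM for a transcript to be kept
    if min_tpm is not None:
        cmd += ["-T", str(min_tpm)]  # minimum TPM for a transcript to be kept
    return cmd + list(sample_gtfs)


if __name__ == "__main__":
    # Hypothetical file names; 0.05 is an illustrative value only.
    print(shlex.join(assembly_cmd("sample1.bam", "sample1.gtf",
                                  reference_gtf="ref.gtf",
                                  min_isoform_fraction=0.05)))
    # FPKM filter off, TPM filter left at 1, as described in the comment above.
    print(shlex.join(merge_cmd(["sample1.gtf", "sample2.gtf"], "merged.gtf",
                               reference_gtf="ref.gtf", min_fpkm=0, min_tpm=1)))
```

Dropping `reference_gtf` from either call gives the reference-free variant, so the same inputs can be assembled both ways and the resulting GTFs compared.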

There is still a particular isoform I can't get stringtie to call, but there are also others it calls but cufflinks doesn't, so I suppose I'll just have to live with that.

written 4.4 years ago by i.sudbery

That's why we should use StringTie + Trinity (or something else) in order to get accurate results, right?

Okay, so I guess I'm going to play with these parameters and see what happens! I also asked a question about this on GitHub; I'll let you know what answers I get, if you want!

written 4.4 years ago by Rox

I took the liberty of having a look at your StringTie GitHub issue. I think I can see your problem: the reads you show in the region where no transcript is assembled are drawn as hollow by IGV. I think this means they have a mapping quality of 0. Possibly they are multi-mappers, and this is some sort of repeat sequence?
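One way to check the mapping-quality hypothesis outside IGV is to count MAPQ 0 reads in the region directly. The sketch below is a rough illustration that parses SAM-format text lines (MAPQ is the fifth column) and also counts reads whose optional NH:i tag is greater than 1, assuming the aligner writes that tag. The function name and the toy records are made up for the example.

```python
def mapq_summary(sam_lines):
    """Count mapped reads, MAPQ 0 reads, and reads with NH:i > 1 in SAM text lines."""
    total = mapq0 = multi = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 set: read is unmapped
        total += 1
        if int(fields[4]) == 0:
            mapq0 += 1
        for tag in fields[11:]:
            # NH:i is an optional tag giving the number of reported alignments
            if tag.startswith("NH:i:") and int(tag[5:]) > 1:
                multi += 1
                break
    return {"mapped": total, "mapq0": mapq0, "multi_mapped": multi}


if __name__ == "__main__":
    # Toy records for illustration only, not real data.
    toy = [
        "@HD\tVN:1.0",
        "r1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
        "r2\t0\tchr1\t105\t0\t50M\t*\t0\t0\t*\t*\tNH:i:4",
        "r3\t16\tchr1\t110\t0\t50M\t*\t0\t0\t*\t*\tNH:i:2",
        "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(mapq_summary(toy))
```

For real data, the function can be given `sys.stdin` with the output of `samtools view` for the region of interest piped in. A high share of MAPQ 0 reads would support the multi-mapper explanation.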

Back when I used to work with people who were trying to annotate lncRNAs, they would get a lot of these very short piles of reads in the middle of nowhere. It turned out that these were mis-mapped reads that should have been mapped across splice junctions, so that's another possibility (and again one that a Trinity approach would solve).

written 4.4 years ago by i.sudbery

That would be great. Cheers.

written 4.4 years ago by i.sudbery