Question

De novo transcript assembly from multiple RNA-seq datasets

0

9.2 years ago

joao.curado ▴ 10

Hi everyone

I have 100 RNA-seq datasets from 100 human individuals (diseased and controls). I would like to discover novel transcripts (especially ones that are expressed only in diseased individuals). What is the best approach?

1) Run StringTie 100 times (once per sample) and then merge the results?

2) Merge the 100 BAMs and run StringTie once (if it's possible to run it...)

3) Something else?

I would like to optimize the discovery of lowly expressed transcripts. Do you have any ideas? All feedback is welcome.

Thank you very much

RNA-Seq transcript assembly stringtie • 2.7k views

3

9.2 years ago

andrew.j.skelton73 6.6k

Novel discovery is always a tricky task. If you haven't done so already, I'd optimise your analysis of known genes and transcripts before you start looking into novel elements. Cufflinks and StringTie are your two major options: run one of them on each of your samples, then merge the results. Once they're merged, you have a GTF file that contains every potential transcript across all your samples (heads up, it'll likely be a lot). You can then quantify against this GTF file and perform differential expression.

Finding lowly expressed novel transcripts from RNA-seq data is an extremely difficult challenge, as the nature of short-read sequencing makes it very difficult in the first place! The depth of your sequencing will determine what you can discover and quantify. The disease you're studying and what you're sampling (blood, a specific tissue, isolated cells) will also affect what you can find. The good news is that you have a pretty decent sample size with 100 samples!
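To make the "assemble each sample, then merge" idea concrete, here is a minimal sketch that only builds the StringTie command lines (it does not run them). The function name, output directory, and file names are hypothetical, and it assumes you have a reference annotation GTF to guide assembly with `-G`; adjust to your own layout.

```python
from pathlib import PurePosixPath


def stringtie_commands(bams, reference_gtf, out_dir="assemblies", threads=4):
    """Build per-sample assembly commands, the merge command, and the GTF list text.

    Nothing is executed here; paths and names are illustrative assumptions.
    """
    out = PurePosixPath(out_dir)
    per_sample, gtfs = [], []
    for bam in bams:
        gtf = out / (PurePosixPath(bam).stem + ".gtf")
        gtfs.append(str(gtf))
        # One reference-guided assembly per sample.
        per_sample.append(
            ["stringtie", str(bam), "-G", reference_gtf, "-o", str(gtf), "-p", str(threads)]
        )
    list_file = out / "gtf_list.txt"
    # --merge reads a text file that lists one per-sample GTF per line.
    merge = ["stringtie", "--merge", "-G", reference_gtf,
             "-o", str(out / "merged.gtf"), str(list_file)]
    return per_sample, merge, "\n".join(gtfs) + "\n"
```

You would write the returned list text to `assemblies/gtf_list.txt`, then pass each command to something like `subprocess.run(cmd, check=True)`, per-sample commands first and the merge last.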

score 2 · Accepted Answer · 2016-04-13

2

9.2 years ago

joao.curado ▴ 10

Thank you very much for your answer. These are blood samples and the sequencing is not very deep: 80 million unstranded reads (and, as expected, a big portion maps to the hemoglobin genes...)

So you think it's better to run the samples separately and merge afterwards. What is the advantage compared with option 2?

3

Blood samples without haemoglobin depletion mean that you'll probably see a very large proportion of reads map to haemoglobin genes! The general consensus is that assembling sample by sample is the usual protocol, mainly because there's person-to-person variance, and merging the BAMs first may mask a transcript that's unique to a small proportion of people (for example). If you do find that you're clutching at straws, you could always merge the normal samples and run transcript discovery, repeat for your disease samples, merge the two annotations, and quantify. 80 million reads isn't bad; unfortunately, the major issue you'll have is the proportion that map to haemoglobin.

If I were in your situation, here's what I'd do:

First Step:

Quantify against reference using Kallisto or Salmon
Read into DESeq2
Check proportion of reads that map to haemoglobin
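As a rough illustration of the haemoglobin check, the sketch below computes the fraction of counts assigned to haemoglobin genes for one sample. It assumes you have gene-level counts keyed by gene symbol (for example, exported after summarising the Kallisto or Salmon output to gene level); the function name is hypothetical and the counts are made-up toy numbers, not real data.

```python
# Human haemoglobin gene symbols; extend or trim this set to suit your annotation.
HAEMOGLOBIN_GENES = {"HBA1", "HBA2", "HBB", "HBD", "HBG1", "HBG2",
                     "HBE1", "HBZ", "HBM", "HBQ1"}


def haemoglobin_fraction(gene_counts, genes=HAEMOGLOBIN_GENES):
    """Return the fraction of all counts that fall on haemoglobin genes."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0  # avoid dividing by zero for an empty sample
    hb = sum(count for gene, count in gene_counts.items() if gene in genes)
    return hb / total


# Toy example with invented counts for a single sample.
counts = {"HBB": 600, "HBA1": 150, "HBA2": 50, "ACTB": 120, "GAPDH": 80}
print(f"{haemoglobin_fraction(counts):.1%}")  # prints 80.0%
```

Running this per sample gives a quick view of how much sequencing depth is effectively left for everything else.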

Second Step:

Differential Expression at Gene Level using DESeq2 (with your sample size, you may have to use limma-voom instead; you'll have to see how it goes)
Differential Transcript Expression using Sleuth - See what you get, you might be surprised.

Third Step:

Run Cufflinks, or StringTie on each sample
Merge to get merged GTF
Build a new Kallisto or Salmon transcriptome index from the transcripts in the merged GTF
Quantify the fastq files against the "novel" transcriptome index
Sleuth Differential Transcript Expression
If you find anything, you'll need to inspect the candidates in IGV and decide whether they look real, then follow up with qPCR or whatever validation method you prefer.
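The indexing and quantification part of the third step could look roughly like the sketch below, which again only builds command lines. It makes several assumptions that are not in the steps above: transcript sequences are extracted from the merged GTF with `gffread` (Kallisto indexes a FASTA, not a GTF), the reads are paired-end, and the function, index, and file names are all hypothetical. The `-b` bootstraps are included because Sleuth uses them.

```python
def novel_index_commands(merged_gtf, genome_fasta, samples, index="novel.idx",
                         transcripts_fa="merged_transcripts.fa", bootstraps=100):
    """Build commands to index the merged transcriptome and quantify each sample.

    `samples` maps a sample name to a (read1, read2) pair of FASTQ paths.
    Nothing is executed here; all file names are illustrative.
    """
    # Write spliced transcript sequences for every transcript in the merged GTF.
    extract = ["gffread", "-w", transcripts_fa, "-g", genome_fasta, merged_gtf]
    # Build the Kallisto index from those sequences.
    build = ["kallisto", "index", "-i", index, transcripts_fa]
    # Quantify each paired-end sample with bootstraps for Sleuth.
    quant = [
        ["kallisto", "quant", "-i", index, "-o", f"quant/{name}",
         "-b", str(bootstraps), read1, read2]
        for name, (read1, read2) in samples.items()
    ]
    return [extract, build] + quant
```

For single-end data, `kallisto quant` would instead need `--single` together with `-l` and `-s` for the fragment length mean and standard deviation.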

ADD REPLY • link 9.2 years ago by andrew.j.skelton73 6.6k

0

This is very helpful, Andrew. I'll give it a try. Thanks for the feedback. I'll report back on how it went.

ADD REPLY • link 9.2 years ago by joao.curado ▴ 10