Question: De novo transcript assembly from multiple RNA-seq datasets
4.0 years ago by
joao.curado0 wrote:

Hi everyone

I have 100 RNA-seq datasets from 100 human individuals (diseased and controls). I would like to discover novel transcripts (especially ones that are only expressed in diseased individuals). What is the best approach?

1) run StringTie 100 times (once per sample) and then merge?

2) merge the 100 BAMs and run StringTie once (if it's even possible to run it...)

3) other?

I would like to optimize the discovery of lowly expressed transcripts. Do you have any ideas? All feedback is welcome.

Thank you very much

4.0 years ago by
joao.curado0 wrote:

Thank you very much for your answer. These are blood samples and the sequencing is not so deep: 80 million unstranded reads (but, as expected, a big portion maps to the hemoglobin genes...)

So you think it's better to run the samples separately and merge afterwards. What is the advantage compared with option 2?


Blood samples without haemoglobin depletion mean you'll probably see a very large proportion of reads map to haemoglobin genes! The general consensus is to run assembly sample by sample, mainly because of person-to-person variance: merging first may mask a transcript that's unique to a small proportion of people (for example). If you do find that you're clutching at straws, you could always merge the normal samples, run transcript discovery, repeat for your disease state, merge the two, and quantify. 80 million reads isn't bad; the major issue you'll have is the proportion that map to haemoglobin, unfortunately.

If I were in your situation, here's what I'd do:

First Step:

  • Quantify against reference using Kallisto or Salmon

  • Read into DESeq2

  • Check proportion of reads that map to haemoglobin
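The haemoglobin check can be sketched roughly like this in Python, assuming Salmon was used (its quant.sf output is a tab-separated table with Name, Length, EffectiveLength, TPM and NumReads columns). The gene list, the transcript-to-gene mapping and the toy numbers are all made up for illustration; you'd build the mapping from your own annotation.

```python
import csv
import io

# Assumption: gene symbols for the main haemoglobin genes; adjust to your annotation.
HAEMOGLOBIN_GENES = {"HBA1", "HBA2", "HBB", "HBD", "HBG1", "HBG2"}


def haemoglobin_fraction(quant_sf_text, tx2gene):
    """Return the fraction of estimated reads assigned to haemoglobin transcripts.

    quant_sf_text: contents of a Salmon quant.sf file (tab-separated, with header).
    tx2gene: dict mapping transcript ID -> gene symbol (hypothetical helper input).
    """
    total = 0.0
    haemoglobin = 0.0
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    for row in reader:
        reads = float(row["NumReads"])
        total += reads
        if tx2gene.get(row["Name"]) in HAEMOGLOBIN_GENES:
            haemoglobin += reads
    # Guard against an empty table
    return haemoglobin / total if total else 0.0


# Toy example: transcript IDs and values are invented, not real data.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx_hbb\t600\t450\t700000\t600\n"
    "tx_hba1\t550\t400\t250000\t200\n"
    "tx_other\t2000\t1850\t50000\t200\n"
)
tx2gene = {"tx_hbb": "HBB", "tx_hba1": "HBA1", "tx_other": "GAPDH"}

fraction = haemoglobin_fraction(example, tx2gene)
print(f"{fraction:.0%} of reads assigned to haemoglobin")
```

Running this per sample gives you a quick feel for how much usable depth is left once haemoglobin is accounted for.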

Second Step:

  • Differential Expression at Gene Level using DESeq2 (with your sample size, you may have to use limma-voom instead; you'll have to see how it goes)

  • Differential Transcript Expression using Sleuth - See what you get, you might be surprised.

Third Step:

  • Run Cufflinks, or StringTie on each sample

  • Merge to get merged GTF

  • Create a new Kallisto or Salmon transcriptome index from the transcripts in the merged GTF

  • Quantify the FASTQ files against the "novel" transcriptome index

  • Sleuth Differential Transcript Expression

  • If you find anything, you'll basically need to check them out in IGV and decide if they look real. Chase up using qPCR, or whatever method you want.
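As a rough sketch of the third step, here's a small Python helper that only builds the command lines (it doesn't run anything). It assumes StringTie for assembly, Kallisto for quantification, paired-end FASTQ files, and gffread to pull transcript sequences out of the merged GTF (gffread is my addition, since the index needs a FASTA). The function name, file names and the "novel" working directory are all hypothetical.

```python
def build_plan(samples, ref_gtf, genome_fa, workdir="novel"):
    """Build the per-sample assembly, merge, index and quantification commands.

    samples: dict of sample_id -> (bam, fastq_1, fastq_2); paired-end is an assumption.
    Returns a list of argv lists, in the order they should be run.
    """
    plan = []
    sample_gtfs = []

    # 1. Assemble each sample separately, guided by the reference annotation
    for sample_id, (bam, _fq1, _fq2) in sorted(samples.items()):
        out_gtf = f"{workdir}/{sample_id}.gtf"
        plan.append(["stringtie", bam, "-G", ref_gtf, "-o", out_gtf])
        sample_gtfs.append(out_gtf)

    # 2. Merge the per-sample assemblies into one GTF
    merged_gtf = f"{workdir}/merged.gtf"
    plan.append(["stringtie", "--merge", "-G", ref_gtf, "-o", merged_gtf] + sample_gtfs)

    # 3. Extract transcript sequences for the merged annotation
    transcripts_fa = f"{workdir}/merged_transcripts.fa"
    plan.append(["gffread", "-w", transcripts_fa, "-g", genome_fa, merged_gtf])

    # 4. Build a new index from those transcripts
    index = f"{workdir}/merged.idx"
    plan.append(["kallisto", "index", "-i", index, transcripts_fa])

    # 5. Quantify each sample's reads against the new index
    #    (-b 100 requests bootstraps, which Sleuth uses downstream)
    for sample_id, (_bam, fq1, fq2) in sorted(samples.items()):
        out_dir = f"{workdir}/quant/{sample_id}"
        plan.append(["kallisto", "quant", "-i", index, "-o", out_dir, "-b", "100", fq1, fq2])

    return plan


# Two made-up samples; in practice this dict would hold all 100.
samples = {
    "ctrl_01": ("ctrl_01.bam", "ctrl_01_R1.fastq.gz", "ctrl_01_R2.fastq.gz"),
    "dis_01": ("dis_01.bam", "dis_01_R1.fastq.gz", "dis_01_R2.fastq.gz"),
}
plan = build_plan(samples, "reference.gtf", "genome.fa")
for cmd in plan:
    print(" ".join(cmd))
```

Each entry can be handed to subprocess.run(cmd, check=True) once the output directories exist. Keeping the plan as data makes it easy to eyeball the commands before committing to 100 samples' worth of compute.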

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by andrew.j.skelton735.9k

this is very helpful Andrew. I'll give it a try. Thanks for the feedback. I'll say how it went in the end.

ADD REPLYlink written 4.0 years ago by joao.curado0
4.0 years ago by
andrew.j.skelton735.9k wrote:

Novel discovery is always a tricky task. If you haven't done so already, I'd optimise your analysis of known genes / transcripts before you start looking into novel elements. Cufflinks and StringTie are your two major options: run one of them on each of your samples, then merge the results. Once they're merged, you have a GTF file that shows every potential transcript in all your samples (heads up, it'll likely be a lot). You can then quantify against this GTF file and perform differential expression. Finding lowly expressed novel transcripts from RNA-seq data is an extremely difficult challenge, as the nature of short-read sequencing makes it very difficult in the first place! The depth of your sequencing will determine what you can discover and quantify. The disease you're studying and what you're sampling (blood, a specific tissue, isolated cells) will also affect what you can find. The good news is that you have a pretty decent sample size with 100 samples!
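Once you have quantified against the merged GTF, a quick first screen for "expressed only in diseased individuals" might look like the Python sketch below. This is just a heuristic filter to shortlist things to look at, not a replacement for proper differential expression testing. The function name, the TPM cut-off and the minimum number of diseased samples are arbitrary assumptions, and the toy values are invented.

```python
def disease_only_candidates(tpm, conditions, min_tpm=1.0, min_diseased=2):
    """Shortlist transcripts seen in diseased samples but in no control.

    tpm: dict of transcript_id -> {sample_id: TPM}
    conditions: dict of sample_id -> "disease" or "control"
    min_tpm, min_diseased: arbitrary thresholds (assumptions, tune for your data)
    """
    candidates = []
    for transcript_id, per_sample in tpm.items():
        diseased_hits = sum(
            1 for sample, value in per_sample.items()
            if conditions[sample] == "disease" and value >= min_tpm
        )
        control_hits = sum(
            1 for sample, value in per_sample.items()
            if conditions[sample] == "control" and value >= min_tpm
        )
        # Keep transcripts detected in enough diseased samples and in no control
        if diseased_hits >= min_diseased and control_hits == 0:
            candidates.append(transcript_id)
    return sorted(candidates)


# Toy data with made-up transcript IDs and TPM values.
conditions = {"d1": "disease", "d2": "disease", "c1": "control", "c2": "control"}
tpm = {
    "tx_A": {"d1": 3.2, "d2": 1.5, "c1": 0.0, "c2": 0.1},  # diseased only
    "tx_B": {"d1": 5.0, "d2": 4.0, "c1": 6.0, "c2": 5.5},  # expressed everywhere
    "tx_C": {"d1": 2.0, "d2": 0.0, "c1": 0.0, "c2": 0.0},  # only one diseased sample
}

shortlist = disease_only_candidates(tpm, conditions)
print(shortlist)
```

Requiring more than one diseased sample helps filter out one-off assembly artefacts, at the cost of missing transcripts that really are specific to a single individual.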
