Question: RNA-seq data for de-novo transcript assembly
23 months ago by
liorglic130 wrote:

I'm rather new to RNA-seq analysis and more familiar with DNA sequencing.
I'd like to perform de-novo assembly of transcripts from publicly available (i.e. published) RNA-seq data in tomato and its wild relatives. The reasons I need to do that are:
a. I want to discover novel genes not present in the reference genome.
b. Wild relatives have no reliable reference.
The final purpose is actually genome annotation, and assembled transcripts will be used as input for annotation pipelines.
Now, there's quite a lot of raw data out there resulting from RNA-seq experiments. My problem is that I don't know which data sets are suitable for de-novo assembly. I know that many studies are designed to quantify expression levels of pre-defined genes, but I am currently not interested in that and would just like to get a sense of what data can be used for my purpose in terms of:
- Sequencing coverage
- Read length (in short-read and long-read technologies data)
- Strand-specific sequencing
- Other factors I'm not aware of?
I guess there are no definitive answers here, but there should be some standard. For example, in DNA genome assembly, you can't do much with, say, 5x coverage. But I understand that for transcript assembly, coverage that is too deep is also a problem (although one that is quite easy to solve). That's the kind of advice I'm looking for.
Thank you very much!

rna-seq rna assembly • 929 views
modified 23 months ago by lieven.sterck6.9k • written 23 months ago by liorglic130

The reference assembly for the tomato genome should be quite OK. Do you have any reason to expect you might find a substantial number of novel genes not present in the genome?

written 23 months ago by lieven.sterck6.9k

Yes, if I look at varieties/cultivars other than the one used to produce the reference (Heinz). This has not yet been done in tomato, but in other organisms (e.g. rice and maize) non-reference cultivars showed a substantial number of novel genes not found in the reference.

written 23 months ago by liorglic130

Looking for resistance genes, are we? :P

written 23 months ago by cschu1812.1k

No, not particularly...

written 23 months ago by liorglic130

Was worth a shot >:D

written 23 months ago by cschu1812.1k
23 months ago by
VIB, Ghent, Belgium
lieven.sterck6.9k wrote:
  • Sequence coverage:

Difficult to assess. I feel there is not really an upper or lower limit, and usually it's a matter of cost. On the other hand, it's often hard to reliably estimate the expected or desired coverage, as estimating the 'transcribed' part of the genome is not straightforward. Here you are in a bit of a blessed position, as you can use the tomato reference as a proxy. Too much coverage should not pose too many problems, as most transcript assembly tools should be able to deal with it, as well as with very uneven coverage on a per-transcript basis (with DNA-seq you expect somewhat even coverage across the genome; with a transcriptome you will have much more variation, for biological reasons).

  • Read length

Here it's the longer the better (surprise surprise ;) ). If you have a choice, I would certainly go for paired-end reads, and rather 150bp (or even 250bp) than 75 or so. If you have the possibility to go for long-read technologies (ONT, PacBio), those are certainly preferred over any short-read data, even if they come with lower coverage.

  • strand specific

Nice to have but not really crucial, I would say. If you do have that kind of data, make sure you use an assembly approach/tool that takes this kind of information into account.
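
To illustrate the strand-specific point with one possible tool: Trinity, a commonly used de novo transcriptome assembler, takes a `--SS_lib_type` option for stranded libraries (`RF` or `FR` for paired-end data, as far as I know). The sketch below only builds the command line and does not run anything. The file names, memory and CPU values are placeholders, the helper `trinity_command` is hypothetical, and the correct library type depends on the library prep of each data set, so check the study's methods and `Trinity --help` before relying on it.

```python
from typing import List, Optional


def trinity_command(
    left: str,
    right: str,
    ss_lib_type: Optional[str] = None,  # "RF" or "FR" for stranded paired-end data
    max_memory: str = "50G",            # placeholder value
    cpu: int = 8,                       # placeholder value
    output: str = "trinity_out",        # placeholder output directory
) -> List[str]:
    """Build (but do not run) a Trinity command for paired-end FASTQ input."""
    if ss_lib_type is not None and ss_lib_type not in ("RF", "FR"):
        raise ValueError("paired-end strand type must be 'RF' or 'FR'")
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", output,
    ]
    # Only pass strand information when the library is known to be stranded.
    if ss_lib_type is not None:
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(trinity_command("reads_1.fq", "reads_2.fq", ss_lib_type="RF")))
```

Leaving `ss_lib_type` unset gives the unstranded command, which is the safer default when the protocol of a public data set is not documented.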

written 23 months ago by lieven.sterck6.9k

Thank you. This is very helpful.
As I said, I'm not planning on producing new RNA-seq data right now, but rather on using data available from various DBs. So, for example, I found a data set comprising ~2.8Gb of sequencing data, with reads of length 61. Would you consider assembling this, or would you say it's not enough?

written 23 months ago by liorglic130

If it's the only one you have for a certain cultivar or experiment, then yes, I would consider it. People have been doing this (successfully) when 61bp was the only read length available. If, on the other hand, you also have longer-read data for the same setup, I would prefer those (or consider merging them).
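
For a sense of scale, the numbers in the question can be turned into a read count and a very rough mean coverage. The 2.8 Gb and 61 bp figures come from the post above. The size of the transcribed part of the genome used here (50 Mb) is a made-up placeholder, not a tomato estimate, and should be replaced with something derived from the reference annotation.

```python
def read_count(total_bases: float, read_length: int) -> float:
    """Approximate number of reads in a data set of fixed read length."""
    return total_bases / read_length


def mean_coverage(total_bases: float, transcribed_bases: float) -> float:
    """Average depth if reads were spread evenly over the transcribed region."""
    return total_bases / transcribed_bases


total_bases = 2.8e9               # ~2.8 Gb, from the question
read_length = 61                  # from the question
assumed_transcribed_bases = 50e6  # ASSUMPTION: placeholder, not a measured value

print(f"reads: ~{read_count(total_bases, read_length) / 1e6:.1f} million")
print(f"mean coverage: ~{mean_coverage(total_bases, assumed_transcribed_bases):.0f}x")
```

Keep in mind that this is only an average: as noted above, coverage per transcript is very uneven, so highly expressed genes will sit far above this figure and lowly expressed ones far below it.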

written 23 months ago by lieven.sterck6.9k