I'm rather new to RNA-seq analysis and more familiar with DNA sequencing.
I'd like to perform de novo assembly of transcripts from publicly available (i.e. published) RNA-seq data in tomato and its wild relative. The reasons I need to do that are:
a. I want to discover novel genes not present in the reference genome.
b. Wild relatives have no reliable reference.
The final purpose is actually genome annotation, and assembled transcripts will be used as input for annotation pipelines.
Now, there's quite a lot of raw data out there resulting from RNA-seq experiments. My problem is that I don't know which data sets are suitable for de-novo assembly. I know that many studies are designed to quantify expression levels of pre-defined genes, but I am currently not interested in that and would just like to get a sense of what data can be used for my purpose in terms of:
- Sequencing coverage
- Read length (in short-read and long-read technologies data)
- Strand-specific sequencing
- Other factors I'm not aware of?
I guess there are no definitive answers here, but there should be some standard. For example, in DNA genome assembly, you can't do much with, say 5x coverage. But I understand that for transcript-assembly, too deep is also a problem (although quite easy to solve). That's the kind of advice I'm looking for.
Thank you very much!
- Sequence coverage:
Difficult to assess. I feel there is not really an upper or lower limit, and usually it's a matter of cost. It's also often hard to reliably estimate the expected or desired coverage, as estimating the transcribed part of the genome is not straightforward. Here you are in a bit of a lucky position, as you can use the tomato reference as a proxy. Too much coverage should not pose too many problems: most transcript assembly tools should be able to deal with it, as well as with very uneven coverage on a per-transcript basis. With DNA-seq you expect somewhat even coverage all over the genome, whereas with a transcriptome you will have much more variation, for biological reasons.
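To make the "reference as a proxy" idea concrete, here is a rough back-of-the-envelope sketch. It assumes you take the summed length of the annotated transcripts in the reference as the size of the transcribed part; the function name and the numbers in the example are placeholders of mine, not values for tomato. Keep in mind this only gives an average, and per-transcript coverage will deviate a lot from it.

```python
def expected_coverage(n_reads: int, read_length: int,
                      transcriptome_bp: int, paired: bool = True) -> float:
    """Average depth = total sequenced bases / transcribed bases.

    n_reads          -- number of reads (or read pairs if paired=True)
    read_length      -- length of a single read in bp
    transcriptome_bp -- assumed size of the transcribed part, e.g. the
                        summed length of annotated reference transcripts
    """
    if transcriptome_bp <= 0:
        raise ValueError("transcriptome_bp must be positive")
    mates = 2 if paired else 1
    total_bases = n_reads * read_length * mates
    return total_bases / transcriptome_bp


# Placeholder numbers: 20 million pairs of 150 bp reads against an
# assumed 50 Mb of transcribed sequence.
print(expected_coverage(20_000_000, 150, 50_000_000))
```

Running the same calculation on the read counts reported for a public data set gives a quick feel for whether it is in a sensible range before you download anything.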
- Read length
Here it's the longer the better (surprise surprise ;) ). If you have a choice, I would certainly go for paired-end reads, and 150 bp (or even 250 bp) rather than 75 or so. If you have the possibility to go for long-read technologies (ONT, PacBio), those are certainly preferred over any short-read data, even if they come with lower coverage.
- Strand-specific
Nice to have, but not really crucial I would say. If you do have that kind of data, make sure you use an assembly approach/tool that takes this kind of information into account.
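If you end up with a long list of candidate runs, the preferences above can be turned into a simple ordering. The following is only a sketch of one possible way to do it: the `Run` fields, the accession names and the exact priority order (long reads first, then paired-end, then read length, with strandedness as a tie-breaker) are my own assumptions for illustration, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class Run:
    accession: str      # hypothetical identifiers, not real accessions
    platform: str       # "short" or "long"
    paired: bool
    read_length: int    # bp (typical read length for long reads)
    stranded: bool


def sort_key(run: Run):
    # Tuples compare element by element, so earlier fields dominate:
    # long reads > paired-end > longer reads > strand-specific.
    return (run.platform == "long", run.paired, run.read_length, run.stranded)


def rank_runs(runs):
    """Return runs ordered from most to least preferred."""
    return sorted(runs, key=sort_key, reverse=True)


runs = [
    Run("RUN_A", "short", False, 75, False),
    Run("RUN_B", "short", True, 150, True),
    Run("RUN_C", "long", False, 2000, False),
    Run("RUN_D", "short", True, 150, False),
]
for run in rank_runs(runs):
    print(run.accession)
```

This deliberately ignores coverage and biological factors such as tissue or condition, so treat it as a first pass for sorting metadata rather than a final selection.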