Question

RNA-seq long reads vs. transcripts

1

6.2 years ago

tunl ▴ 80

Long-read technologies claim that their RNA-seq reads are full-length transcripts, so transcript assembly is no longer needed.

I am wondering what may be the differences between long reads and transcripts: are they truly equivalent?

If I want to get transcripts (in GTF format) from long reads, what are the possible ways to do that?

If I use reference-based transcript assemblers (such as StringTie or Cufflinks) to get transcripts from long reads, what kind of “assembly” does the assembler perform if the long reads are already full-length transcripts by themselves? Are the transcripts produced by those assemblers different from the original long reads?

Your advice and suggestions would be greatly appreciated!

Thank you very much!

RNA-Seq long reads assembly • 4.2k views

1

I am wondering what may be the differences between long reads and transcripts: are they truly equivalent?

Assembled transcripts may be a good enough approximation; whether they are depends on your standards for acceptance. Reconstructing the original long transcripts from smaller fragments is tricky, especially when the transcripts share long stretches of sequence (exons).
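To make the shared-exon problem concrete, here is a minimal toy sketch. It assumes two made-up isoforms that differ by one skipped exon and treats every substring of length k as a possible short fragment. The sequences and the function names `fragments` and `ambiguous_fraction` are hypothetical and only serve the illustration.

```python
from collections import Counter

def fragments(seq, k):
    """All distinct substrings of length k (a stand-in for short reads)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def ambiguous_fraction(isoforms, k):
    """Fraction of distinct length-k fragments that occur in more than one isoform."""
    counts = Counter()
    for seq in isoforms.values():
        counts.update(fragments(seq, k))
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)

# Toy exons (made-up sequences, not real genes)
E1, E2, E3 = "ATGGCCAAGT", "CCGTTAGGCA", "TTGACCTGAA"
isoforms = {"full": E1 + E2 + E3, "skip_E2": E1 + E3}

# Short fragments are often compatible with both isoforms;
# a fragment as long as the shorter isoform is not.
for k in (4, 8, 20):
    print(k, round(ambiguous_fraction(isoforms, k), 2))
```

In this toy case a fragment that lies entirely inside a shared exon cannot be assigned to one isoform, while a read spanning the whole molecule can. Real data adds sequencing errors and truncated reads, which this sketch ignores.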

If you are truly looking for long, validated transcripts, then PacBio's Iso-Seq may be the proven option. Nanopore will also give you long reads, but it still has a higher error rate, so your results may be more error-prone.

6.2 years ago by GenoMax 141k

1

The recent LoReAn (Long Read Annotation) paper takes a different approach. They run a standard short-read annotation pipeline (BRAKER with EVidenceModeler, PASA, etc.), then map the Iso-Seq data with GMAP to fix and finalise the gene models, identify alternative splicing, recover UTRs, and so on. They don't even bother assembling the Iso-Seq data.
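As an illustration of why mapping can stand in for assembly, a spliced alignment of a single long read already implies a set of exon coordinates. The sketch below is not LoReAn's code. It assumes the alignment is available as a SAM-style 1-based start position plus a CIGAR string, where `N` marks an intron; the function name `exon_blocks` is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM conventions)
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def exon_blocks(start, cigar):
    """Return 1-based inclusive (start, end) exon blocks for a spliced alignment.

    start: 1-based leftmost reference position (SAM POS field).
    cigar: SAM CIGAR string; N is a reference skip, i.e. an intron.
    """
    blocks = []
    block_start = start
    pos = start  # next reference position to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":   # consumes reference inside an exon
            pos += length
        elif op == "N":    # intron: close the current exon, skip ahead
            blocks.append((block_start, pos - 1))
            pos += length
            block_start = pos
        # I, S, H and P do not consume reference positions
    blocks.append((block_start, pos - 1))
    return blocks

print(exon_blocks(100, "50M200N30M"))
```

Small indels inside exons are absorbed into the block here; whether the resulting splice sites are trustworthy is a separate question that depends on read accuracy and the aligner.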

https://www.biorxiv.org/content/early/2017/12/08/230359

6.2 years ago by Philipp Bayer 8.3k

0

This is great information! Thank you so much for your advice! I’ll look into this paper. Thanks a lot!

6.2 years ago by tunl ▴ 80

0

Thank you very much for your advice! I was wondering if there are some ways to get transcripts (in GTF format) from long reads instead of running assemblers on them?

I tried running different assemblers on the same long-read dataset (PacBio) and got very different results (some assemblers output many more transcripts from the long reads than others). So I am not sure which assembly output more closely represents the original long reads.
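One way to put a number on how far two outputs disagree is to compare their intron chains (the ordered list of introns in each multi-exon model). This is a minimal sketch, assuming the models have already been parsed into dictionaries of `transcript_id -> (chrom, strand, exons)` with 1-based inclusive exon coordinates. The names `intron_chain` and `chain_overlap` and the toy coordinates are hypothetical.

```python
def intron_chain(exons):
    """Ordered introns implied by a list of 1-based inclusive exon blocks."""
    exons = sorted(exons)
    return tuple((exons[i][1] + 1, exons[i + 1][0] - 1)
                 for i in range(len(exons) - 1))

def chain_overlap(set_a, set_b):
    """Compare two sets of transcript models by their distinct intron chains."""
    def chains(models):
        # Single-exon models have no introns and are ignored here
        return {(chrom, strand, intron_chain(exons))
                for chrom, strand, exons in models.values() if len(exons) > 1}
    a, b = chains(set_a), chains(set_b)
    union = a | b
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }

# Toy models standing in for two assembler outputs
out_a = {
    "t1": ("chr1", "+", [(100, 149), (350, 379)]),
    "t2": ("chr1", "+", [(100, 149), (250, 300), (350, 379)]),
}
out_b = {
    "x1": ("chr1", "+", [(90, 149), (350, 400)]),  # same introns, different ends
    "x2": ("chr1", "+", [(500, 900)]),             # single exon, ignored
}
print(chain_overlap(out_a, out_b))
```

Comparing intron chains deliberately ignores differences in transcript ends, which tend to vary between tools even when the splicing structure agrees.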

Since they claim assembly is no longer needed for long reads, I would assume there may be some ways to get the transcripts (in GTF format) directly from long reads without running assembly?

Thanks a lot!

6.2 years ago by tunl ▴ 80

3

GTF is a format used to describe gene annotations. It comes after a transcript has been validated as complete/reasonable (by whatever means you find acceptable). If I understand the aim of your comment above correctly, you can't get GTF-format information directly from raw or assembled reads unless additional validation is done on them.
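The format itself is mechanical once exon coordinates for a model have been accepted. A minimal sketch of writing one transcript as GTF records follows, using the standard nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes). The function name `gtf_lines`, the `source` label, and the IDs are hypothetical, and nothing here validates the model.

```python
def gtf_lines(chrom, strand, exons, gene_id, transcript_id, source="long_read"):
    """Format one transcript and its exons as GTF lines.

    exons: list of 1-based inclusive (start, end) tuples, in any order.
    """
    exons = sorted(exons)
    attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    t_start, t_end = exons[0][0], exons[-1][1]
    # Score and frame are unknown here, so both are written as "."
    lines = ["\t".join([chrom, source, "transcript",
                        str(t_start), str(t_end), ".", strand, ".", attrs])]
    for start, end in exons:
        lines.append("\t".join([chrom, source, "exon",
                                str(start), str(end), ".", strand, ".", attrs]))
    return lines

for line in gtf_lines("chr1", "+", [(350, 379), (100, 149)], "G1", "G1.1"):
    print(line)
```

Writing the file is the easy part; deciding which exon sets deserve a record is where the validation mentioned above comes in.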

6.2 years ago by GenoMax 141k

0

Thanks so much! So this means we would have to run an assembler on the long reads in order to automatically get transcripts in GTF from them (or are there other automated ways to do so?). However, since I found that different assemblers produce quite different outputs from the same long-read dataset, it seems that the assembled results can deviate significantly from the original long reads, depending on the algorithm the assembler uses. So I am still looking for an automated way to get transcripts in GTF from long reads without running assemblers, if one exists. Thanks again!

6.2 years ago by tunl ▴ 80

0

Assembling transcripts is half the battle. You need to validate them by blasting against a reference genome/transcriptome or the nr/nt databases. If you have a closely related genome/transcriptome available, your job will be easier; otherwise, be ready to do plenty of careful analysis. There would not be a lot that is "automatic" about this process.

That said, recovering very long transcripts is possible. Not sure what kind/quality of data you have and what organism you are working with.

6.2 years ago by GenoMax 141k