Question

Novel transcripts from monoculture RNAseq applied to tissue-level RNAseq

4.2 years ago

matt.a.bennett25890 ▴ 20

Hi all,

Bit of an essay, apologies...

I've applied a novel transcript discovery pipeline to RNAseq data derived from a cell type grown as a monoculture with/without treatment in vitro, particularly focusing on lncRNAs:

Basic pipeline (open to comments/criticism!): STAR mapping of reads to GENCODEv26 indexed hg38 -> remove non-expressed transcripts from GENCODEv26 -> StringTie to merge abundance-filtered GENCODEv26 transcripts with new transcripts -> remove known ORFs -> CPC/HMMER/RNAcode -> annotated + new lncRNAs

This has yielded some nice data: ~35% of my differentially expressed genes are newly assembled lncRNAs. I also now have what is basically a customised annotation specific to this cell type.

I would now like to see whether this is relevant in some in vivo data. I have found tissue-level data which will contain a variable amount of the cell type I started with, as well as other cell types. My approach so far would be to quantify the reads in this dataset against my new customised annotation with RSEM. A bit messy, but I think it is enough to show my lncRNAs are active in a real-world situation. I'm also having doubts, though, which any comments on the questions below may help with!

1) Are there any approaches to estimate cellular makeup in tissue-level data based on cell-specific markers?

2) Is this just too naive an approach to be useful?

3) I could run the pipeline again on the in vivo dataset, but it isn't stranded. Would this mess up transcript discovery too much?

Would appreciate any input, thanks for reading :)

RNA-Seq transcript discovery • 705 views

updated 4.2 years ago by benformatics 3.9k • written 4.2 years ago by matt.a.bennett25890 ▴ 20

score 1 · Answer 1 · 2020-02-04

2) Something that wasn't clear, but that you should absolutely do if you haven't already, is to overlap your "novel" elements with the elements in the current GENCODE annotation; release v33 is available. You could also check the RefSeq and Ensembl annotations. This will answer the question "Are these transcripts actually novel?" Also, why are you removing only known ORFs? Shouldn't you be removing the whole transcript/cDNA, including the 5' and 3' UTRs?
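As a rough illustration of that novelty check, here is a minimal Python sketch that compares intron chains between an assembled GTF and a reference GTF. It is only a toy under stated assumptions: the function names are made up for this example, the input is assumed to be standard tab-separated GTF exon lines carrying a `transcript_id` attribute, and a dedicated tool such as gffcompare does this far more thoroughly.

```python
from collections import defaultdict


def intron_chains(gtf_lines):
    """Map transcript_id -> (chrom, strand, introns) using GTF exon lines only."""
    exons = defaultdict(list)
    meta = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Pull the transcript_id value out of the attribute column.
        tid = fields[8].split('transcript_id "')[1].split('"')[0]
        exons[tid].append((int(fields[3]), int(fields[4])))
        meta[tid] = (fields[0], fields[6])
    chains = {}
    for tid, ex in exons.items():
        ex.sort()
        # An intron runs from just after one exon to just before the next.
        introns = tuple(
            (a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ex, ex[1:])
        )
        chains[tid] = meta[tid] + (introns,)
    return chains


def split_matched_unmatched(assembled_gtf, reference_gtf):
    """Return (matched_ids, unmatched_ids) for the assembled transcripts.

    A transcript is 'matched' when its chromosome, strand and full intron
    chain are identical to a reference transcript. Single-exon transcripts
    have no introns, so this sketch always leaves them as 'unmatched'.
    """
    ref = {c for c in intron_chains(reference_gtf).values() if c[2]}
    matched, unmatched = [], []
    for tid, chain in intron_chains(assembled_gtf).items():
        (matched if chain[2] and chain in ref else unmatched).append(tid)
    return sorted(matched), sorted(unmatched)
```

Anything left in the unmatched list is only a candidate: it would still need checking against the other annotations (RefSeq, Ensembl) before being called novel.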

1) Estimating cellular makeup from bulk RNA-seq is difficult, and that difficulty is one of the major drivers behind the rise of single-cell sequencing. There are, however, methods available that use knowledge gained from scRNA-seq.
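To give a feel for the general idea behind marker-based deconvolution, here is a toy sketch that models a bulk sample as a non-negative mixture of per-cell-type marker profiles and solves for the proportions with projected gradient descent. This is an assumption-laden illustration, not the algorithm of any particular published tool: the function name, the signature matrix layout and the fixed iteration count are all invented for the example, and real data would need normalisation and noise handling.

```python
def estimate_fractions(signature, bulk, n_iter=5000):
    """Estimate cell-type proportions by non-negative least squares.

    signature[g][c]: expression of marker gene g in pure cell type c
                     (e.g. derived from scRNA-seq or sorted populations).
    bulk[g]:         expression of gene g in the mixed tissue sample.
    Returns a list of non-negative proportions that sum to 1.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # 1 / trace(S^T S) never exceeds 1 / L, where L is the Lipschitz
    # constant of the gradient, so the iteration cannot diverge.
    step = 1.0 / sum(v * v for row in signature for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual between the predicted and the observed bulk profile.
        resid = [
            sum(s * w for s, w in zip(row, weights)) - b
            for row, b in zip(signature, bulk)
        ]
        # Gradient of 0.5 * ||S w - bulk||^2 with respect to w.
        grad = [
            sum(signature[g][c] * resid[g] for g in range(n_genes))
            for c in range(n_types)
        ]
        # Gradient step followed by projection onto w >= 0.
        weights = [max(0.0, w - step * gr) for w, gr in zip(weights, grad)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights
```

The estimate is only as good as the signature matrix: if the markers are not really specific to each cell type, the proportions will be unreliable.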

3) Yes and no. If your novel transcripts lie outside known genes then it would work. However, at all points you would need to treat your dataset as if every read and discovered transcript could come from either strand (i.e. both + and -; essentially unknown, or *).
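One way to act on this is to flag, ignoring strand entirely, which of your novel transcripts overlap a known gene, since those are the ones unstranded reads cannot assign cleanly. A minimal sketch, assuming simple `(name, chrom, start, end, strand)` tuples and a hypothetical helper name; strand is reported as `.` to mark it as unknown:

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def flag_for_unstranded(transcripts, known_genes):
    """Classify transcripts for use with unstranded data.

    transcripts, known_genes: lists of (name, chrom, start, end, strand).
    Returns {name: (strand_to_report, overlapping_gene_names)}.
    The strand of both the transcript and the gene is deliberately ignored,
    because unstranded reads could have come from either strand.
    """
    result = {}
    for name, chrom, start, end, _strand in transcripts:
        hits = [
            gene[0]
            for gene in known_genes
            if gene[1] == chrom and overlaps(start, end, gene[2], gene[3])
        ]
        result[name] = (".", hits)
    return result
```

Transcripts with an empty hit list sit outside known genes and should be reasonably safe to quantify; the rest share their locus with a known gene on one strand or the other, so their read counts are ambiguous.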