Question

Is this a good way to find non-coding transcripts from stranded RNA-seq data?

2

4.4 years ago

c_u ▴ 520

I have an Illumina TruSeq Stranded Total RNA library (human). My goal is to find novel (and non-novel) non-coding transcripts in my data (experimental vs control).

After a LOT of Google-fu and asking questions on this website, this is the methodology that I am currently using -

1. Align the FASTQ files to hg38 with STAR.
2. Assemble transcripts for each sample, merge the transcripts from all samples (to get a unified transcriptome that represents all the samples), and estimate transcript abundances, all using StringTie (protocol paper - https://www.nature.com/articles/nprot.2016.095#procedure).
3. Use tximport to infer integer counts from the StringTie transcript abundances and pass them to DESeq2.
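
For the last step, it may help to see roughly how a read count is recovered from StringTie's output. The sketch below assumes the usual conversion from per-base coverage to a read count (coverage times transcript length divided by read length), which, as far as I know, is the arithmetic behind StringTie's prepDE.py script and tximport's StringTie mode. The function name and the default read length of 75 are illustrative assumptions; set the read length to match your own library.

```python
def estimate_read_count(coverage: float, transcript_length: int,
                        read_length: int = 75) -> float:
    """Approximate the number of reads behind a StringTie coverage value.

    coverage          -- average per-base coverage reported for the transcript
    transcript_length -- transcript length in bases
    read_length       -- sequencing read length (75 is only a placeholder)
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Total sequenced bases on the transcript, divided by bases per read.
    return coverage * transcript_length / read_length


# Hypothetical transcript: 10x coverage over 1500 bases with 75 bp reads.
print(estimate_read_count(10.0, 1500, 75))
```

The result is a non-integer estimate in general; how it is rounded before DESeq2 sees it depends on the import tool.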

I wanted to know if this methodology makes sense. Is there any step for which a different method would work better? I hope my question is not too broad, given that I do specify the exact pipeline I am employing :)

RNA-Seq assembly • 2.6k views

Updated 4.4 years ago by padwalmk ▴ 140 • written 4.4 years ago by c_u ▴ 520

0

In a previous question (https://www.biostars.org/p/407788/), I did discover that strand information is not something that matters while mapping with STAR

Reply • 4.4 years ago by c_u ▴ 520

1

4.4 years ago

Kristoffer Vitting-Seerup ★ 4.0k

Your approach makes perfect sense (I've written about some of the details of your workflow here if you want to double-check). Now you just need to identify the novel transcripts and determine whether they are coding or not.

Novel transcripts will be called "MSTRG" by StringTie (unless you change it). Known transcripts will be called whatever they are called in the reference you provided (e.g. "ENST" for the human Ensembl reference).
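
As a minimal sketch of that naming rule, the snippet below splits a list of transcript IDs into novel and known sets by prefix. The function name and the example IDs are made up for illustration, and the prefix would need changing if you set a custom label in StringTie.

```python
def split_by_origin(transcript_ids, novel_prefix="MSTRG"):
    """Separate StringTie-assembled (novel) IDs from reference IDs by prefix."""
    novel, known = [], []
    for tid in transcript_ids:
        if tid.startswith(novel_prefix):
            novel.append(tid)
        else:
            known.append(tid)
    return novel, known


# Illustrative IDs only; real ones come from the merged GTF.
ids = ["MSTRG.12.1", "ENST00000000001", "MSTRG.40.2"]
novel, known = split_by_origin(ids)
print("novel:", novel)
print("known:", known)
```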

As for coding vs non-coding, there are some tools for that: CPAT and CPC2 spring to mind.

Also remember that a lot of datasets have isoform switches, where a coding transcript is used in one condition and there is a switch to a non-coding transcript in another condition. If you have conditional data, my R package IsoformSwitchAnalyzeR would easily let you analyze such switches (directly using the data you already have).

1

4.4 years ago

Konstantinos Yeles ▴ 110

Maybe you could use the workflow of the derfinder package, if you are familiar with recount2 and the Rail-RNA aligner. It is an annotation-agnostic workflow that could probably be used to find novel transcripts.

0

4.4 years ago

swbarnes2 14k

I don't understand why the stringtie --rf or --fr command line options won't do what you need.
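
For reference, StringTie documents `--rf` as assuming an fr-firststrand library and `--fr` as assuming fr-secondstrand. A small lookup like the one below can keep that choice explicit in a pipeline script. The helper name and dictionary are hypothetical. dUTP-based kits such as TruSeq Stranded are usually described as fr-firststrand, but treat that as an assumption and confirm it on your own data (for example with RSeQC's infer_experiment.py).

```python
# Library type -> StringTie strand flag (None means no flag is passed).
STRINGTIE_STRAND_FLAG = {
    "fr-firststrand": "--rf",
    "fr-secondstrand": "--fr",
    "unstranded": None,
}


def stringtie_strand_args(library_type):
    """Return the strand-related arguments to add to a stringtie command."""
    if library_type not in STRINGTIE_STRAND_FLAG:
        raise ValueError(f"unknown library type: {library_type!r}")
    flag = STRINGTIE_STRAND_FLAG[library_type]
    return [flag] if flag else []


print(stringtie_strand_args("fr-firststrand"))
```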

0

Thanks swbarnes2! I looked up the --fr option and it does look appropriate for my case. I can modify the question to remove that part. My main question here is about the whole pipeline itself, whether it makes sense. Thank you!!

Reply • 4.4 years ago by c_u ▴ 520

score 7 · Accepted Answer · 2019-11-21

Hi, you can follow the pipeline below for a complete analysis:

1. Align reads to the genome using STAR (already completed).
2. Run StringTie to assemble transcripts with default parameters (which filter out transcripts below 1 FPKM).
3. Merge all transcriptome assemblies into one merged assembly (stringtie --merge).
4. Extract the protein-coding genes in GTF format from the Ensembl GTF.
5. Download the already known non-coding RNAs (Ensembl + LNCipedia).
6. Merge all known non-coding annotations into one GTF file (using cuffmerge).
7. Run gffcompare to compare your merged assembly GTF against the protein-coding GTF and the known non-coding GTF.
8. Extract the classes "u", "i" and "x" with a bash script (for more details, check the gffcompare class codes).
9. Extract the MSTRG IDs and pull out the corresponding GTF records using grep or something similar.
10. Extract a FASTA file for the novel non-coding candidates using gffread.
11. Filter out transcripts shorter than 200 nt and single-exon transcripts.
12. Use CPAT and PLEK to predict the coding potential of the novel transcripts from their sequences.
13. Further filter the transcripts with an NCBI BLAST homology search against known proteins (use only 3 frames).
14. This is your final list of transcripts.
15. Rerun StringTie with the -e -B options to get coverage, FPKM and TPM values.
16. Run DESeq2 for differential expression.
17. Annotate the novel transcripts by looking at nearby genes with the BEDOPS closest-features tool.
18. Further assess significance by computing the correlation between protein-coding genes and your novel transcripts.
19. Run gene enrichment analysis on the nearby genes and protein-coding genes to understand the mechanism.
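
The filtering in steps 8, 9 and 11 can be sketched in one pass over gffcompare's `.tmap` table instead of a bash script. This is only a sketch under assumptions: it expects a tab-separated table with a header containing `class_code`, `qry_id`, `num_exons` and `len` columns (the layout I understand gffcompare to write), and the function name, thresholds and example rows are all illustrative.

```python
import csv
import io


def select_lncrna_candidates(tmap_handle, class_codes=("u", "i", "x"),
                             novel_prefix="MSTRG", min_length=200,
                             min_exons=2):
    """Return query transcript IDs that pass the class, novelty,
    exon-count and length filters."""
    reader = csv.DictReader(tmap_handle, delimiter="\t")
    keep = []
    for row in reader:
        if row["class_code"] not in class_codes:
            continue  # step 8: keep only intergenic/intronic/antisense
        if not row["qry_id"].startswith(novel_prefix):
            continue  # step 9: keep only StringTie-named transcripts
        if int(row["num_exons"]) < min_exons:
            continue  # step 11: drop single-exon transcripts
        if int(row["len"]) < min_length:
            continue  # step 11: drop transcripts shorter than 200 nt
        keep.append(row["qry_id"])
    return keep


# Made-up rows in the assumed .tmap layout, for demonstration only.
header = ["ref_gene_id", "ref_id", "class_code", "qry_gene_id", "qry_id",
          "num_exons", "FPKM", "TPM", "cov", "len", "major_iso_id",
          "ref_match_len"]
rows = [
    ["-", "-", "u", "MSTRG.1", "MSTRG.1.1", "3", "2.1", "3.0", "5.5", "1200", "MSTRG.1.1", "0"],
    ["GENE_A", "TX_A", "=", "MSTRG.2", "MSTRG.2.1", "4", "9.0", "8.0", "20.0", "2500", "MSTRG.2.1", "2500"],
    ["-", "-", "u", "MSTRG.3", "MSTRG.3.1", "1", "1.5", "1.2", "3.0", "900", "MSTRG.3.1", "0"],
    ["GENE_B", "TX_B", "x", "MSTRG.4", "MSTRG.4.1", "2", "1.1", "1.0", "2.0", "150", "MSTRG.4.1", "800"],
    ["GENE_C", "TX_C", "i", "MSTRG.5", "MSTRG.5.1", "2", "4.0", "3.5", "7.0", "450", "MSTRG.5.1", "1900"],
]
example_tmap = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

print(select_lncrna_candidates(io.StringIO(example_tmap)))
```

With a real file, pass an open file handle instead of the `io.StringIO` wrapper. The surviving IDs can then be used to subset the merged GTF before running gffread and the coding-potential tools.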