Question: Is this a good way to find non-coding transcripts from stranded RNA-seq data?
c_u (United States) wrote, 4 months ago:

I have an Illumina TruSeq Stranded Total RNA library (human). My goal is to find novel (and non-novel) non-coding transcripts in my data (experimental vs control).

After a LOT of Google-fu and asking questions on this website, this is the methodology that I am currently using -

  1. Align the fasta files with STAR to hg38
  2. Assemble transcripts for each sample, merge transcripts from all samples (to get a unified transcriptome that represents all the samples), and estimate transcript abundances - all using StringTie (protocol paper -
  3. Use tximport to infer integer counts from the StringTie transcript abundances and export them to DESeq2.
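
The first two steps above could be driven from a small script. This is only a sketch: it builds the command lines without running them, and the sample names, file paths, thread count, paired-end gzipped FASTQ input and the `--rf` strandness flag are all assumptions for illustration, not details from the post. Step 3 (tximport and DESeq2) happens in R and is not shown.

```python
# Hypothetical inputs: adjust names and paths to your own layout.
SAMPLES = ["ctrl_1", "ctrl_2", "exp_1", "exp_2"]
STAR_INDEX = "star_index_hg38"   # assumed directory made with STAR --runMode genomeGenerate
REF_GTF = "reference.gtf"        # assumed reference annotation
THREADS = "8"


def bam_path(sample):
    """Name STAR gives the sorted BAM when the prefix is '<sample>.'."""
    return f"{sample}.Aligned.sortedByCoord.out.bam"


def star_cmd(sample):
    """Align paired-end gzipped reads and write a coordinate-sorted BAM."""
    return [
        "STAR", "--runThreadN", THREADS,
        "--genomeDir", STAR_INDEX,
        "--readFilesIn", f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz",
        "--readFilesCommand", "zcat",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", f"{sample}.",
    ]


def assemble_cmd(sample):
    """Per-sample StringTie assembly guided by the reference annotation."""
    return ["stringtie", bam_path(sample), "--rf", "-p", THREADS,
            "-G", REF_GTF, "-o", f"{sample}.gtf"]


def merge_cmd(samples):
    """Merge the per-sample assemblies into one transcriptome."""
    return (["stringtie", "--merge", "-p", THREADS, "-G", REF_GTF,
             "-o", "merged.gtf"] + [f"{s}.gtf" for s in samples])


def quantify_cmd(sample):
    """Re-estimate abundances against the merged GTF only (-e), with -B tables."""
    return ["stringtie", bam_path(sample), "--rf", "-e", "-B", "-p", THREADS,
            "-G", "merged.gtf", "-o", f"ballgown/{sample}/{sample}.gtf"]


def build_pipeline(samples):
    """Return all commands in run order: align, assemble, merge, quantify."""
    cmds = [star_cmd(s) for s in samples]
    cmds += [assemble_cmd(s) for s in samples]
    cmds.append(merge_cmd(samples))
    cmds += [quantify_cmd(s) for s in samples]
    return cmds


for cmd in build_pipeline(SAMPLES):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths are real. To my knowledge, the `-B` tables written next to each quantified GTF are what tximport reads in its StringTie mode.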

I wanted to know if this methodology makes sense. Is there any step for which a better method exists? I hope my question is not too broad, given that I do specify the exact pipeline I am employing :)


In a previous question, I discovered that strand information does not matter when mapping with STAR.

(reply written 4 months ago by c_u)
padwalmk wrote, 4 months ago:

Hi, you can follow the pipeline below for a complete analysis:

  1. Align reads to the genome using STAR (already completed).
  2. Run StringTie to assemble transcripts with default parameters (which filter out transcripts below 1 FPKM).
  3. Merge all transcriptome assemblies into one merged assembly (stringtie --merge).
  4. Extract protein-coding genes in GTF format from the Ensembl GTF.
  5. Download already known non-coding RNAs (Ensembl + LNCipedia).
  6. Merge all known non-coding annotations into one GTF file (using cuffmerge).
  7. Use gffcompare to compare the protein-coding and the known non-coding GTFs to your assembled merged GTF file.
  8. Extract the classes "u", "i" and "x" with a bash script (for more details check the gffcompare class codes).
  9. Now extract the MSTRG IDs and pull the matching GTF records using grep or something similar.
  10. Extract a FASTA file for the novel non-coding candidates using gffread.
  11. Filter out transcripts shorter than 200 nt and single-exon transcripts.
  12. Use CPAT and PLEK to predict the coding potential of the novel transcripts from the transcript sequences.
  13. Further filter the transcripts with NCBI BLAST, based on homology to known proteins (use only 3 frames).
  14. This would be your final list of transcripts.
  15. Rerun StringTie with the -e -B options to get abundances as FPKM, coverage and TPM.
  16. Run DESeq2 for differential expression.
  17. Annotate the novel transcripts by looking at nearby genes with the BEDOPS closest-features tool.
  18. Further assess significance by computing the correlation between protein-coding genes and your novel transcripts.
  19. Run gene enrichment analysis on the nearby genes and protein-coding genes to understand the mechanism.
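
The class-code, exon-count and length filters in the list above could be combined in one pass over the gffcompare output. The following is a rough sketch in Python rather than bash. It assumes an annotated GTF in which transcript records carry a `class_code` attribute and exon records carry a `transcript_id`; the function name, the defaults and the file name in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def select_candidates(gtf_lines, keep_codes=("u", "i", "x"),
                      min_len=200, min_exons=2):
    """Return transcript IDs whose class code is in keep_codes, that have
    at least min_exons exons, and whose summed exon length is >= min_len."""
    class_code = {}                    # transcript_id -> class code
    exon_lengths = defaultdict(list)   # transcript_id -> list of exon lengths

    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        if feature == "transcript" and "class_code" in attrs:
            class_code[tid] = attrs["class_code"]
        elif feature == "exon":
            # GTF coordinates are 1-based and inclusive.
            exon_lengths[tid].append(end - start + 1)

    selected = []
    for tid, code in class_code.items():
        exons = exon_lengths.get(tid, [])
        if code in keep_codes and len(exons) >= min_exons and sum(exons) >= min_len:
            selected.append(tid)
    return selected
```

With a real file this would be called as `select_candidates(open("gffcmp.annotated.gtf"))`, where the file name is a placeholder. The returned IDs can then be used to subset the GTF before sequence extraction and coding-potential prediction.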

kristoffer.vittingseerup (European Union) wrote, 4 months ago:

Your approach makes perfect sense (I've written about some of the details of your workflow here if you want to double-check). Now you just need to identify the novel transcripts and determine whether they are coding or not.

Novel transcripts will be called "MSTRG" by StringTie (unless you change it). Known transcripts will be called whatever they are called in the reference you provided (e.g. "ENST" for the human Ensembl reference).
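
A minimal sketch of that split, assuming the transcript IDs have already been read from the merged GTF. The function name and the example IDs are hypothetical; the `novel_prefix` argument covers the case where the default label was changed (StringTie's `-l` option).

```python
def split_novel_known(transcript_ids, novel_prefix="MSTRG"):
    """Separate StringTie-named (novel) transcripts from reference-named ones."""
    tag = novel_prefix + "."
    novel = [t for t in transcript_ids if t.startswith(tag)]
    known = [t for t in transcript_ids if not t.startswith(tag)]
    return novel, known


# Made-up IDs for illustration only.
ids = ["MSTRG.12.1", "ENST00000000001", "MSTRG.12.2", "ENST00000000002"]
novel, known = split_novel_known(ids)
print(len(novel), "novel;", len(known), "known")
```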

As for coding vs non-coding, there are some tools for that; CPAT and CPC2 spring to mind.
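
CPAT and CPC2 are separate tools with their own trained models, and the sketch below is neither of them. It only illustrates one sequence feature that coding-potential predictors commonly use, the longest open reading frame, as a crude first look at a transcript sequence. The function names are hypothetical, only the forward strand is scanned, and only ORFs that start with ATG and end at an in-frame stop codon are counted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in nt (stop codon included) of the longest ATG...stop ORF
    in the three forward reading frames; 0 if there is none."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon of this ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None                   # look for the next ORF in this frame
    return best


def orf_coverage(seq):
    """Fraction of the transcript covered by its longest ORF."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0
```

A long ORF covering most of a transcript hints at coding potential, but a short one does not prove a transcript is non-coding, which is why the dedicated tools and a protein homology search are still worth running.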

Also remember that a lot of datasets have isoform switches, where a coding transcript is used in one condition and there is a switch to a non-coding one in another condition. If you have conditional data, my R package IsoformSwitchAnalyzeR would easily let you analyze such switches (directly using the data you already have).
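
To make the idea of a switch concrete, here is a small sketch that computes each isoform's share of its gene's expression in two conditions and the difference between them. It is not IsoformSwitchAnalyzeR's code and does no statistical testing; the function names, the gene and isoform labels and the expression values are invented for illustration.

```python
def isoform_fractions(expr, gene_of):
    """Isoform fraction = isoform expression / total expression of its gene."""
    gene_total = {}
    for iso, value in expr.items():
        gene = gene_of[iso]
        gene_total[gene] = gene_total.get(gene, 0.0) + value
    return {
        iso: (value / gene_total[gene_of[iso]] if gene_total[gene_of[iso]] > 0 else 0.0)
        for iso, value in expr.items()
    }


def delta_isoform_fraction(expr_a, expr_b, gene_of):
    """Change in isoform fraction from condition A to condition B."""
    if_a = isoform_fractions(expr_a, gene_of)
    if_b = isoform_fractions(expr_b, gene_of)
    return {iso: if_b[iso] - if_a[iso] for iso in if_a if iso in if_b}


# Invented example: one gene with a coding and a non-coding isoform.
gene_of = {"coding_iso": "geneA", "noncoding_iso": "geneA"}
control = {"coding_iso": 90.0, "noncoding_iso": 10.0}
treated = {"coding_iso": 20.0, "noncoding_iso": 80.0}
print(delta_isoform_fraction(control, treated, gene_of))
```

In this toy case the gene's total expression stays the same, so a gene-level test would see nothing, while the fractions show usage moving from the coding to the non-coding isoform.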

Konstantinos Yeles wrote, 4 months ago:

Maybe you could use the workflow of the derfinder package, if you are familiar with recount2 and the Rail-RNA aligner. It is an annotation-agnostic workflow that could probably be used to find novel transcripts.

swbarnes2 (United States) wrote, 4 months ago:

I don't understand why the stringtie --rf or --fr command line options won't do what you need.
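
For reference, StringTie's manual describes `--rf` as assuming an fr-firststrand library and `--fr` as assuming fr-secondstrand. The helper below is a small sketch of picking the flag from a library-type label; the function name and the labels are assumptions for illustration. dUTP-based kits are generally treated as fr-firststrand, but that is worth confirming against your own data.

```python
# Library-type label -> StringTie strandness flag (None means no flag).
STRINGTIE_STRAND_FLAG = {
    "fr-firststrand": "--rf",    # generally the case for dUTP-based stranded kits
    "fr-secondstrand": "--fr",
    "unstranded": None,
}


def stringtie_strand_args(library_type):
    """Return the extra StringTie arguments for a given library type."""
    if library_type not in STRINGTIE_STRAND_FLAG:
        raise ValueError(f"unknown library type: {library_type!r}")
    flag = STRINGTIE_STRAND_FLAG[library_type]
    return [flag] if flag else []


print(stringtie_strand_args("fr-firststrand"))
```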


Thanks swbarnes2! I looked up the --fr option and it does look appropriate for my case. I can modify the question to remove that part. My main question here is about the whole pipeline itself, whether it makes sense. Thank you!!

(reply written 4 months ago by c_u)