Question: Is this a good way to find non-coding transcripts from stranded RNA-seq data?
c_u (United States) wrote 28 days ago:

I have a total RNA Illumina TruSeq Stranded library (human). My goal is to find novel (and non-novel) non-coding transcripts in my data (experimental vs. control).

After a LOT of Google-fu and asking questions on this website, this is the methodology that I am currently using:

  1. Align the FASTQ files to hg38 with STAR.
  2. Assemble transcripts for each sample, merge the transcripts from all samples (to get a unified transcriptome that represents all the samples), and estimate transcript abundances, all using StringTie (protocol paper - https://www.nature.com/articles/nprot.2016.095#procedure).
  3. Use tximport to infer integer counts from the StringTie transcript abundances and pass them to DESeq2.
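
For readers who want to see steps 1 and 2 spelled out, here is a minimal sketch that only builds the command lines (it does not run anything). The file names, the thread count and the helper function names are hypothetical, gzipped paired-end FASTQ input is assumed, and the strand flag has to be supplied by you (`--rf` or `--fr`, depending on the library). Step 3 happens in R (tximport, DESeq2) and is not shown.

```python
from typing import List


def star_align_cmd(genome_dir: str, fastq1: str, fastq2: str,
                   prefix: str, threads: int = 8) -> List[str]:
    """STAR alignment that writes <prefix>Aligned.sortedByCoord.out.bam."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq1, fastq2,
        "--readFilesCommand", "zcat",  # assumption: gzipped FASTQ input
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]


def stringtie_assemble_cmd(bam: str, ref_gtf: str, out_gtf: str,
                           strand_flag: str, threads: int = 8) -> List[str]:
    """Per-sample assembly guided by a reference annotation."""
    return ["stringtie", bam, "-G", ref_gtf, "-o", out_gtf,
            "-p", str(threads), strand_flag]


def stringtie_merge_cmd(ref_gtf: str, merged_gtf: str,
                        sample_gtfs: List[str]) -> List[str]:
    """Merge the per-sample assemblies into one unified transcriptome."""
    return ["stringtie", "--merge", "-G", ref_gtf, "-o", merged_gtf,
            *sample_gtfs]


def stringtie_quant_cmd(bam: str, merged_gtf: str, out_gtf: str,
                        threads: int = 8) -> List[str]:
    """Re-estimate abundances against the merged GTF only (-e), and also
    write Ballgown table files (-B) next to the output GTF."""
    return ["stringtie", bam, "-e", "-B", "-G", merged_gtf,
            "-o", out_gtf, "-p", str(threads)]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` once the paths have been checked.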

I wanted to know if this methodology makes sense. Is there any step for which a better method exists? I hope my question is not too broad, given that I do specify the exact pipeline I am employing :)

rna-seq assembly • 242 views

In a previous question (https://www.biostars.org/p/407788/), I did discover that strand information does not matter when mapping with STAR.

(comment by c_u, 28 days ago)
padwalmk wrote 23 days ago:

Hi, you can follow the pipeline below for a complete analysis:

  1. Align the reads to the genome using STAR (already completed).
  2. Run StringTie to assemble transcripts with default parameters (which filter out transcripts below 1 FPKM).
  3. Merge all transcriptome assemblies into one merged assembly (stringtie --merge).
  4. Extract the protein-coding genes in GTF format from the Ensembl GTF.
  5. Download the already known noncoding RNAs (Ensembl + LNCipedia).
  6. Merge all known noncoding annotations into one GTF file (using cuffmerge).
  7. Use gffcompare to compare the protein-coding GTF and the known noncoding GTF with your merged assembly GTF.
  8. Extract the classes "u", "i" and "x" with a bash script (for more details, check the gffcompare class codes).
  9. Extract the MSTRG IDs and pull the matching GTF records using grep or something similar.
  10. Extract a FASTA file for the novel noncoding candidates using gffread.
  11. Filter out transcripts shorter than 200 nt and single-exon transcripts.
  12. Use CPAT and PLEK to predict the coding potential of the novel transcripts from the transcript sequences.
  13. Further filter the transcripts by homology to known proteins using NCBI BLAST (use only 3 frames).
  14. This is your final list of transcripts.
  15. Rerun StringTie with the -e -B options to get FPKM, coverage and TPM.
  16. Run DESeq2 for differential expression.
  17. Annotate the novel transcripts by looking at nearby genes with the BEDOPS closest-features tool.
  18. Further assess significance by computing the correlation between protein-coding genes and your novel transcripts.
  19. Run gene enrichment analysis on the nearby genes and protein-coding genes to understand the mechanism.
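
The class-code, length and exon-count filters described above can also be done in a few lines of Python instead of bash. This is only a sketch: it assumes the input is the annotated GTF written by gffcompare, where each transcript record carries a `class_code` attribute, and the function names are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List

# Matches GTF attribute pairs such as: transcript_id "MSTRG.1.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the ninth GTF column into a {key: value} dictionary."""
    return dict(ATTR_RE.findall(field))


def select_candidates(gtf_lines: Iterable[str],
                      keep_codes: FrozenSet[str] = frozenset({"u", "i", "x"}),
                      min_length: int = 200,
                      min_exons: int = 2) -> List[str]:
    """Return transcript IDs whose class code is in keep_codes, whose summed
    exon length is at least min_length and which have at least min_exons."""
    class_code: Dict[str, str] = {}
    exon_count: Dict[str, int] = defaultdict(int)
    exon_bases: Dict[str, int] = defaultdict(int)

    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        if cols[2] == "transcript":
            class_code[tid] = attrs.get("class_code", "")
        elif cols[2] == "exon":
            exon_count[tid] += 1
            # GTF coordinates are 1-based and inclusive.
            exon_bases[tid] += int(cols[4]) - int(cols[3]) + 1

    return sorted(
        tid for tid, code in class_code.items()
        if code in keep_codes
        and exon_count[tid] >= min_exons
        and exon_bases[tid] >= min_length
    )
```

The returned IDs can then be used to pull the matching records out of the merged GTF before running gffread.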

kristoffer.vittingseerup (European Union) wrote 26 days ago:

Your approach makes perfect sense (I've written about some of the details of your workflow here if you want to double-check). Now you just need to identify the novel transcripts and determine whether they are coding or not.

Novel transcripts will be called "MSTRG" by StringTie (unless you change it). Known transcripts will be called whatever they are called in the reference you provided (e.g. "ENST" for the human Ensembl reference).
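
As a small illustration of that naming convention, the sketch below splits a list of transcript IDs into novel and known ones by prefix. The function name and the example IDs are made up, and the prefix is a parameter because StringTie lets you change it.

```python
from typing import Dict, Iterable, List


def split_by_origin(transcript_ids: Iterable[str],
                    novel_prefix: str = "MSTRG") -> Dict[str, List[str]]:
    """Group transcript IDs into 'novel' (assembler-named, e.g. MSTRG.12.3)
    and 'known' (named by the reference annotation, e.g. ENST...)."""
    groups: Dict[str, List[str]] = {"novel": [], "known": []}
    for tid in transcript_ids:
        key = "novel" if tid.startswith(novel_prefix + ".") else "known"
        groups[key].append(tid)
    return groups


ids = ["MSTRG.12.3", "ENST00000000001", "MSTRG.7.1"]  # made-up IDs
groups = split_by_origin(ids)
```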

As for coding vs. non-coding, there are some tools for that; CPAT and CPC2 spring to mind.
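
If you run more than one coding-potential tool, one conservative option is to keep only the transcripts that every tool calls non-coding. The sketch below assumes you have already parsed each tool's output into a dictionary yourself; it does not read the tools' real output files, and the function names, IDs and scores are invented. The 0.364 cutoff is, as far as I recall, the human value suggested in the CPAT documentation, so please verify it for your version and species.

```python
from typing import Dict, List


def calls_from_probabilities(probs: Dict[str, float],
                             cutoff: float) -> Dict[str, bool]:
    """Convert coding probabilities into is_coding calls (True = coding)."""
    return {tid: p >= cutoff for tid, p in probs.items()}


def consensus_noncoding(calls_by_tool: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return transcripts scored by every tool and called coding by none."""
    if not calls_by_tool:
        return []
    shared = set.intersection(*(set(calls) for calls in calls_by_tool.values()))
    return sorted(
        tid for tid in shared
        if not any(calls[tid] for calls in calls_by_tool.values())
    )


# Invented example values, for illustration only.
cpat_calls = calls_from_probabilities(
    {"MSTRG.1.1": 0.02, "MSTRG.2.1": 0.91, "MSTRG.3.1": 0.10},
    cutoff=0.364,  # assumption: human cutoff from the CPAT docs, verify it
)
cpc2_calls = {"MSTRG.1.1": False, "MSTRG.2.1": True, "MSTRG.3.1": True}
noncoding = consensus_noncoding({"CPAT": cpat_calls, "CPC2": cpc2_calls})
```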

Also remember that a lot of datasets have isoform switches, where a coding transcript is used in one condition and there is a switch to a non-coding transcript in another condition. If you have conditional data, my R package IsoformSwitchAnalyzeR would easily let you analyze such switches (directly using the data you already have).
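
To make the idea of a switch concrete: for one gene, compute each isoform's fraction of the gene's total expression in both conditions and look for isoforms whose fraction moves in opposite directions. This is a toy sketch of that arithmetic only, not how IsoformSwitchAnalyzeR is implemented, and it does no statistics; the function names, the 0.1 threshold and the expression values are assumptions.

```python
from typing import Dict, List


def isoform_fractions(expr: Dict[str, float]) -> Dict[str, float]:
    """Isoform fraction = isoform expression / total expression of the gene."""
    total = sum(expr.values())
    if total == 0:
        return {iso: 0.0 for iso in expr}
    return {iso: value / total for iso, value in expr.items()}


def candidate_switch(cond_a: Dict[str, float], cond_b: Dict[str, float],
                     min_dif: float = 0.1) -> Dict[str, List[str]]:
    """Report isoforms whose fraction rises or falls by at least min_dif
    from condition A to B, but only if both directions are present."""
    frac_a = isoform_fractions(cond_a)
    frac_b = isoform_fractions(cond_b)
    difs = {iso: frac_b.get(iso, 0.0) - frac_a.get(iso, 0.0)
            for iso in set(frac_a) | set(frac_b)}
    up = sorted(iso for iso, d in difs.items() if d >= min_dif)
    down = sorted(iso for iso, d in difs.items() if d <= -min_dif)
    if up and down:
        return {"up": up, "down": down}
    return {"up": [], "down": []}


# Invented TPM-like values for the isoforms of a single gene.
control = {"ENST_coding": 90.0, "MSTRG.5.1": 10.0}
treated = {"ENST_coding": 30.0, "MSTRG.5.1": 70.0}
switch = candidate_switch(control, treated)
```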

Konstantinos Yeles (Italy) wrote 26 days ago:

Maybe you could use the workflow of the derfinder package, if you are familiar with recount2 and the Rail-RNA aligner.

It is an annotation-agnostic workflow that could probably be used to find novel transcripts.
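
The core idea of an annotation-agnostic approach is to call "expressed regions" directly from base-level coverage, without consulting any gene model. Below is a toy sketch of that idea in plain Python; it is not derfinder's code, and the cutoff, minimum width and coverage values are invented.

```python
from typing import List, Sequence, Tuple


def expressed_regions(coverage: Sequence[float], cutoff: float,
                      min_width: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index intervals of consecutive
    positions whose coverage is strictly above the cutoff."""
    regions: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > cutoff:
            if start is None:
                start = pos  # a new region opens here
        elif start is not None:
            if pos - start >= min_width:
                regions.append((start, pos))
            start = None
    # Close a region that runs to the end of the coverage vector.
    if start is not None and len(coverage) - start >= min_width:
        regions.append((start, len(coverage)))
    return regions


cov = [0, 0, 5, 6, 7, 0, 1, 9, 9]  # invented per-base coverage
regions = expressed_regions(cov, cutoff=2)
```

Regions found this way that do not overlap any annotated exon would be the candidates for novel transcription.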

swbarnes2 (United States) wrote 28 days ago:

I don't understand why the stringtie --rf or --fr command line options won't do what you need.
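
For reference, StringTie's `--rf` corresponds to an fr-firststrand library (dUTP-based kits such as Illumina TruSeq Stranded produce this type) and `--fr` to fr-secondstrand. A small lookup like the sketch below keeps that mapping in one place; the dictionary and function names are hypothetical. If you are unsure which type your library is, it is worth confirming on the aligned reads with a tool such as RSeQC's infer_experiment.py.

```python
from typing import Dict, List, Optional

# Library type -> StringTie strand option (None means no option is passed).
STRAND_FLAGS: Dict[str, Optional[str]] = {
    "fr-firststrand": "--rf",   # e.g. dUTP-based stranded protocols
    "fr-secondstrand": "--fr",
    "unstranded": None,
}


def stringtie_strand_args(library_type: str) -> List[str]:
    """Return the extra StringTie arguments for a given library type."""
    try:
        flag = STRAND_FLAGS[library_type]
    except KeyError:
        raise ValueError(f"unknown library type: {library_type!r}") from None
    return [flag] if flag else []
```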


Thanks swbarnes2! I looked up the --fr option and it does look appropriate for my case. I can modify the question to remove that part. My main question here is about the whole pipeline itself, whether it makes sense. Thank you!!

(reply by c_u, 28 days ago)