RNAseq - how to find novel transcripts under different treatments using StringTie and gffcompare?
4.8 years ago
jingjin2203 ▴ 60

Hi all,

I was wondering how to find novel transcripts under different treatment conditions using stringTie?

I have 4 treatments with 3 replicates each in my RNAseq data. I merged the 12 GTF files generated by StringTie and compared the merged GTF to the reference GTF file using gffcompare. However, I am not sure what to do if I want to find novel transcripts in the different treatments. Should I combine the GTF files from the 3 replicates of each treatment and compare each combined GTF to the reference GTF file? How can I make a comparison across the 4 treatments? Does that turn the question into one of finding DEGs between treatments?

Hope my silly questions make sense.

Thank you for your attention and help!

RNA-Seq transcripts • 3.8k views
4.8 years ago
rmash ▴ 20

The StringTie pipeline generates a merged GTF, which you then use as the reference to get transcript count tables for all your samples.
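
For the gffcompare side of the question, here is a minimal sketch of pulling candidate novel transcripts out of the annotated GTF by class code. It assumes gffcompare was run against a reference (-r), so that transcript lines carry a class_code attribute. The function name novel_transcripts and the default choice of codes (u = intergenic, j = multi-exon transcript sharing at least one splice junction with a reference transcript) are illustrative assumptions, not something prescribed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list transcripts whose gffcompare class code marks them as
# candidate novel transcripts. The default code set "uj" is an assumption.

# novel_transcripts GTF [CODES]
# Prints "transcript_id<TAB>class_code" for every transcript line whose
# class_code is one of the characters in CODES.
novel_transcripts() {
    local gtf="$1"
    local codes="${2:-uj}"
    awk -F'\t' -v codes="$codes" '
        # Only transcript lines carry class_code in the annotated GTF.
        $3 == "transcript" && match($9, /class_code "[^"]+"/) {
            # Skip the 12-character prefix and the closing quote.
            code = substr($9, RSTART + 12, RLENGTH - 13)
            if (index(codes, code) && match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART + 15, RLENGTH - 16)
                print id "\t" code
            }
        }
    ' "$gtf"
}
```

Something like `novel_transcripts merged.annotated.gtf uj > novel_ids.tsv` would then give a list of IDs to look up in the per-sample quantification. Which class codes count as "novel" is a judgment call for your own data.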

Thank you for your kind help! I really appreciate it! Do I have to use DESeq2 to get transcript count tables?

You have to run StringTie again on the merged GTF file. You can see the suggested DE pipeline for StringTie here along with more instructions on how to do the analysis.

Thank you, Kristoffer! So, basically the process would be the same for identifying novel transcripts and DEGs, is that correct?

Yes, except that when you run on the merged GTF you run StringTie with the -eB options, so that it only quantifies the isoforms in the GTF file (i.e. it does not look for new features this time around). Also note that, instead of using the Python script they provide for extracting the quantification, you can use tximport or IsoformSwitchAnalyzeR's importIsoformExpression() from within R.
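
Once each sample has been requantified with -eB, the t_data.ctab tables written by -B can give a quick view of which transcripts are detected in which treatment. This is a minimal sketch, assuming the ctab header contains columns named t_name and FPKM (they are looked up by name rather than by position). The function name expressed_count, the quant/<sample>/ layout in the usage note, and the FPKM cutoff are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count, per transcript, in how many of the given ctab files its
# FPKM exceeds a cutoff. Function name and cutoff are illustrative.

# expressed_count CUTOFF CTAB...
# Prints "t_name<TAB>number_of_files_above_cutoff", sorted by transcript name.
expressed_count() {
    local cutoff="$1"
    shift
    awk -F'\t' -v cutoff="$cutoff" '
        FNR == 1 {
            # Header line of each file: find the columns by name.
            name_col = fpkm_col = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "t_name") name_col = i
                if ($i == "FPKM")   fpkm_col = i
            }
            if (!name_col || !fpkm_col) { bad = 1; exit 1 }
            next
        }
        ($fpkm_col + 0) > (cutoff + 0) { n[$name_col]++ }
        END { if (!bad) for (t in n) print t "\t" n[t] }
    ' "$@" | LC_ALL=C sort
}
```

For example, `expressed_count 0 quant/T1_rep*/t_data.ctab` would list each transcript with the number of T1 replicates in which it is above the cutoff. Running it once per treatment and comparing the lists shows presence or absence only; it is not a statistical test, so the DE tools mentioned above are still needed for that.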

Thanks a lot for your help! I really appreciate it! I just wanted to make sure I understand everything correctly: since my goal is to detect novel transcripts, -e (only estimate the abundance of given reference transcripts) should not be used, is that correct? Sorry about the silly questions, and thanks again for your time and help!

You are right that the first time you run StringTie you do not want to use the -e option. By not using -e, StringTie will predict novel transcripts. Afterwards you use StringTie --merge to combine the individual StringTie predictions into one set representing the known and novel transcripts from all samples. Then, to ensure you have quantified the same transcripts in all samples (otherwise they are not comparable), you run StringTie again on each sample using the GTF from the --merge run and with the -e option. The -e option ensures you only quantify the transcripts in the GTF file, but since that GTF file contains both the known and the novel transcripts, this is exactly what you want. You can also read more about it here under option "B".
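
The three passes described above could be sketched roughly as follows. The sample names, the bam/, assembly/ and quant/ directories, reference.gtf and the function name stringtie_workflow are all hypothetical, and the BAM files are assumed to be coordinate-sorted. By default the function only prints the commands (a dry run); it executes them only when RUN is set to an empty string.

```shell
#!/usr/bin/env bash
# Sketch of the assemble -> merge -> requantify workflow.
# All paths and names are placeholders; sample names must not contain spaces.

# stringtie_workflow SAMPLE...
stringtie_workflow() {
    local run="${RUN-echo}"            # dry run unless RUN is set to ""
    local ref="${REF:-reference.gtf}"  # hypothetical reference annotation
    local s gtfs=""

    # Pass 1: assemble each sample WITHOUT -e so novel transcripts are predicted.
    for s in "$@"; do
        $run mkdir -p assembly "quant/$s"
        $run stringtie "bam/$s.bam" -G "$ref" -l "$s" -o "assembly/$s.gtf"
        gtfs="$gtfs assembly/$s.gtf"
    done

    # Pass 2: merge all per-sample assemblies into one transcript set.
    # $gtfs is intentionally unquoted so it splits into one argument per file.
    $run stringtie --merge -G "$ref" -o merged.gtf $gtfs

    # Pass 3: requantify every sample against merged.gtf.
    # -e restricts quantification to transcripts in merged.gtf;
    # -B also writes the Ballgown-style *.ctab tables next to the output GTF.
    for s in "$@"; do
        $run stringtie "bam/$s.bam" -e -B -G merged.gtf -o "quant/$s/$s.gtf"
    done
}
```

Calling `stringtie_workflow T1_rep1 T1_rep2 T1_rep3` prints the commands for three placeholder samples, which makes it easy to check paths before running anything for real.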

This is really helpful!! Thank you so much for help!!

Hi Kristoffer, sorry to keep bugging you. I have successfully generated ctab files for each of my samples following your advice. I tried to analyze the data using IsoformSwitchAnalyzeR but ran into an issue (output below). Could you help me fix it? I also have a question about which GTF file should be used as isoformExonAnnoation: the original one I used for the very first StringTie run, or the merged GTF file from StringTie --merge? Thanks!!

aSwitchList <- importRdata(
    isoformCountMatrix   = stringTieQuant$counts,
    isoformRepExpression = stringTieQuant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = "merged.annotated.gtf",
    isoformNtFasta       = "scaffolds.fasta",
    showProgress         = FALSE
)
Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
    importing GTF (this may take a while)
Step 3 of 6: Calculating gene expression and isoform fraction...
    9520 ( 19.63%) isoforms were removed since they were not expressed in any samples.
Error in sample.int(length(x), size, replace, prob) :
  invalid first argument
In addition: Warning message:
In importRdata(isoformCountMatrix = stringTieQuant$counts, isoformRepExpression = stringTieQuant$abundance, :
  No CDS annotation was found in the GTF files meaning ORFs could not be annotated.
  (But ORFs can still be predicted with the analyzeORF() function)

I read ?importRdata in R; it says isoformExonAnnoation can be "A string indicating the full path to the (gziped or unpacked) GTF file which have been quantified". What is a quantified GTF file?

Hi Jingjin. Since it's not good practice to ask new questions as a comment (it makes it impossible for other people to find the answers), could you ask this as a new question? Alternatively, you can use the IsoformSwitchAnalyzeR Google group. Before you post, however, you need to make sure you have IsoformSwitchAnalyzeR >1.5.11.

Thanks, Kristoffer! Will do. And yes, I have IsoformSwitchAnalyzeR 1.6.0 installed.
