Stringtie to DESeq2
2
5
Entering edit mode
3.8 years ago
gozrom ▴ 80

Hi, there, I've generated gtf files with Stringie as been explained in Hisat Stringtie Ballgown paper by pertea 2016 doi: 10.1038/nprot.2016.095 https://www.ncbi.nlm.nih.gov/pubmed/27560171

Now, since ballgown doesn't, currently support TPM, I wanted to use DESeq2 in R to do DE analysis,

First I need to extract reads from Stringie generated gtf files.

I found this nice page that explains how to do that with python script that they also provide: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

Now, when I run the python script with -i option and provide the list path to python, or without the list option I get an error

python prepDE.py -i phenotype_2.txt
File "prepDE.py", line 32 print "Error: Text file with sample ID and path invalid (%s)" % (line.strip()) ^ SyntaxError: invalid syntax

The script can be checked here: http://ccb.jhu.edu/software/stringtie/dl/prepDE.py

So it doesn't recognize the list and it doesn't see the folders with gtf files, although it is build according to the paper.

Any suggestions are appreciated, thank you.

RNA-Seq R • 8.9k views
ADD COMMENT
1
Entering edit mode

Not a direct answer to the problem mentioned in your question, but, why not just use your StringTie-generated GTF with your original FASTQ files in order to determine raw read count abundances using Kallisto or Salmon, or use the StringtTie-generated GTF with your aligned BAMs and do the same with featureCounts or BEDTools coverage. Then, input the raw read count abundances into DESeq2 for normalisation and differential expression analysis...

Kevin

ADD REPLY
0
Entering edit mode

That's a nice option I thought about that. Just don't know how it works yet, need to check the documentation on any of those tools, I thought using htseq with the hisat aligned files but I did not find that htseq output TPM too... so basically it's easier to use something that I already made and not to run tools that I don't know and troubleshoot them most of the time till I get the results...I'll do that if I don't have a choice...

ADD REPLY
0
Entering edit mode

Do you definitively need TPM counts?

HTSeq, Kallisto, Salmon, etc., produce raw counts, which are then input to and normalised in EdgeR or DESeq2.

ADD REPLY
0
Entering edit mode

I definitely need TPM counts for single replicates, raw counts are fine to use in DESeq2, for DE purposes only, but when presenting the data you present read counts, and since TPM is more standardized across replicates I do need those and not FPKM

ADD REPLY
0
Entering edit mode

But EdgeR will normalise based on TMM (trimmed mean of M), whilst DESeq2 performs a 'geometric' normalisation. Both programs produce data that is suitable for visualisation, too.

ADD REPLY
0
Entering edit mode

Alright, that's fine, I still need to solve the issue of how to extract the counts from Stringtie, I haven't got yet to the DESeq2 point, so it's not an issue right now, extracting the counts from Stringtie files will produces CSV file with all the features...

ADD REPLY
3
Entering edit mode

Just run featureCounts with the GTF produced by stringTie and call it done. Then you know the counting was done correctly and DESeq2/edgeR/whatever will be happy.

ADD REPLY
0
Entering edit mode

What does your phenotype_2.txt file look like? The error seems to be saying that it is improperly formatted. It should look like the sample on the page you linked, with sample ID and full path separated by a tab.

ADD REPLY
0
Entering edit mode

It is looks like that, And this error happens when the list is not supplied as well, thanks for the reply though...

`

ADD REPLY
0
Entering edit mode

Can you paste a few lines of your phenotype_2.txt?

**EDIT nevermind, look at Kevin's response below. You are using Python3+ and the script is meant for python2.7.

ADD REPLY
1
Entering edit mode

amusingly it says:

MIN_PYTHON = (2, 7)
if sys.version_info < MIN_PYTHON:
    sys.exit("Python %s.%s or later is required.\n" % MIN_PYTHON)

but clearly not that "later" 3.0 won't work.

And since there never will be a Python 2.8 this is effectively a Python 2.7 only script.

For the OP install and run it with Python 2 (use conda or some other environment manager to keep the different python's from interfering)

ADD REPLY
0
Entering edit mode

FeatureCounts is not available for windows pc and works only with sam and bam files https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts http://subread.sourceforge.net/

Thanks for all the workaround suggestions, I'm trying to solve one issue to use python provided script to extract the reads from stringtie generated gtf file, please don't suggest any more workarounds...

ADD REPLY
0
Entering edit mode

Okay, I presume that your input file follows the recommended format given on the page to which you've linked: http://ccb.jhu.edu/software/stringtie/dl/sample_lst.txt ?

That Python script is [hopefully] just doing exactly the same as what both Devon and I have been suggesting, i.e., to derive raw counts from your files and using your StringTie-generated GTF as reference, but it does it in an automated way. DESeq2, then, will just produce counts normalised by geometic mean and not TPM.

You should already have TPM counts in your StringTie output.

ADD REPLY
0
Entering edit mode

Yes, TPM counts are there. Thank you I knew that.

ADD REPLY
2
Entering edit mode
ADD REPLY
1
Entering edit mode

@Kevin Blighe, that seem to work with python 2.7, I have other issue, that it doesn't see gtf file, I'll try troubleshooting it myself first, Thanks for the help.

ADD REPLY
1
Entering edit mode

@Kevin Great it works. Thank you

ADD REPLY
0
Entering edit mode

Hello dear, i am very new to Linux and stringtie pipeline for RNA seq. i finished the stringtie protocol upto yhe step 6| Estimate transcript abundances and create table counts for Ballgown, after that we can switch to DEseq for differential expression analysis. i much tried but donot understand how to do this steps as mentioned. There are two possible ways to provide input: one option is to provide a directory with the same structure as created in the StringTie protocol paper in preparation for Ballgown. By default (no -i option), this is located in the working directory and named ballgown (see below). ./ballgown/sample1/sample1.gtf ./ballgown/sample2/sample2.gtf ./ballgown/sample3/sample3.gtf

Alternatively, one can provide a textfile listing sample IDs and their respective paths (sample_lst.txt). kindly guide me. i shall be very thankfull

ADD REPLY
0
Entering edit mode

Please do not add multiple reactions in old threads. The best would probably be to open a separate question for this.

ADD REPLY
0
Entering edit mode

Hello anyone can provide the R script for DEseq2 for differential expression analysis using after the stringtie??? thanks in advance

ADD REPLY
0
Entering edit mode

Please do not add multiple reactions in old threads. The best would be to read the DESeq2 manual. There is a lot of documentation. If you have questions after that you should open a separate question for this.

ADD REPLY
3
Entering edit mode
3.0 years ago

You can simply use tximport() to import the StringTie quantification data into R - under the hood it does the same calculations as the python script provided by on the StringTie homepage. The appropriate section of the tximport vignette can be found here.

Please note:

  1. With regards to consideration of how to do quantification I wrote a discussed here.
  2. Remember that with Isoform level analysis such as the one you have here also allows you to do analysis of e.g. isoform switches - something my R package IsoformSwitchAnalyzeR can help you with. You can find examples of what type of analysis you can do in this section of the vignette.
ADD COMMENT

Login before adding your answer.

Traffic: 1813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6