Question: Stringtie to DESeq2
2
gravatar for gozrom
12 months ago by
gozrom40
gozrom40 wrote:

Hi, there, I've generated gtf files with Stringie as been explained in Hisat Stringtie Ballgown paper by pertea 2016 doi: 10.1038/nprot.2016.095 https://www.ncbi.nlm.nih.gov/pubmed/27560171

Now, since ballgown doesn't, currently support TPM, I wanted to use DESeq2 in R to do DE analysis,

First I need to extract reads from Stringie generated gtf files.

I found this nice page that explains how to do that with python script that they also provide: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

Now, when I run the python script with -i option and provide the list path to python, or without the list option I get an error

python prepDE.py -i phenotype_2.txt
File "prepDE.py", line 32 print "Error: Text file with sample ID and path invalid (%s)" % (line.strip()) ^ SyntaxError: invalid syntax

The script can be checked here: http://ccb.jhu.edu/software/stringtie/dl/prepDE.py

So it doesn't recognize the list and it doesn't see the folders with gtf files, although it is build according to the paper.

Any suggestions are appreciated, thank you.

rna-seq R • 2.9k views
ADD COMMENTlink modified 3 months ago by kashifahmad7500 • written 12 months ago by gozrom40

What does your phenotype_2.txt file look like? The error seems to be saying that it is improperly formatted. It should look like the sample on the page you linked, with sample ID and full path separated by a tab.

ADD REPLYlink written 12 months ago by Russ450

It is looks like that, And this error happens when the list is not supplied as well, thanks for the reply though...

`

ADD REPLYlink modified 12 months ago • written 12 months ago by gozrom40

Can you paste a few lines of your phenotype_2.txt?

**EDIT nevermind, look at Kevin's response below. You are using Python3+ and the script is meant for python2.7.

ADD REPLYlink modified 12 months ago • written 12 months ago by Damian Kao15k
1

amusingly it says:

MIN_PYTHON = (2, 7)
if sys.version_info < MIN_PYTHON:
    sys.exit("Python %s.%s or later is required.\n" % MIN_PYTHON)

but clearly not that "later" 3.0 won't work.

And since there never will be a Python 2.8 this is effectively a Python 2.7 only script.

For the OP install and run it with Python 2 (use conda or some other environment manager to keep the different python's from interfering)

ADD REPLYlink modified 12 months ago • written 12 months ago by Istvan Albert ♦♦ 80k

Not a direct answer to the problem mentioned in your question, but, why not just use your StringTie-generated GTF with your original FASTQ files in order to determine raw read count abundances using Kallisto or Salmon, or use the StringtTie-generated GTF with your aligned BAMs and do the same with featureCounts or BEDTools coverage. Then, input the raw read count abundances into DESeq2 for normalisation and differential expression analysis...

Kevin

ADD REPLYlink written 12 months ago by Kevin Blighe41k

That's a nice option I thought about that. Just don't know how it works yet, need to check the documentation on any of those tools, I thought using htseq with the hisat aligned files but I did not find that htseq output TPM too... so basically it's easier to use something that I already made and not to run tools that I don't know and troubleshoot them most of the time till I get the results...I'll do that if I don't have a choice...

ADD REPLYlink written 12 months ago by gozrom40

Do you definitively need TPM counts?

HTSeq, Kallisto, Salmon, etc., produce raw counts, which are then input to and normalised in EdgeR or DESeq2.

ADD REPLYlink written 12 months ago by Kevin Blighe41k

I definitely need TPM counts for single replicates, raw counts are fine to use in DESeq2, for DE purposes only, but when presenting the data you present read counts, and since TPM is more standardized across replicates I do need those and not FPKM

ADD REPLYlink modified 12 months ago • written 12 months ago by gozrom40

But EdgeR will normalise based on TMM (trimmed mean of M), whilst DESeq2 performs a 'geometric' normalisation. Both programs produce data that is suitable for visualisation, too.

ADD REPLYlink modified 11 months ago • written 12 months ago by Kevin Blighe41k

Alright, that's fine, I still need to solve the issue of how to extract the counts from Stringtie, I haven't got yet to the DESeq2 point, so it's not an issue right now, extracting the counts from Stringtie files will produces CSV file with all the features...

ADD REPLYlink written 12 months ago by gozrom40
2

Just run featureCounts with the GTF produced by stringTie and call it done. Then you know the counting was done correctly and DESeq2/edgeR/whatever will be happy.

ADD REPLYlink written 12 months ago by Devon Ryan89k

FeatureCounts is not available for windows pc and works only with sam and bam files https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts http://subread.sourceforge.net/

Thanks for all the workaround suggestions, I'm trying to solve one issue to use python provided script to extract the reads from stringtie generated gtf file, please don't suggest any more workarounds...

ADD REPLYlink written 12 months ago by gozrom40

Okay, I presume that your input file follows the recommended format given on the page to which you've linked: http://ccb.jhu.edu/software/stringtie/dl/sample_lst.txt ?

That Python script is [hopefully] just doing exactly the same as what both Devon and I have been suggesting, i.e., to derive raw counts from your files and using your StringTie-generated GTF as reference, but it does it in an automated way. DESeq2, then, will just produce counts normalised by geometic mean and not TPM.

You should already have TPM counts in your StringTie output.

ADD REPLYlink written 12 months ago by Kevin Blighe41k

Yes, TPM counts are there. Thank you I knew that.

ADD REPLYlink written 12 months ago by gozrom40
2

Take a look here: Stringtie prepDE.py Error line 32

ADD REPLYlink written 12 months ago by Kevin Blighe41k
1

@Kevin Blighe, that seem to work with python 2.7, I have other issue, that it doesn't see gtf file, I'll try troubleshooting it myself first, Thanks for the help.

ADD REPLYlink written 12 months ago by gozrom40
1

@Kevin Great it works. Thank you

ADD REPLYlink written 12 months ago by gozrom40

Hello dear, i am very new to Linux and stringtie pipeline for RNA seq. i finished the stringtie protocol upto yhe step 6| Estimate transcript abundances and create table counts for Ballgown, after that we can switch to DEseq for differential expression analysis. i much tried but donot understand how to do this steps as mentioned. There are two possible ways to provide input: one option is to provide a directory with the same structure as created in the StringTie protocol paper in preparation for Ballgown. By default (no -i option), this is located in the working directory and named ballgown (see below). ./ballgown/sample1/sample1.gtf ./ballgown/sample2/sample2.gtf ./ballgown/sample3/sample3.gtf

Alternatively, one can provide a textfile listing sample IDs and their respective paths (sample_lst.txt). kindly guide me. i shall be very thankfull

ADD REPLYlink written 3 months ago by kashifahmad7500

Please do not add multiple reactions in old threads. The best would probably be to open a separate question for this.

ADD REPLYlink written 3 months ago by WouterDeCoster38k

Hello anyone can provide the R script for DEseq2 for differential expression analysis using after the stringtie??? thanks in advance

ADD REPLYlink written 3 months ago by kashifahmad7500

Please do not add multiple reactions in old threads. The best would be to read the DESeq2 manual. There is a lot of documentation. If you have questions after that you should open a separate question for this.

ADD REPLYlink written 3 months ago by WouterDeCoster38k
3
gravatar for kristoffer.vittingseerup
3 months ago by
European Union
kristoffer.vittingseerup1.7k wrote:

You can simply use tximport() to import the StringTie quantification data into R - under the hood it does the same calculations as the python script provided by on the StringTie homepage. The appropriate section of the tximport vignette can be found here.

Please note:

  1. With regards to consideration of how to do quantification I wrote a discussed here.
  2. Remember that with Isoform level analysis such as the one you have here also allows you to do analysis of e.g. isoform switches - something my R package IsoformSwitchAnalyzeR can help you with. You can find examples of what type of analysis you can do in this section of the vignette.
ADD COMMENTlink written 3 months ago by kristoffer.vittingseerup1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 888 users visited in the last hour