Question: From StringTie to DESeq2
0
gravatar for Morris_Chair
4 weeks ago by
Morris_Chair90
ISBReMIT
Morris_Chair90 wrote:

Dear all, After generating bunch of files with stringTie command line, I want to use DESeq2 for differential expression analysis. I understand that I have to create a matrix of read counts and the tool to use is prepDE.py,

prepDE.py -i list_gtf.txt

my output is

0 Treated_1
1 Treated_2
Traceback (most recent call last):
  File "/home/miniconda2/bin/prepDE.py", line 257, in <module>
    geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'Treated_2'

Here is my list

[@ws7910 Files.GTF]$ cat list_gtf.txt
Treated_1 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_1.gtf
Treated_2 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_2.gtf
Treated_3 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_3.gtf
Untreated_1 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_1.gtf
Untreated_2 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_2.gtf
Untreated_3 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_3.gtf

The stringTie website says that for this purpose I have to use “files generated by StringTie (run with the -e parameter).” I used few command lines with stringtie, which one is it the one with that I was supposed to use the –e parameter? probably the error is due to that ..

my python version is Python 2.7.5

Thanks for support

rna-seq • 235 views
ADD COMMENTlink modified 4 weeks ago by Charles Warden6.6k • written 4 weeks ago by Morris_Chair90
1

From your stringtie output (coverage of transcripts) use tximport to summarize to the gene level, see the manual, and then use from DESeq2 DESeqDataSetFromTximport(). Manual of DESeq2, read it thoroughly, it is outstandingly comprehensive.

ADD REPLYlink written 4 weeks ago by ATpoint15k
1

See also Stringtie to DESeq2 and remember to use both google and the search function. These standard questions have been asked many times before. In my experience you learn most (and in a sustainable fashion) if you solve these lowlevel questions yourself by extensive web research. Spoonfeeding it convenient but won't help you in the long term. There are many good resources like blogs and forum posts out there in the web, use them ;-)

ADD REPLYlink written 4 weeks ago by ATpoint15k

HI ATpoint,

I have with me the paper "Transcript-level epxression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown", and for generating the table counts, shows the same command line of the link posted from you.

stringtie -eB -G transcripts.gff <source_file.bam>

What looks weird to me is that I have to create all this files .ctab that are named the same way for each of the bam files and at the end I wonder how can we distinguish one sample to another.

I agree with you when you say to use google for search and believe me that I do so before posting here, often they are very helpful but some other time very confusing. To me asking is also a way to be more confident that I’m doing ok

Thank you for the answer

ADD REPLYlink written 4 weeks ago by Morris_Chair90
2

I haven't used ballgownor stringtie in a while because for well-annotated transcriptomes I never felt the need to assemble transcripts. You might consider quantifying your reads with salmon, then aggregate to gene level with tximport and test for differential expression with DESeq2. This is a well-supported and recommended pipeline, with low computational costs and pretty fast plus salmon has features to correct for GC bias. It is my personal go-to pipeline for standard RNA-seq.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by ATpoint15k

I tried salmon but I got low percentage of mapped reads for a sample, then I decided to go with stringtie.. I don't think to use ballgown anyways

I have an OT: Do you think that DESeq2 and cummRbund do the same job? for example, can I with DESeq2 see the differential expression of specific genes of my interest ? or just the top 10 example

Thanks

ADD REPLYlink written 4 weeks ago by Morris_Chair90
1

If you have low mapping rate this is not a problem with salmon but with your library. If you get higher percentage with hisat2 you should check why that is. Could well be that you have a lot of genomic DNA or rRNA contaminations which salmon will putput as unmapped. read_distribution.py from RSeQC is an option to get an idea where your reads mapped (UTR, exon, intron etc...). What is low percentage? And what is the read length (is it single-or paired-end)? Read length can have a big influence on mapping rate. As for cummRbund it is a downstream tool of cufflinks. I never used it so I cannot give much input what exactly it does but I would stick with the well-maintained DESeq2. If you want to use cummRbund you would need to re-run everything with cufflinks. Do you have human data, and what is the final goal? Differential analysis on the gene level, isoform detection, novel transcripts, alternative splicing?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by ATpoint15k

Hi ATpoint, I answered to my self at some of your question, with low percentage I mean less the 20% of mapping unlike other tools with over 70% and the read lengh=101.
 I have human data, the final goal is to see which genes are upregulated or downregulate along with a set of genes of my interest (that’s why I was asking if DESeq2 can do that). 
Can I detect different isoform expressed from a gene with salmon and DESeq2? this is another aim

To be honest with you at the moment is not a big deal because I’m styding and practicing with RNAseq files that I had in my lab, the files for the actual experiment will come soon and I want to be ready to process’em easily.

ADD REPLYlink written 4 weeks ago by Morris_Chair90

Hi ATpoint, although I get low percentage of mapping rate with salmon, I still want to pursue with it

I have found this way for txt import

files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)
head(txi.salmon$counts)

my question is, do you rename the quant.sf files before compressing them? otherwise how do you know which belong to each condition?

thank you

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Morris_Chair90
1

read this strintie`s tutorial to know how exactly run prepDE.py

ADD REPLYlink written 4 weeks ago by Buffo1.5k

that's what I read and that's why there were the name of the samples in the first colum, just in case you need in the future they are needed ;)

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Morris_Chair90

Delete the first colum, the list must specify just the route to each gtf

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Buffo1.5k

Hi Buffo, thanks for the answer but I get errors if I delete the first colum

Error: Text file with sample ID and path invalid (/home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_1.gtf)
[@ws7910 Files.GTF]$
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Morris_Chair90
1
gravatar for Charles Warden
4 weeks ago by
Charles Warden6.6k
Duarte, CA
Charles Warden6.6k wrote:

I think you've got the confusion of counts typically being for genes, and StringTie being used for transcript estimations.

However, I see a couple things that may be worth mentioning:

1) In cufflinks, you should merge assemblies from different samples. This should be more robust than any individual assembly. However, while cufflinks has a separate cuffmerge command, Stringtie has a --merge parameter to be called from the stringtie executable.

2) Assuming you have gene symbol annotations, it is possible to calculate unique counts for the gene ID (although it is very important you not use the transcript ID, which will result in many more ambiguous reads, since each isoform will be thought of as an independent gene). Also, unless you are defining a novel gene, I think this arguably makes things unnecessary complicated. For example, you could use known annotations for most genes, and use derfinder to look for regions with no overlap with known genes (although, I would usually start with a conservative set of genes like RefSeq, but you would probably want to use a larger gene set like GENCODE to look for regions with a lack of annotations). Plus, you don't have functional annotations for the novel genes (that would be an advantage to something like MiTranscriptome for lncRNAs, since the name indicates the tumor type enrichment - where you are really quantifying cufflinks annotations from a large number of samples that somebody else processed, for your smaller number of samples).

2b) If you still want to use gene analysis from transcript annotations, summing counts is an option for programs like Salmon. I don't remember if either counts or effective counts are provided for StringTie, but that is theoretically an option.

Similarly, ATpoint had the good suggestion of using tximport for a Salmon quantification.

3) You could arguably use a larger set of exons for something like DEXSeq / JunctionSeq analysis for known genes. However, I think you would still want to primarily want to focus on extensions of known genes. You can also look for novel exons for a selected number of known genes with SGSeq.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Charles Warden6.6k
1

Hi Charles Warden, Thank you for your extensive explanation, My understanding is that the path HISAT,StringTie and Ballgown it's the most updated version of the Tuxedo suite and are used for differential gene expression analysis as well.

All I want to do is learning a workflow to compare the gene level expression of different samples and specific genes of my interest.

ADD REPLYlink written 4 weeks ago by Morris_Chair90
1

You are welcome.

While I thought this was relevant information for the discussion (and the only "answer" instead of a "comment"), I would question whether this should be called the "accepted answer."

If it is your own answer, do you have the option to "moderate" to move a comment to an answer? I think your stringtie -eB -G transcripts.gff is a relevant part of the answer (but I would recommend either using known annotations, the merged assembly, or using a unique gene count instead of stringtie). If you don't have the "moderate" option (and you would like to do this), I would be glad to help.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Charles Warden6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1301 users visited in the last hour