Dear all, after generating a bunch of files with the StringTie command line, I want to use DESeq2 for differential expression analysis. I understand that I have to create a matrix of read counts and that the tool to use is prepDE.py:
prepDE.py -i list_gtf.txt
my output is
0 Treated_1
1 Treated_2
Traceback (most recent call last):
File "/home/miniconda2/bin/prepDE.py", line 257, in <module>
geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'Treated_2'
Here is my list
[@ws7910 Files.GTF]$ cat list_gtf.txt
Treated_1 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_1.gtf
Treated_2 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_2.gtf
Treated_3 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_3.gtf
Untreated_1 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_1.gtf
Untreated_2 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_2.gtf
Untreated_3 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_3.gtf
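One thing worth ruling out before rerunning prepDE.py is whether the listed GTFs all describe the same set of transcripts, which is what you would expect if every sample was estimated with -e against one annotation. Nothing in this thread confirms that this is what triggers the KeyError, so treat the following as a diagnostic sketch only. It assumes Python 3 and the two-column layout of list_gtf.txt shown above; the function names are made up for illustration.

```python
import re
from pathlib import Path

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def read_sample_list(text):
    """Parse 'sample_id path' lines, the two-column layout of list_gtf.txt."""
    samples = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sample_id, path = line.split(None, 1)
        samples[sample_id] = path.strip()
    return samples

def transcript_ids(gtf_lines):
    """Collect transcript_id values from the 'transcript' rows of a GTF."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "transcript":
            match = TRANSCRIPT_ID.search(fields[8])
            if match:
                ids.add(match.group(1))
    return ids

def compare_samples(ids_by_sample):
    """For each sample, return the transcript IDs not shared by every sample."""
    shared = set.intersection(*ids_by_sample.values())
    return {name: ids - shared for name, ids in ids_by_sample.items()}

def check_list_file(list_path):
    """Read the list file, load each GTF, and compare transcript ID sets."""
    samples = read_sample_list(Path(list_path).read_text())
    ids_by_sample = {}
    for name, gtf in samples.items():
        with open(gtf) as handle:
            ids_by_sample[name] = transcript_ids(handle)
    return compare_samples(ids_by_sample)
```

Calling check_list_file("list_gtf.txt") returns, for each sample, the transcript IDs that are not present in every sample. Empty sets everywhere would mean the files agree on their transcripts.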
The StringTie website says that for this purpose I have to use “files generated by StringTie (run with the -e parameter).” I ran several StringTie commands; which of them was I supposed to run with the -e parameter? The error is probably due to that.
my python version is Python 2.7.5
Thanks for support
From your StringTie output (transcript coverage), use tximport to summarize to the gene level (see the manual), and then use DESeqDataSetFromTximport() from DESeq2. Read the DESeq2 manual thoroughly; it is outstandingly comprehensive. See also Stringtie to DESeq2, and remember to use both Google and the search function. These standard questions have been asked many times before. In my experience you learn most (and in a sustainable fashion) if you solve these low-level questions yourself through extensive web research. Spoonfeeding is convenient but won't help you in the long term. There are many good resources like blogs and forum posts out there on the web, use them ;-)
Hi ATpoint,
I have the paper "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown" with me, and for generating the count table it shows the same command line as the link you posted.
What looks weird to me is that I have to create all these .ctab files, which are named the same way for each of the BAM files, so I wonder how we can distinguish one sample from another.
I agree with you about using Google to search, and believe me, I do so before posting here. The results are often very helpful, but at other times very confusing. For me, asking is also a way to be more confident that I'm doing things right.
Thank you for the answer
I haven't used ballgown or stringtie in a while because for well-annotated transcriptomes I never felt the need to assemble transcripts. You might consider quantifying your reads with salmon, then aggregating to gene level with tximport and testing for differential expression with DESeq2. This is a well-supported and recommended pipeline with low computational cost, and it is pretty fast; plus, salmon has features to correct for GC bias. It is my personal go-to pipeline for standard RNA-seq.
I tried salmon but I got a low percentage of mapped reads for one sample, so I decided to go with stringtie. I don't plan to use ballgown anyway.
I have an OT question: do you think that DESeq2 and cummeRbund do the same job? For example, can I use DESeq2 to see the differential expression of specific genes of interest, or only the top 10 as in the examples?
Thanks
If you have a low mapping rate, this is not a problem with salmon but with your library. If you get a higher percentage with hisat2, you should check why that is. It could well be that you have a lot of genomic DNA or rRNA contamination, which salmon will output as unmapped. read_distribution.py from RSeQC is an option to get an idea of where your reads mapped (UTR, exon, intron, etc.). What is a "low percentage"? And what is the read length, and is the data single- or paired-end? Read length can have a big influence on mapping rate. As for cummeRbund, it is a downstream tool of cufflinks. I never used it, so I cannot give much input on what exactly it does, but I would stick with the well-maintained DESeq2. If you want to use cummeRbund you would need to re-run everything with cufflinks. Do you have human data, and what is the final goal? Differential analysis on the gene level, isoform detection, novel transcripts, alternative splicing?
Hi ATpoint, to answer some of your questions: by low percentage I mean less than 20% mapping, unlike other tools with over 70%, and the read length is 101. I have human data; the final goal is to see which genes are upregulated or downregulated, along with a set of genes of interest (that's why I was asking if DESeq2 can do that). Can I detect different isoforms expressed from a gene with salmon and DESeq2? That is another aim.
To be honest, at the moment it is not a big deal because I'm studying and practicing with RNA-seq files that I had in my lab; the files for the actual experiment will come soon and I want to be ready to process them easily.
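On the genes-of-interest question: DESeq2 reports statistics for every gene it tests, not only a top-10 list, so one simple route is to export the results table and pull out the rows you care about. The sketch below assumes the table was written to CSV with a column named gene (a hypothetical name; by default the gene IDs are row names) next to DESeq2's log2FoldChange and padj columns. The gene names and numbers are toy values for illustration, not real results.

```python
import csv
import io

def select_genes(results_csv, genes_of_interest):
    """Return the rows of a results table for the requested genes, keyed by gene."""
    wanted = set(genes_of_interest)
    reader = csv.DictReader(results_csv)
    return {row["gene"]: row for row in reader if row["gene"] in wanted}

def direction(row, alpha=0.05):
    """Label a gene as up, down, or not significant at the given adjusted p-value."""
    padj = row["padj"]
    # DESeq2 writes NA when no adjusted p-value was computed for a gene.
    if padj in ("", "NA") or float(padj) >= alpha:
        return "not significant"
    return "up" if float(row["log2FoldChange"]) > 0 else "down"

# Toy table with made-up values, standing in for an exported results file.
toy_table = io.StringIO(
    "gene,log2FoldChange,padj\n"
    "GENE_A,2.0,0.001\n"
    "GENE_B,-1.5,0.2\n"
    "GENE_C,-3.0,0.01\n"
    "GENE_D,0.5,NA\n"
)

for gene, row in select_genes(toy_table, ["GENE_A", "GENE_C", "GENE_D"]).items():
    print(gene, row["log2FoldChange"], row["padj"], direction(row))
```

With a real file, pass an open file handle instead of the io.StringIO object. The 0.05 cutoff is only a common default, so adjust alpha to whatever threshold your analysis uses.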
Hi ATpoint, although I get a low mapping rate with salmon, I still want to pursue it.
I have found this approach for tximport
My question is: do you rename the quant.sf files before compressing them? Otherwise, how do you know which file belongs to which condition?
Thank you
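Renaming should not be necessary if each sample was quantified into its own output directory, because the directory name then identifies the sample and tximport accepts a named vector of file paths. Here is a small sketch of building such a sample table, assuming a hypothetical layout like salmon_out/Treated_1/quant.sf and sample names of the form Condition_Replicate as used earlier in this thread.

```python
from pathlib import PurePosixPath

def sample_table(quant_paths):
    """Build sample/condition/path rows from per-sample quant.sf paths."""
    rows = []
    for path in sorted(quant_paths):
        # The parent directory carries the sample name, e.g. "Treated_1".
        sample = PurePosixPath(path).parent.name
        # Strip the trailing replicate number to get the condition, e.g. "Treated".
        condition = sample.rsplit("_", 1)[0]
        rows.append({"sample": sample, "condition": condition, "path": path})
    return rows

# Hypothetical paths, one quant.sf per sample directory.
example_paths = [
    "salmon_out/Untreated_1/quant.sf",
    "salmon_out/Treated_1/quant.sf",
    "salmon_out/Treated_2/quant.sf",
]

for row in sample_table(example_paths):
    print(row["sample"], row["condition"], row["path"], sep="\t")
```

The resulting table can be saved and read into R to supply both the named file vector and the condition column of the sample sheet.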
Read this StringTie tutorial to see exactly how to run prepDE.py
That's what I read, and that's why the sample names were in the first column. Just in case you need it in the future: they are needed ;)
Delete the first column; the list must specify just the path to each GTF
Hi Buffo, thanks for the answer, but I get errors if I delete the first column
Hello, may I ask: does this mean that if you want both novel transcript analysis and DESeq2, you have to run StringTie twice, once with -e and once without?
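As far as I understand the HISAT/StringTie/Ballgown protocol mentioned earlier in the thread, yes: each sample is assembled without -e, the assemblies are merged, and then each sample is re-estimated with -e against the merged annotation. The sketch below only builds and prints those command lines; it does not run anything. The flags (-G, -o, -e, -B, --merge) are StringTie options, while every file and directory name is a placeholder.

```python
def assemble_cmd(bam, reference_gtf, out_gtf):
    """Pass 1: assemble one sample; without -e, novel transcripts can be reported."""
    return ["stringtie", "-G", reference_gtf, "-o", out_gtf, bam]

def merge_cmd(reference_gtf, merged_gtf, gtf_list_file):
    """Merge the per-sample assemblies listed (one path per line) in gtf_list_file."""
    return ["stringtie", "--merge", "-G", reference_gtf, "-o", merged_gtf, gtf_list_file]

def quantify_cmd(bam, merged_gtf, out_gtf):
    """Pass 2: -e limits estimation to merged_gtf; -B also writes Ballgown .ctab files."""
    return ["stringtie", "-e", "-B", "-G", merged_gtf, "-o", out_gtf, bam]

def plan(samples, reference_gtf):
    """Return the full list of commands: assemble all, merge once, quantify all."""
    commands = [
        assemble_cmd(f"{s}.bam", reference_gtf, f"assembly/{s}.gtf") for s in samples
    ]
    commands.append(merge_cmd(reference_gtf, "merged.gtf", "mergelist.txt"))
    # One output directory per sample keeps the identically named .ctab files apart.
    commands.extend(
        quantify_cmd(f"{s}.bam", "merged.gtf", f"quant/{s}/{s}.gtf") for s in samples
    )
    return commands

for cmd in plan(["Treated_1", "Untreated_1"], "reference.gtf"):
    print(" ".join(cmd))
```

If novel transcripts are not needed, the first two steps can be skipped and the -e pass run directly against the reference annotation. The GTFs from the -e pass are the ones to list for prepDE.py.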