Hi fellows,
As a new one on Bioinfo world, I'm experiencing dificulties while learning about. I have a RNA-seq (Illumina 75-100bp) dataset and I used our lab's reference de novo transcriptome to align it and produce a estimation of gene/transcripts expression with RSEM. So, I was planning to work downstream with the RSEM output as input for edgeR, but now I know that isn't the best thing to do. Now I'm planning working with htseq -> DESeq2, but for htseq it's necessary a GFF/GTF file as input, and as we only have a de novo transcriptome assemble, I'm pretty confused now, even after reading about it. Is it possible to create a GFF from this assemble? If I align (bowtie way) my libraries to the reference, would be possible to create a GFF from the .sam output file? Also, is it possible working with htseq without a GFF as an alternative?
To note, this is the way our reference looks like:
>transcript_1
TCTAGTATGTCTTTTCCTCGTTTATTACTTTCTGTATTTTGTGTTAAAAAAAAAGAAAGG
ATCAATGGGAAGGACCAGCGTGCTGAAAAGATTTTGCTTT
>transcript_2
CCATAGGGATGTCAGCATTCTTTTGAAGTCCTCGTAGTCCTGAGGTACAGCCCCTTTCTG
CACTAAGTACCGGTGTAAGTATTTAATTGGCGCTGTTCTGGCAATCTCTTCAATGAAAGC
Thanks in advance!
Leonardo
Unicamp/BR
Why is RSEM not good for edgeR? "So, I was planning to work downstream with the RSEM output as input for edgeR, but now I know that isn't the best thing to do" Is your raw data SE or PE? How did you perform your de novo assembly?
Ugg, this comes up from time to time. Search the bioconductor email list for a very very large number of posts about RSEM's counts, which aren't integers.
You can take the RSEM output and give limma/voom a try. The limma authors have suggested that that's reasonable before (e.g., here).
Hi Stephen, thanks for replying. My data are 14 SE samples (C x T) . I wasn't the one who performed the de novo assembly, but I know that it was using trinity (input of ~600 million reads) with a pipeline for filtrations. So, the "sample" I put is the final file we have. It's a fasta file with ~191000 contigs called as "transcripts". I've said about RSEM output because I've read about the "expected_counts" aren't so precise as "counts", but RSEM don't produce counts. Actually this is another point that's not so clear to me, because some people says it's not so good using "expected_counts" and other ones says it could be possible. DESEq produces counts, but requires a GFF/GTF as input and I don't have yhis file. An alternative way I'm considring now is: performing individual cufflinks for all 14 samples -> get GFFs -> merge all 14 GFFs -> use this GFF on htseq -> htseq-counts. Have you already worked with expected_counts on edgeR?
If your assembly was constructed, merging all 14 samples as inputs to create a single assembly, then you could use the Trinity downstream analysis pipeline for differential expression between all 14 samples. You could map the raw reads, from each sample, back to the new assembly to get abundance estimates with RSEM. Then you use edgeR to get DE results utilizing the outputs from RSEM. With the new version of Trinity, you'll now get, in the edgeR directory, up-regulated genes given the parameters you provide to edgeR, for each comparison of samples. You can do this pipeline for both transcripts and genes.
The assembly wasn't constructed with my data, but with data from the same species in different conditions. I'm using this assembly to map my 14 samples.
Ok, got it. Then Devon's suggestion of aligning using bowtie and samtools is probably the best solution.