Question

Create GFF from de novo assembly to input on htseq-counts

0

Entering edit mode

10.9 years ago

Leonardo ▴ 30

Hi fellows,

As a new one on Bioinfo world, I'm experiencing dificulties while learning about. I have a RNA-seq (Illumina 75-100bp) dataset and I used our lab's reference de novo transcriptome to align it and produce a estimation of gene/transcripts expression with RSEM. So, I was planning to work downstream with the RSEM output as input for edgeR, but now I know that isn't the best thing to do. Now I'm planning working with htseq -> DESeq2, but for htseq it's necessary a GFF/GTF file as input, and as we only have a de novo transcriptome assemble, I'm pretty confused now, even after reading about it. Is it possible to create a GFF from this assemble? If I align (bowtie way) my libraries to the reference, would be possible to create a GFF from the .sam output file? Also, is it possible working with htseq without a GFF as an alternative?

To note, this is the way our reference looks like:

>transcript_1
TCTAGTATGTCTTTTCCTCGTTTATTACTTTCTGTATTTTGTGTTAAAAAAAAAGAAAGG
ATCAATGGGAAGGACCAGCGTGCTGAAAAGATTTTGCTTT
>transcript_2
CCATAGGGATGTCAGCATTCTTTTGAAGTCCTCGTAGTCCTGAGGTACAGCCCCTTTCTG
CACTAAGTACCGGTGTAAGTATTTAATTGGCGCTGTTCTGGCAATCTCTTCAATGAAAGC

Thanks in advance!

Leonardo

Unicamp/BR

RNA-Seq genome GFF GTF htseq • 9.3k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Leonardo ▴ 30

0

Entering edit mode

Why is RSEM not good for edgeR? "So, I was planning to work downstream with the RSEM output as input for edgeR, but now I know that isn't the best thing to do" Is your raw data SE or PE? How did you perform your de novo assembly?

ADD REPLY • link 10.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Ugg, this comes up from time to time. Search the bioconductor email list for a very very large number of posts about RSEM's counts, which aren't integers.

ADD REPLY • link 10.9 years ago by Devon Ryan 105k

0

Entering edit mode

You can take the RSEM output and give limma/voom a try. The limma authors have suggested that that's reasonable before (e.g., here).

ADD REPLY • link 10.9 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Stephen, thanks for replying. My data are 14 SE samples (C x T) . I wasn't the one who performed the de novo assembly, but I know that it was using trinity (input of ~600 million reads) with a pipeline for filtrations. So, the "sample" I put is the final file we have. It's a fasta file with ~191000 contigs called as "transcripts". I've said about RSEM output because I've read about the "expected_counts" aren't so precise as "counts", but RSEM don't produce counts. Actually this is another point that's not so clear to me, because some people says it's not so good using "expected_counts" and other ones says it could be possible. DESEq produces counts, but requires a GFF/GTF as input and I don't have yhis file. An alternative way I'm considring now is: performing individual cufflinks for all 14 samples -> get GFFs -> merge all 14 GFFs -> use this GFF on htseq -> htseq-counts. Have you already worked with expected_counts on edgeR?

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Leonardo ▴ 30

0

Entering edit mode

If your assembly was constructed, merging all 14 samples as inputs to create a single assembly, then you could use the Trinity downstream analysis pipeline for differential expression between all 14 samples. You could map the raw reads, from each sample, back to the new assembly to get abundance estimates with RSEM. Then you use edgeR to get DE results utilizing the outputs from RSEM. With the new version of Trinity, you'll now get, in the edgeR directory, up-regulated genes given the parameters you provide to edgeR, for each comparison of samples. You can do this pipeline for both transcripts and genes.

ADD REPLY • link 10.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

The assembly wasn't constructed with my data, but with data from the same species in different conditions. I'm using this assembly to map my 14 samples.

ADD REPLY • link 10.9 years ago by Leonardo ▴ 30

0

Entering edit mode

Ok, got it. Then Devon's suggestion of aligning using bowtie and samtools is probably the best solution.

ADD REPLY • link 10.9 years ago by st.ph.n ★ 2.7k

Ram · Answer 1 · 2014-08-28

3

Entering edit mode

10.9 years ago

Devon Ryan 105k

With a de novo transcriptome you can actually skip htseq-count entirely :) The reason why is that each contig in the SAM/BAM file represent an individual transcript/gene already, so you just need to count non-multimappers per contig. A simple way to do this is to filter out multimappers (if you're using bowtie2, then samtools view -q 2 -Sb foo.bam > foo.filtered), sort and index the result and then just use samtools idxstats. The result will have the counts as one of the columns (the 3rd column I think).

Having said that, for a de novo transcriptome, you might be better off with RSEM -> limma/voom, due simply to getting a bit more reliable numbers with RSEM. I say this because it's likely that your de novo transcriptome has a bunch of multi-transcript genes, which won't be handled well by using the aforementioned method.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks Devon, I'll try it!

ADD REPLY • link 10.9 years ago by Leonardo ▴ 30

0

Entering edit mode

Hi Devon, another doubt: should this counting non-multimappers step be done by using a outputed RSEM.bam or a BOWTIE2.bam? I'm asking it because mapping with bowtie2 in local mode can produce more alignments than using RSEM, because of RSEM don't consider as valid gapped alignments.

ADD REPLY • link 10.9 years ago by Leonardo ▴ 30

0

Entering edit mode

You can go ahead and use bowtie2.bam (i.e., whatever produces better alignments for this dataset) for that. I'd be curious to hear what happens if you go bowtie2->DESeq2/edgeR/whatever and then compare that to RSEM->limma/voom. I imagine you'd have some overlapping results in both, but it'd be interesting to see how much. You might give them both a try and see what method ends up validating better (since I assume you'll try to validate the results).

ADD REPLY • link 10.9 years ago by Devon Ryan 105k

1

Entering edit mode

Well, the counts differences between bowtie.bam and RSEM.bam were hudge. Greater than I have thought. As an example:

geneID    C_bt2  T_bt2  C_rsem T_rsem
transcript_27 324 1925 164 901
transcript_28 2 29 0 1
transcript_29 0 0 0 0
transcript_30 0 0 0 0
transcript_35 6 24 0 0
transcript_36 582 3872 271 1826
transcript_37 192 1011 90 470

I'll test the DE pipelines and report here.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Leonardo ▴ 30

0

Entering edit mode

Can I do this for TopHat de novo assembly files? as in TopHat.bam files -->samtools sort ---> samtools index --->samtools idxstats for counts?

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 9.4 years ago by kanika.151 ▴ 160

0

Entering edit mode

I'd be hesitant to do that with de novo assemblies. I would personally prefer to use salmon or kallisto for something like that, I expect the results would be better.

ADD REPLY • link 9.4 years ago by Devon Ryan 105k

0

Entering edit mode

But, I have read several papers using samtools and getting the counts why would it defer?

ADD REPLY • link 9.4 years ago by kanika.151 ▴ 160

0

Entering edit mode

Typically you'll either toss a lot of the reads due to multimapping or double count them if you do this. Neither of those are desirable. You'll see people doing a lot of bad things in the literature.

ADD REPLY • link 9.4 years ago by Devon Ryan 105k

0

Entering edit mode

Hello Devon,

I did De novo transcriptome assemblies ---> Kallisto/Sailish index ---> Kallisto/Sailfish quantification ---> EBSeq, NOISeq and edgeR

To find differentially expressed transcripts... Is this okay?

ADD REPLY • link 9.2 years ago by kanika.151 ▴ 160

1

Entering edit mode

Looks OK to me!

ADD REPLY • link 9.2 years ago by Devon Ryan 105k