Question: Create GFF from de novo assembly to input on htseq-counts
0
gravatar for Leonardo
5.9 years ago by
Leonardo30
Brazil
Leonardo30 wrote:

Hi fellows,

As a new one on Bioinfo world, I'm experiencing dificulties while learning about. I have a RNA-seq (Illumina 75-100bp) dataset and I used our lab's reference de novo transcriptome to align it and produce a estimation of gene/transcripts expression with RSEM. So, I was planning to work downstream with the RSEM output as input for edgeR, but now I know that isn't the best thing to do. Now I'm planning working with htseq -> DESeq2, but for htseq it's necessary a GFF/GTF file as input, and as we only have a de novo transcriptome assemble, I'm pretty confused now, even after reading about it. Is it possible to create a GFF from this assemble? If I align (bowtie way) my libraries to the reference, would be possible to create a GFF from the .sam output file? Also, is it possible working with htseq without a GFF as an alternative?

To note, this is the way our reference looks like:

>transcript_1
TCTAGTATGTCTTTTCCTCGTTTATTACTTTCTGTATTTTGTGTTAAAAAAAAAGAAAGG
ATCAATGGGAAGGACCAGCGTGCTGAAAAGATTTTGCTTT
>transcript_2
CCATAGGGATGTCAGCATTCTTTTGAAGTCCTCGTAGTCCTGAGGTACAGCCCCTTTCTG
CACTAAGTACCGGTGTAAGTATTTAATTGGCGCTGTTCTGGCAATCTCTTCAATGAAAGC

Thanks in advance!

Leonardo

Unicamp/BR

htseq rna-seq gff gtf genome • 5.9k views
ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Leonardo30

Why is RSEM not good for edgeR? "So, I was planning to work downstream with the RSEM output as input for edgeR, but now I know that isn't the best thing to do" Is your raw data SE or PE? How did you perform your de novo assembly?

ADD REPLYlink written 5.9 years ago by st.ph.n2.5k

Ugg, this comes up from time to time. Search the bioconductor email list for a very very large number of posts about RSEM's counts, which aren't integers.

ADD REPLYlink written 5.9 years ago by Devon Ryan95k

You can take the RSEM output and give limma/voom a try. The limma authors have suggested that that's reasonable before (e.g., here).

ADD REPLYlink written 5.9 years ago by Devon Ryan95k

Hi Stephen, thanks for replying. My data are 14 SE samples (C x T) . I wasn't the one who performed the de novo assembly, but I know that it was using trinity (input of ~600 million reads) with a pipeline for filtrations. So, the "sample" I put is the final file we have. It's a fasta file with ~191000 contigs called as "transcripts". I've said about RSEM output because I've read about the "expected_counts" aren't so precise as "counts", but RSEM don't produce counts. Actually this is another point that's not so clear to me, because some people says it's not so good using "expected_counts" and other ones says it could be possible.  DESEq produces counts, but requires a GFF/GTF as input and I don't have yhis file. An alternative way I'm considring now is: performing individual cufflinks for all 14 samples -> get GFFs -> merge all 14 GFFs -> use this GFF on htseq -> htseq-counts. Have you already worked with expected_counts on edgeR?

ADD REPLYlink written 5.9 years ago by Leonardo30

If your assembly was constructed, merging all 14 samples as inputs to create a single assembly, then you could use the Trinity downstream analysis pipeline for differential expression between all 14 samples. You could map the raw reads, from each sample, back to the new assembly to get abundance estimates with RSEM. Then you use edgeR to get DE results utilizing the outputs from RSEM. With the new version of Trinity, you'll now get, in the edgeR directory, up-regulated genes given the parameters you provide to edgeR, for each comparison of samples. You can do this pipeline for both transcripts and genes.

ADD REPLYlink written 5.9 years ago by st.ph.n2.5k

The assembly wasn't constructed with my data, but with data from the same species in different conditions. I'm using this assembly to map my 14 samples.

ADD REPLYlink written 5.9 years ago by Leonardo30

Ok, got it. Then Devon's suggestion of aligning using bowtie and samtools is probably the best solution.

ADD REPLYlink written 5.9 years ago by st.ph.n2.5k
3
gravatar for Devon Ryan
5.9 years ago by
Devon Ryan95k
Freiburg, Germany
Devon Ryan95k wrote:

With a de novo transcriptome you can actually skip htseq-count entirely :) The reason why is that each contig in the SAM/BAM file represent an individual transcript/gene already, so you just need to count non-multimappers per contig. A simple way to do this is to filter out multimappers (if you're using bowtie2, then samtools view -q 2 -Sb foo.bam > foo.filtered), sort and index the result and then just use samtools idxstats. The result will have the counts as one of the columns (the 3rd column I think).

Having said that, for a de novo transcriptome, you might be better off with RSEM -> limma/voom, due simply to getting a bit more reliable numbers with RSEM. I say this because it's likely that your de novo transcriptome has a bunch of multi-transcript genes, which won't be handled well by using the aforementioned method.
 

ADD COMMENTlink written 5.9 years ago by Devon Ryan95k

Thanks Devon, I'll try it!

ADD REPLYlink written 5.9 years ago by Leonardo30

Hi Devon, another doubt: should this counting non-multimappers step be done by using a outputed RSEM.bam or a BOWTIE2.bam? I'm asking it because mapping with bowtie2 in local mode can produce more alignments than using RSEM, because of RSEM don't consider as valid gapped alignments.

ADD REPLYlink written 5.9 years ago by Leonardo30

You can go ahead and use bowtie2.bam (i.e., whatever produces better alignments for this dataset) for that. I'd be curious to hear what happens if you go bowtie2->DESeq2/edgeR/whatever and then compare that to RSEM->limma/voom. I imagine you'd have some overlapping results in both, but it'd be interesting to see how much. You might give them both a try and see what method ends up validating better (since I assume you'll try to validate the results).

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Devon Ryan95k
1

Well, the counts differences between bowtie.bam and RSEM.bam were hudge. Greater than I have thought. As an example: 

geneID    C_bt2  T_bt2  C_rsem T_rsem

transcript_27 324 1925 164 901
transcript_28 2 29 0 1
transcript_29 0 0 0 0
transcript_30 0 0 0 0
transcript_35 6 24 0 0
transcript_36 582 3872 271 1826
transcript_37 192 1011 90 470

I'll test the DE pipelines and report here.

 

ADD REPLYlink written 5.9 years ago by Leonardo30

Can I do this for TopHat de novo assembly files? as in TopHat.bam files -->samtools sort ---> samtools index --->samtools idxstats for counts?

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by kanika.15180

I'd be hesitant to do that with de novo assemblies. I would personally prefer to use salmon or kallisto for something like that, I expect the results would be better.

ADD REPLYlink written 4.4 years ago by Devon Ryan95k

But, I have read several papers using samtools and getting the counts why would it defer?

ADD REPLYlink written 4.4 years ago by kanika.15180

Typically you'll either toss a lot of the reads due to multimapping or double count them if you do this. Neither of those are desirable. You'll see people doing a lot of bad things in the literature.

ADD REPLYlink written 4.4 years ago by Devon Ryan95k

Hello Devon,

I did De novo transcriptome assemblies ---> Kallisto/Sailish index ---> Kallisto/Sailfish quantification ---> EBSeq, NOISeq and edgeR

To find differentially expressed transcripts... Is this okay?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by kanika.15180
1

Looks OK to me!

ADD REPLYlink written 4.2 years ago by Devon Ryan95k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1656 users visited in the last hour