Question

DEG analysis from Salmon quantification of 3' TagSeq reads with unannotated reference transcriptome

0

2.3 years ago

Corey • 0

I have an unannotated reference transcriptome in FASTA format for a non-model organism, along with trimmed raw reads generated by 3' TagSeq from a two-factor experiment. I generated quantification files from these reads using Salmon, but I'm stuck after this point. From the several workflows I've found, it seems that any of the popular differential gene expression pipelines (e.g. DESeq2, edgeR) also needs a .gtf file to build a two-column data frame linking transcript IDs to gene IDs. I don't have one: the reference transcriptome is unannotated and there is no genome for this organism. Is there some way to get around this, to generate a GTF from what I have, or an alternative pipeline for estimating the "genes" that the transcripts correspond to, so that I can proceed with my differential expression analysis?

RNAseq DEG TagSeq • 1.2k views

updated 2.3 years ago by Michael Love ★ 2.6k • written 2.3 years ago by Corey • 0

4

2.3 years ago

Michael Love ★ 2.6k

Just to follow up on Gordon's post, you can read Salmon quantification into R without gene-level summarization with:

library(tximport)
# files: named character vector of paths to each sample's quant.sf
txi <- tximport(files, type="salmon", txOut=TRUE)

tximport imports data from a variety of upstream tools into a simple format: a list of matrices. Inferential replicates can also be imported, and if a more complex data structure is desired (a SummarizedExperiment), we have a companion package, tximeta.

We have pipelines in the tximport vignette for running with DESeq2 or other tools. This pipeline is preferred to just manually importing the counts alone from the Salmon output, as the tximport pipeline takes into account the effective transcript lengths, which have information about sample-specific biases.
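As a rough sketch of how the import could feed a two-factor DESeq2 analysis: the sample names, the factor names (genotype, treatment) and the salmon_out/ directory layout below are made up for illustration and are not from the original post. One caveat worth checking for this data type: the tximport vignette has a note on 3' tagged RNA-seq suggesting that length correction should not be applied to such data, so this sketch passes the rounded counts as a plain matrix rather than using the length-offset route.

```r
# Hypothetical sample sheet: one row per library, two experimental factors.
samples <- data.frame(
  sample    = paste0("s", 1:8),
  genotype  = factor(rep(c("A", "B"), each = 4)),
  treatment = factor(rep(c("control", "treated"), times = 4)),
  stringsAsFactors = FALSE
)
rownames(samples) <- samples$sample

# Assumed layout: salmon_out/<sample>/quant.sf
files <- file.path("salmon_out", samples$sample, "quant.sf")
names(files) <- samples$sample

# Only run the import and model fit when the Salmon output is actually present.
if (all(file.exists(files))) {
  # txOut = TRUE keeps transcript-level estimates, so no tx2gene table is needed.
  txi <- tximport::tximport(files, type = "salmon", txOut = TRUE)

  # 3' tag data: use the estimated counts directly, rounded to integers.
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = round(txi$counts),
    colData   = samples,
    design    = ~ genotype + treatment + genotype:treatment
  )
  dds <- DESeq2::DESeq(dds)
  print(DESeq2::resultsNames(dds))
}
```

The names on files become the column names of txi$counts, which is why they are set to match the row names of the sample sheet.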

score 4 · Accepted Answer · 2022-01-08

You don't need a GTF file. The TagSeq technology is 3' orientated, so you are essentially doing a gene-oriented analysis already.

Since your non-model organism seems to be lightly documented, I am also guessing that the reference transcriptome is mainly limited to the dominant transcript for each gene. If so, it is reasonable to expect relatively little overlap between the transcript sequences compared with what would be found in a comprehensive transcriptome.

For both these reasons, you can probably simply input the expected read counts from Salmon directly into edgeR or limma, neither of which requires integer counts. See edgeR::catchSalmon(), which can read the Salmon files into a suitable form. DESeq2 could also be used if the counts are rounded to integers.
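A minimal sketch of what that might look like in edgeR for a two-factor design follows. The sample names, the factor names (genotype, treatment) and the salmon_out/ directory layout are assumptions for illustration only. Note that catchSalmon() takes the Salmon output directories rather than the quant.sf files themselves, and as far as I know it relies on the jsonlite package to read Salmon's metadata.

```r
# Hypothetical sample sheet for a 2 x 2 design with two replicates per cell.
samples <- data.frame(
  sample    = paste0("s", 1:8),
  genotype  = factor(rep(c("A", "B"), each = 4)),
  treatment = factor(rep(c("control", "treated"), times = 4)),
  stringsAsFactors = FALSE
)

# Assumed layout: one Salmon output directory per sample.
paths <- file.path("salmon_out", samples$sample)

# Main effects plus interaction; the last column is the interaction term.
design <- model.matrix(~ genotype * treatment, data = samples)

# Only run the edgeR steps when the Salmon output is actually present.
if (all(dir.exists(paths))) {
  s <- edgeR::catchSalmon(paths)

  # Non-integer estimated counts are passed through unchanged.
  y <- edgeR::DGEList(counts = s$counts, samples = samples, genes = s$annotation)

  keep <- edgeR::filterByExpr(y, design)
  y <- y[keep, , keep.lib.sizes = FALSE]
  y <- edgeR::calcNormFactors(y)
  y <- edgeR::estimateDisp(y, design)

  fit <- edgeR::glmQLFit(y, design)
  qlf <- edgeR::glmQLFTest(fit, coef = ncol(design))  # test the interaction
  print(edgeR::topTags(qlf))
}
```

Each row here is a transcript from the reference, which under the reasoning above is treated as a stand-in for a gene.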

If you are worried about overlap between the transcript sequences, then set Salmon to generate bootstrap samples; catchSalmon() can use them to adjust for mapping ambiguity.
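To sketch the bootstrap route: Salmon writes bootstrap samples when salmon quant is run with its --numBootstraps option, and catchSalmon() then reports a per-transcript overdispersion estimate in its annotation table. The edgeR documentation illustrates dividing the counts by that overdispersion before building the DGEList. The helper name scale_by_overdispersion, the choice to leave transcripts with a missing estimate unscaled, and the salmon_out/ layout are my own assumptions, not part of edgeR.

```r
# Hypothetical helper: divide each transcript's counts (rows) by its
# overdispersion. Transcripts with a missing estimate are left unscaled,
# which is an assumption of this sketch rather than edgeR behaviour.
scale_by_overdispersion <- function(counts, overdispersion) {
  stopifnot(nrow(counts) == length(overdispersion))
  overdispersion[is.na(overdispersion)] <- 1
  counts / overdispersion  # vector is recycled down the rows of each column
}

# Assumed layout: one Salmon output directory per sample, run with bootstraps.
paths <- file.path("salmon_out", paste0("s", 1:8))

# Only run the edgeR steps when the Salmon output is actually present.
if (all(dir.exists(paths))) {
  s <- edgeR::catchSalmon(paths)
  scaled <- scale_by_overdispersion(s$counts, s$annotation$Overdispersion)
  y <- edgeR::DGEList(counts = scaled, genes = s$annotation)
}
```

The scaled counts can then go through the usual edgeR filtering, normalization and model fitting steps.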