Differential gene expression from RNAseq data. Before or after annotation?
1
0
3.3 years ago
pablo61991 ▴ 90

Hello community,

I have this doubt, and I have read different versions of the answer. However, most of them are from several years ago, so, in order to get an up-to-date answer, I come here to be fed by your wisdom.

When we work with non-model organisms and perform our assemblies, at the end we have a .fasta file with all our transcripts, which correspond either to true transcripts or to artifacts/nonsense/bad transcripts. I know there are several options to prioritize "good >> bad", like Transrate's "good transcript" subset or EvidentialGene's ".okay" (or even ".okay + .alt") subset.

Well, let's suppose we have run one of these filters and we have also checked the mapping percentage, the BUSCO score, and the Transrate score, and we have our confident .fasta file with our good transcripts. What is the correct order to proceed at this point?

Run some "counting" tool like kallisto on our whole transcript file? Or try some kind of annotation first and then run kallisto only on the transcripts with a hit?

I have this doubt because I usually find an annotation rate of (being generous) more or less 50%, so I don't know whether the other 50% are false transcripts or just unknown transcripts.

For example, if my next step is to feed DESeq2 with these counts, I need to use tximport and give it the tx2gene file, which comes from my annotation step. But if I have run kallisto over the whole unannotated transcriptome, then I find a lot of "lost" information. Would it be better if I took out these unannotated transcripts and "forced" kallisto to find a place for these reads in my "annotated" assembly?
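(Side note: the tx2gene file tximport needs is only a two-column transcript-to-gene table, so when the assembler encodes the gene in its transcript IDs it can be generated without any functional annotation. A minimal Python sketch, assuming Trinity-style IDs where stripping the trailing "_i<N>" isoform suffix yields the gene ID; the IDs and file name below are illustrative:)

```python
import csv
import re

def trinity_tx2gene(transcript_ids):
    """Pair each Trinity-style transcript ID with its gene ID
    (obtained by dropping the trailing _i<N> isoform suffix)."""
    return [(tx, re.sub(r"_i\d+$", "", tx)) for tx in transcript_ids]

pairs = trinity_tx2gene([
    "TRINITY_DN0_c0_g1_i1",
    "TRINITY_DN0_c0_g1_i2",
    "TRINITY_DN0_c0_g2_i1",
])
# Write the two-column table tximport expects (transcript first, gene second).
with open("tx2gene.tsv", "w", newline="") as out:
    csv.writer(out, delimiter="\t").writerows(pairs)
```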

Pablo

rnaseq annotation differential gene expression • 2.8k views
1
3.3 years ago
Macspider ★ 3.4k

we are working with non-model organisms [...] usually I found a % of annotation (being generous) of more or less 50% [...] i don't know if the other 50% are false transcript or just unknown transcripts

Well, with a non-model organism you can't expect 100% matches. Many sequences won't be annotated in closely related organisms either.

Differential gene expression from RNAseq data. Before or after annotation?

Do it on the set of transcripts that you consider reliable. The functional annotation doesn't really tell you anything about the ones that you can't annotate: it only adds info to the ones you can annotate.

0

Ok, I can understand this and the logic behind "functional annotation doesn't really tell you anything about the ones that you can't annotate", but in that case, how do I deal with the requirement of the tx2gene file?

0

Generate raw counts with a tool that counts features at the gene level (htseq-count, featureCounts) and then pass those counts to DESeq2. Annotation is not required in any of these steps.

0

I have another problem with using an unannotated set. What about scattered counts which in fact come from the same "gene"? For example, if I review my annotation, I can sometimes find the same BLAST hit more than 10 times. If I don't give DESeq2 that information, it will interpret each one as a single "event" with an X number of reads assigned to it.

0

DESeq2 only normalizes counts, then estimates the dispersion, and calculates LFC for each gene. It doesn't even know what an annotation is, it just wants a matrix to process.

The feature annotation is only a label that you append to your genes to know what their most likely function is, but all that counts in these steps is the sequence identity when mapping.

0

Yes, but when you feed DESeq2 with your tx2gene file, it can collapse all the counts which come from the different transcripts of a single gene. So, in order to find a differentially expressed gene, it is not the same to have the read counts distributed across 10 different transcripts of the same gene as to have all those reads assigned to that gene.

And in recent papers Michael Love states the advantages of working at this "gene" level rather than at the "transcript" level.

0

I do work at the gene level too, but I provide DESeq2 with the counts per gene instead of the counts per transcript. And it is as easy as summing up the counts of the transcripts belonging to each gene. It only requires a very little bit of python / perl...
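A minimal sketch of that summing step in Python, assuming kallisto's abundance.tsv output (which has target_id and est_counts columns) and a transcript-to-gene dict; the IDs below are made up:

```python
import csv
import io
from collections import defaultdict

def gene_counts_from_kallisto(abundance_fh, tx2gene):
    """Sum kallisto's per-transcript est_counts into per-gene counts.
    abundance_fh: an open handle on kallisto's abundance.tsv."""
    counts = defaultdict(float)
    for row in csv.DictReader(abundance_fh, delimiter="\t"):
        # Transcripts missing from tx2gene keep their own ID as the "gene".
        gene = tx2gene.get(row["target_id"], row["target_id"])
        counts[gene] += float(row["est_counts"])
    return dict(counts)

# Tiny in-memory example instead of a real file:
demo = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t500\t400\t10.0\t1.0\n"
    "tx2\t600\t500\t5.0\t0.5\n"
)
print(gene_counts_from_kallisto(demo, {"tx1": "geneA", "tx2": "geneA"}))
# -> {'geneA': 15.0}
```

Note that this simple summing ignores how the quantifier distributed multi-mapping reads among transcripts, which is exactly the caveat discussed further down in this thread.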

0

Yeah, but in fact I want to avoid assuming that each transcript is a gene, because in my experience I have anywhere between 1 and more than 10 transcripts per gene. I cannot see how your approach takes care of the problem of scattered reads which in fact correspond to the same gene.

0

Do your counting at the gene level instead of the transcript level. If you are going the alignment route with, say, HISAT2, you just need to take the BAMs along with a gene annotation file and run featureCounts, then import that into DESeq2 for the estimation and differential expression analysis; no need for tx2gene. If you are going to do DE at the gene level anyway, versus the transcript level, this avoids all the headaches of reads having been assigned to multiple transcripts in the counting process.

If you then want to investigate individual transcripts, you can do that as well. HISAT2 -> StringTie -> DESeq2 will work just fine. Where you add annotations in the process doesn't really matter, as the estimated transcripts will just receive an internal ID from StringTie anyway.
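For the featureCounts route, the output is a tab-separated table: a leading "#" comment line, then columns Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM. A small Python sketch parsing it into a gene-by-sample matrix ready to hand to DESeq2 (the sample names here are invented):

```python
import csv
import io

def read_featurecounts(fh):
    """Parse a featureCounts table into (sample_names, {gene: [counts]}).
    Skips leading '#' comment lines; count columns start after 'Length'."""
    rows = csv.reader((line for line in fh if not line.startswith("#")),
                      delimiter="\t")
    header = next(rows)
    samples = header[6:]  # everything after Geneid..Length is a sample column
    matrix = {r[0]: [int(c) for c in r[6:]] for r in rows}
    return samples, matrix

demo = io.StringIO(
    "# Program:featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "g1\tchr1\t1\t100\t+\t100\t12\t7\n"
)
samples, matrix = read_featurecounts(demo)
print(samples, matrix)
# -> ['sampleA.bam', 'sampleB.bam'] {'g1': [12, 7]}
```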

0

Thank you I'll try this.

0

1 - Map the reads against the transcriptome

2 - sum up transcript counts into gene counts

from collections import defaultdict

gene_counts = defaultdict(float)
for transcript, n in transcript_counts.items():  # counts per transcript
    gene_counts[tx2gene[transcript]] += n        # tx2gene maps transcript -> gene


3 - generate a table with counts

4 - feed it to DESeq2

0

Thank you for your time, I'll try this

0

Unless the method you used only assigns a read count to a single transcript this really isn't a good way to do it. You are better off doing the counts at the gene level in the first place.

0

It does assign it to a single transcript. You shouldn't count multi-mapping reads in expression analyses, or at least you should select one mapping position per read.

1

But when we face de novo assemblies from non-model organisms, redundancy is just a common scenario. As a consequence, multi-mapping reads are a common issue. Hence my defence of the "gene level".

As in this thread, I have received contradictory input on this question from colleagues working in the area. Some of them prefer the transcript level and are less worried about "highly similar transcripts capturing reads which in fact correspond to the same gene, just a shorter version"; the others prefer the gene level but note that "several transcripts which look like the same gene could only be considered the same gene if they are 100% identical. If not, they could just be different isoforms".

I don't know if nowadays there is a gold standard when we are working on non-model organisms with weak annotations, a high redundancy level, and also a relatively high number of artifacts/incorrect transcripts.

1

I work on humans and even there I do my gene-level and transcript-level analyses mostly independently from one another. We have extremely rapid count estimators and programs for dealing with these so it never makes sense to me to save a bit of time with a shortcut that introduces some additional wrinkles into my statistical analyses.

0

If I am doing a full alignment against a transcriptome, multi-mapping gives me information that I can use. I can still run that process and then get my counts at the gene level from a program like featureCounts, which I can give specific parameters to on how to handle multi-mapping reads. That was my main point. There are good and very fast programs that do this, versus writing a simple counting script that doesn't.

1

@DanGaston definitely my point too, what I wrote was a pseudocode to show that the logic behind it is not that hard.

I usually accept multi-mapping reads and then select the primary alignment for each read. For two equally good alignments (i.e. same score), many mapping programs, like the Tuxedo-suite ones, will decide randomly which one to label as "primary". This way, you will still have only one alignment counted per read, and it will be randomly assigned to one of the many locations. The same can be said for all the multi-mapping reads generated in the same region.

Example: 3 transcripts are identical. Reads from each of those map on all the three. Therefore the coverage for each is 3x with respect to transcripts that harbor only uniquely mapped reads. Selecting only the primary alignments will likely reduce the coverage to 1x in these genes as well, because the primary alignment is randomly chosen for each read that is mapped there.
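This primary-alignment selection can be done straight from the SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). A small Python sketch counting primary alignments per reference, with made-up SAM records:

```python
from collections import Counter

# SAM FLAG bits that disqualify a record from being a counted primary alignment.
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800

def primary_counts(sam_lines):
    """Count reads per reference, keeping only primary alignments."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        counts[rname] += 1
    return counts

demo = [
    "@SQ\tSN:tx1\tLN:500",
    "r1\t0\ttx1\t10\t60\t50M\t*\t0\t0\t*\t*",   # primary on tx1, counted
    "r1\t256\ttx2\t10\t0\t50M\t*\t0\t0\t*\t*",  # secondary on tx2, dropped
    "r2\t0\ttx2\t30\t60\t50M\t*\t0\t0\t*\t*",   # primary on tx2, counted
]
print(primary_counts(demo))
# -> Counter({'tx1': 1, 'tx2': 1})
```

In practice you would stream real alignments with samtools or pysam rather than parse SAM text by hand; this just shows the flag logic behind "one counted location per read".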

If I am not making sense to you, please tell me, I'll try to explain myself better.

1

I got you, well explained. That makes perfect sense. And the explanation is useful to the OP and other's reading this.

0

The main problem here is the big difference in background between forum members when the topic is something as complex as this one.

You know... (some) supervisors try to bring fancy "new" technologies into their projects and then look for a random PhD student without ANY background on the topic, who finally gets lost in a big abyss.