De novo assembly of mRNA-seq has many transcripts binding to introns or genome parts (not protein coding genes)
1
0
Entering edit mode
10 months ago
Alessandra • 0

Hi all,

For the first time, I am working on some de novo mRNAseq data already analyzed by a company (the same that perfomed the sequencing). The samples are from a species for which we don't have an annotated genome. They assembled the transcriptome with trinity (which resulted in more than 100.000 different transcripts) and then they performed differential expression analysis with DESeq. The transcripts were identified using NR, NT, Pfam, GO, KEGG databases and I have the information of the first 10 best hits for each (if present). I also have information on the coding potential of all of the transcripts (done with BLASTX and TransDecoder, resulting in 37.000 sequences with coding potential).

I also have some mRNA-seq from zebrafish, where reads were all aligned to the known genome (around 35.000 genes). The experiment is the same, so the objective is to compare the differentially expressed genes across the two species.

Now, 100.000 transcripts sounds like a lot, so I was wondering 3 things:

1) is this normal with de novo assemblies and if so what is it due to?

2) since many of these transcripts aligned (with 70-90% similarity) to either intronic or intergenic regions of other species' genomes, would it make sense to only consider the sequences with coding potential (BLASTX/transdecoder results) for differential expression analyses?

3) would it otherwise make sense to get rid of all transcripts with low E-value (BLAST)? So those that were not successfully aligned to any database hit? I've seen papers consider only those with e-value <e-5 and others <e-30

4) in general, what would be the cutoff of E-value up to which I can trust the BLAST result?

I hope I explained myself clearly enough - I'm a bioinformatics/RNAseq newbie but I'd like to learn :) Thank you!!!

transcriptomics RNA-seq denovo trinity transdecodr • 1.0k views
ADD COMMENT
1
Entering edit mode
9 months ago

Several observations:

  • In general I find that around 35%-60% of a total RNA sample will map to coding exons, with the rest split between intronic and intergenic sequence.

  • The lastest assembly of the Zebrafish genome has ~60,000 transcripts. Its 35,000 genes, not 35,000 transcripts, but your trinity assembly is transcripts not genes.

  • 60,000 transcripts for 35,000 genes sounds like not enough to me. I would be surprised if there wasn't a larger number transcripts out there that haven't been found before. In particular, the number of non-coding transcripts in the genome (~7k) looks very low to me. For comparison, human has ~50k genes, of which only 20k are coding, and more like 250k transcripts.

  • Becareful comparing transcripts between genomes by DE. DE algorithmns assume that the transcripts are the same between samples. The most obvious point is that it is assumed that the transcripts are the same length. But also that sequence bias is the same between samples. Thus, I'd only use transcripts where there is a really close match between the two species, and that transcripts are full-length (i.e. not fragments), or only consider reads mapping the matched parts transcripts in both species.

ADD COMMENT
0
Entering edit mode

Thank you for the observations. I did not fully realize the difference between transcripts and genes until you pointed it out - but then I guess that means in the zebrafish analyses, the different transcript variants are all counted together. While in the de novo, they are counted separately. Making it harder to compare the DE then, did I get it right?

Regarding your point about DE algorithms, my plan was to compare the DEseq results between the two species, so for instance see if gene X that is DE in zebrafish is also DE in the other species, as well as compare the GO and KEGG enrichment analyses for both.

In case I only compare full length transcripts for both species, would I run a separate DEG analysis only considering these genes, or would I start from the DEG analysis done with all available transcripts, then only look at those that are full length?

Thank you!!!

ADD REPLY
1
Entering edit mode

Ah! Sorry, I misunderstood your experimental design - I thought you were directly comparing expression in species X to expression in zebrafish, not DE in species X to DE in Zebra fish. In that case it should be fine.

As regards the transcript/gene issue.... Two appraoches you might consider are clustering together the species X transcripts into genes. This would be easy for those that aligned to Zebrafish genes, and I guess you don't care about those that don't! Alterntaively you could quantify zebrafish transcripts rather than genes. You can use Salmon, Kalisto or RSEM to do this.

ADD REPLY
0
Entering edit mode

That makes sense, thanks!

I will look into clustering the transcripts together into genes, however many aligned to genes from other (more closely related probably) species. Thanks for the input

ADD REPLY
0
Entering edit mode

Hi, I went through the company's pipeline again and realized that they are actually using Corset to cluster the transcripts into genes after trinity assembly. This means that my 100.000 transcripts are actually (likely?) genes.

This should make DE comparison with zebrafish easier. But it also means that I have many more "genes" identified. Would this affect the DE analysis? Many of these genes remain unidentified. 30% are annotated in NR and 10% in Pfam, 80% in NT (but this includes similarities with genomic sequences of other species...)

Would it make sense to get rid of the genes that are not annotated, or that show no coding potential before running DE analysis? Or should I keep them anyways and disregard them afterwards?

ADD REPLY
0
Entering edit mode

Depends on your aims, but if you are interested in how similar the transcriptional responses are between Zebrafish and your species of interest, then there isn't much point in looking at genes that arn't present in Zebrafish.

ADD REPLY

Login before adding your answer.

Traffic: 850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6