Question

Will this de novo transcriptome assembly be useful in looking at RNAseq differential expression?

3.9 years ago

kristina.mahan ▴ 170

A company produced a de novo plant transcriptome assembly for me, with the stats shown below. Can it be used as the reference for RNA-seq differential expression analysis? I have CSV files with expression data, and I am wondering whether I should start analyzing them or whether I first need to improve the transcriptome assembly myself. There is no reference genome.

contigs: 1499698
smallest contig: 201
largest contig: 13777
n_bases: 787989217
mean_len: 525.43193
n_under_200: 0
n_over_1K: 177786
n_over_10k: 25
n_with_orf: 246891
mean_orf_percent: 65.70842
n90: 236 
n70: 354
n50: 738
n30: 1502
n10: 2862
gc: 0.44474
bases_n: 0
proportion_n: 0.0
score: NA
optimal_score: NA
cut_off: NA
weighted: NA

de-novo-assembly • 1.4k views

updated 21 months ago by Ram 45k • written 3.9 years ago by kristina.mahan ▴ 170

In addition to what @ponganta has said:

What kind of a plant is this?
Is this paired end data?
What was the assembler you used?
What kind of quality control was done on the data prior to assembly? Did you run FastQC on your raw data, for example?
Were any QC measures applied to the assembly before this TransRate report was generated?
Did TransRate throw you an error when you ran it?
Is the data set enriched for mRNAs?
You state this data is for differential expression analysis; is this assembly from one sample/replicate from the DE analysis, or is it a pooled assembly of all samples/replicates you have?

At a cursory glance, this looks like a slightly underwhelming assembly. Roughly 1.5 million contigs is far more than I've encountered in most cases, and the same goes for the ~247k putative protein-coding sequences. With a mean contig length of about 525 bp and an N50 of 738, my impression is that what's been assembled is quite fragmented.
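For anyone unsure how to read the n10 to n90 lines in the report, here is a minimal sketch of the usual Nx definition, which is what I assume TransRate reports. The function name `nx` and the toy contig lengths are made up for illustration; they are not taken from this assembly.

```python
def nx(lengths, x=50):
    """Return the Nx length: the contig length L such that contigs of
    length >= L together hold at least x% of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest contig down until x% of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy lengths (hypothetical, NOT the real assembly).
toy_lengths = [201, 250, 300, 400, 738, 1502, 2862]

if __name__ == "__main__":
    for x in (10, 50, 90):
        print(f"N{x}: {nx(toy_lengths, x)}")
```

A low N50 combined with a very high contig count, as in the report above, is what you would expect when many transcripts are split into short pieces.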

3.9 years ago by Dunois ★ 2.9k

score 2 · Answer 1 · 2021-07-31

It's hard to tell from these metrics alone. I'd suggest downloading the Oyster River Protocol and running the included version of TransRate with your raw reads and the assembled contigs. This will give you the TransRate assembly score, which measures how well the reads actually support the assembled sequences. To assess the biological completeness of the assembly, you could run BUSCO with, e.g., the embryophyta dataset.
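As a rough sketch of what those two runs could look like, the snippet below only builds the command lines and prints them. It assumes the stock `transrate` and `busco` command-line tools are on your PATH; the TransRate bundled with the Oyster River Protocol may be invoked differently. All file names, the thread count, and the exact lineage name `embryophyta_odb10` are assumptions, so check them against your installed versions.

```python
import shlex


def transrate_cmd(assembly, left, right, threads=8, output="transrate_out"):
    """Build a TransRate call that scores contigs against paired-end reads."""
    return [
        "transrate",
        "--assembly", assembly,
        "--left", left,
        "--right", right,
        "--threads", str(threads),
        "--output", output,
    ]


def busco_cmd(assembly, lineage="embryophyta_odb10", cpus=8, out_name="busco_out"):
    """Build a BUSCO call in transcriptome mode (lineage name is an assumption)."""
    return [
        "busco",
        "-i", assembly,
        "-l", lineage,
        "-m", "transcriptome",
        "-o", out_name,
        "-c", str(cpus),
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own contigs and reads.
    print(shlex.join(transrate_cmd("contigs.fasta", "reads_1.fq.gz", "reads_2.fq.gz")))
    print(shlex.join(busco_cmd("contigs.fasta")))
```

Printing the commands first makes it easy to inspect them before passing the same lists to `subprocess.run`.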

If your assembly has a rather low TransRate score (say < 0.15) and is missing a large share of BUSCOs, you may be able to get a better assembly yourself. What ploidy level does your organism have? What kind of data do you have (Illumina? stranded PE, PE, SE)? How much data do you have (millions of reads? hundreds of millions?)? All of these things matter when planning a new assembly.
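To make that rule of thumb concrete, here is a small sketch that reads the one-line BUSCO summary and applies the 0.15 TransRate cut-off mentioned above. The summary format is the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` line as I have seen BUSCO print it; the 30% missing threshold, the function names, and the example numbers are purely hypothetical.

```python
import re

# Pattern for the one-line BUSCO summary (format assumed, see lead-in).
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages (C, S, D, F, M) and the BUSCO count n as a dict."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line: " + line)
    result = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    result["n"] = int(match.group("n"))
    return result


def worth_reassembling(transrate_score, busco, min_score=0.15, max_missing_pct=30.0):
    """True if the score is low AND many BUSCOs are missing.

    min_score follows the 0.15 suggested in the answer; max_missing_pct
    is an arbitrary placeholder, not an established cut-off.
    """
    return transrate_score < min_score and busco["M"] > max_missing_pct


if __name__ == "__main__":
    # Made-up summary line, not a result from this assembly.
    example = "C:55.0%[S:40.0%,D:15.0%],F:10.0%,M:35.0%,n:1614"
    busco = parse_busco_summary(example)
    print(busco)
    print(worth_reassembling(0.10, busco))
```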

One thing though: the number of contigs looks slightly suspicious, especially since only about one in six of your contigs (246,891 of 1,499,698) seems to have an ORF. However, if the assembly is otherwise of good quality, this may be a non-problem if you are aiming for gene-level DE analysis, since you can aggregate counts to the gene level.
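A minimal sketch of both points, assuming you have per-contig counts and some contig-to-gene mapping (for example from the assembler's clustering). The contig IDs, the mapping, and the counts below are invented; in practice a tool such as tximport does the aggregation for you.

```python
from collections import defaultdict

# ORF fraction, using the two numbers from the report above.
n_contigs = 1499698
n_with_orf = 246891
orf_fraction = n_with_orf / n_contigs  # roughly 0.165


def aggregate_to_genes(transcript_counts, tx2gene):
    """Sum contig-level counts into gene-level counts.

    Contigs missing from tx2gene are kept under their own ID, so no
    reads are silently dropped.
    """
    gene_counts = defaultdict(float)
    for contig, count in transcript_counts.items():
        gene_counts[tx2gene.get(contig, contig)] += count
    return dict(gene_counts)


if __name__ == "__main__":
    # Hypothetical contig IDs and counts for one sample.
    counts = {"g1_i1": 10, "g1_i2": 5, "g2_i1": 7, "orphan_1": 3}
    tx2gene = {"g1_i1": "g1", "g1_i2": "g1", "g2_i1": "g2"}
    print(f"contigs with ORF: {orf_fraction:.1%}")
    print(aggregate_to_genes(counts, tx2gene))
```

Summing isoform-level counts per gene means that a fragmented or redundant set of contigs for one gene still contributes to a single row in the DE table.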

Edit: Thanks to the reformatting done by GenoMax, I now see that this is a TransRate report. Do you know why they did not include the read-based assessment? And which company did your assembly, if I may ask?