RNA-seq de novo assembly alignment quality assessment
4 months ago

Hi,

I have an RNA-seq dataset from a non-model organism and need to do de novo assembly. After running Trinity, I checked the quality of the assembled transcriptome by aligning the reads back to it with both bowtie and bowtie2. The problem is that the "aligned concordantly exactly 1 time" percentage is very low. For instance, from bowtie2:

40191191 reads; of these:
  40191191 (100.00%) were paired; of these:
    1107263 (2.75%) aligned concordantly 0 times
    811254 (2.02%) aligned concordantly exactly 1 time
    38272674 (95.23%) aligned concordantly >1 times
    ----
    1107263 pairs aligned concordantly 0 times; of these:
      3136 (0.28%) aligned discordantly 1 time
    ----
    1104127 pairs aligned 0 times concordantly or discordantly; of these:
      2208254 mates make up the pairs; of these:
        585954 (26.53%) aligned 0 times
        74874 (3.39%) aligned exactly 1 time
        1547426 (70.07%) aligned >1 times
99.27% overall alignment rate
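(For reference, that multimapping percentage can be pulled straight out of the saved bowtie2 log; a minimal awk sketch, with the summary inlined as a here-doc for illustration:)

```shell
# Extract the "aligned concordantly >1 times" percentage from a bowtie2
# summary (normally saved via: bowtie2 ... 2> align.log; inlined here).
multi_pct=$(awk '/aligned concordantly >1 times/ {gsub(/[()%]/, "", $2); print $2}' <<'EOF'
40191191 reads; of these:
  40191191 (100.00%) were paired; of these:
    1107263 (2.75%) aligned concordantly 0 times
    811254 (2.02%) aligned concordantly exactly 1 time
    38272674 (95.23%) aligned concordantly >1 times
EOF
)
echo "$multi_pct"   # 95.23
```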

The same trend was observed when using BUSCO to check the quality of the assembly:

C:92.4%[S:5.8%,D:86.6%],F:1.3%,M:6.3%,n:4596

I used BBduk and Trimmomatic to remove adapters and low-quality reads. For the cleaned reads, FastQC confirmed removal of adapters and good per-base quality, but the reads failed the “Sequence Duplication Levels” module. I assumed this is very common with RNA-seq data (?).

Now, I wanted to know if someone can suggest what improvements can be made in this case. Can I continue with this assembly? My ultimate plan is to run Corset to remove the redundancy, use RSEM to quantify expression levels, and then perform DE analysis. Thanks for the help in advance!
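For reference, the Corset + RSEM route I have in mind would look roughly like this (a sketch with placeholder sample/file names, not a tested pipeline):

```shell
# Sketch of the planned pipeline (placeholder names; adjust paths/threads).
# 1) Multimapping bowtie2 alignments for Corset (--all reports all hits):
bowtie2 --all -x trinity_idx -1 sample1_R1.fq.gz -2 sample1_R2.fq.gz \
    | samtools view -b - > sample1.bam

# 2) Cluster contigs with Corset (-g = group/condition per BAM, -n = names):
corset -g 1,1,2,2 -n ctrlA,ctrlB,trtA,trtB \
    sample1.bam sample2.bam sample3.bam sample4.bam

# 3) Quantify with RSEM via Trinity's helper script:
$TRINITY_HOME/util/align_and_estimate_abundance.pl \
    --transcripts Trinity.fasta --seqType fq \
    --left sample1_R1.fq.gz --right sample1_R2.fq.gz \
    --est_method RSEM --aln_method bowtie2 \
    --prep_reference --output_dir rsem_sample1
```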

trinity rnaseq denovo assembly

but failed “Sequence Duplication Levels”. I assumed this is very common with RNA-seq data (?).

Yes, that is very common for RNA-seq data. However, a possible cause of that "38272674 (95.23%) aligned concordantly >1 times" is the presence of rRNA reads in your libraries. Long story short, rRNA depletion failed during library preparation.


I tried SortMeRNA to filter out the rRNA reads, but for each sample it detected less than 1% (mostly 0.5-0.7%) of the reads as rRNA. I assumed the rRNA reads would have been removed when I ran BBduk at the beginning of my analysis. Should I still be suspicious about rRNA reads?


No worries, less than 1% is fine. rRNA is not the problem.

4 months ago
ponganta ▴ 490

I assume you are interested in gene-level DGE.

What kind of genome does the organism have? You may have homeologs here, which may have resulted in many duplicated sequences. This is indicated by your BUSCO result. That's not bad per se, but you need to choose your quantification tools carefully. If you want to infer DGE, go the gene-level route (annotation/clustering + tximport). Also, do not use bowtie2 for quantification. It may be tempting, but Salmon/kallisto are capable of statistically partitioning read counts over similar transcripts and have a lower computational cost. I'd be happy if you could post your way forward once you have decided!
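The Salmon route can be sketched like this (placeholder file names; `--geneMap` points Salmon at Trinity's gene_trans_map so it also emits gene-level counts):

```shell
# Index the Trinity assembly, then quantify one sample (sketch; adjust names).
salmon index -t Trinity.fasta -i salmon_idx
salmon quant -i salmon_idx -l A \
    -1 sample1_R1.fq.gz -2 sample1_R2.fq.gz \
    --validateMappings -p 8 \
    --geneMap Trinity.fasta.gene_trans_map \
    -o quant/sample1
# quant/sample1/quant.sf       -> transcript-level estimates
# quant/sample1/quant.genes.sf -> gene-level estimates (for DGE)
```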

Best of luck.

Edit: elaborated a bit more on quasi-mappers.

Edit2: sequence duplication levels are nothing to go by in RNA-Seq. I'd completely ignore them. Use Transrate to evaluate your assembly. For Transrate, you can download the Oyster River Protocol and use the packaged version (Transrate-ORP).

Edit3: If you get a low Transrate score (< 0.15), try other assemblers and compare.
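A Transrate run, for reference (a sketch with placeholder names; assumes the same reads used for assembly):

```shell
# Evaluate the assembly with Transrate (placeholder file names).
transrate --assembly Trinity.fasta \
    --left sample1_R1.fq.gz --right sample1_R2.fq.gz \
    --threads 8 --output transrate_out
# The assembly score is reported in transrate_out/assemblies.csv
```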


Never thought about homeologs as the cause of all that duplication, interesting to know.

OP, go with this suggestion here. Salmon + aggregating the produced read counts to the Trinity "gene" level would be one way forward. Salmon + a decoy transcriptome/genome (refer to their documentation) should be especially accurate, I believe.
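The decoy recipe from the Salmon docs boils down to concatenating the genome after the transcripts and listing the genome sequence names; a runnable sketch on tiny stand-in FASTAs (in practice use your Trinity assembly and the real genome):

```shell
# Tiny stand-in FASTAs; placeholders for Trinity.fasta + the genome.
printf '>tx1\nACGT\n>tx2\nGGCC\n' > transcripts.fa
printf '>chr1 some description\nAAAA\n>chr2\nTTTT\n' > genome.fa

# decoys.txt = one genome sequence name per line (description stripped)
grep '^>' genome.fa | sed 's/^>//; s/ .*//' > decoys.txt

# "gentrome" = transcripts first, then the genome appended
cat transcripts.fa genome.fa > gentrome.fa

# Then (not run here): salmon index -t gentrome.fa -d decoys.txt -i salmon_idx
cat decoys.txt
```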

fastqc confirmed removal of adapters and per base quality but failed “Sequence Duplication Levels”. <- RNAseq data will almost always fail on this fastqc module.

I'd also check for rRNA as was suggested by @andres.firrincieli by either eliminating them at the read level (e.g., with SortMeRNA) and reassembling the data, or by aligning the contigs against an rRNA database (e.g., using MMseqs2/Diamond; these are faster alternatives to blast).


Thanks for suggesting SortMeRNA; I have not used it before. So, can I use it on the output from BBduk+Trimmomatic?


Yeah, it's input-agnostic: all that matters is that the inputs are reads (or pairs thereof). Also, if you suspect contamination, I would recommend running Kraken2 after SortMeRNA. Just map everything against the PlusPFP database that the developers provide, and retain only those reads that don't map to anything there.
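A sketch of that two-step filtering (SortMeRNA, then Kraken2 on what survives; database paths and file names are placeholders):

```shell
# 1) Remove rRNA reads (SortMeRNA v4 syntax; the default DB ships with it):
sortmerna --ref smr_v4.3_default_db.fasta \
    --reads sample1_R1.fq.gz --reads sample1_R2.fq.gz \
    --paired_in --fastx --out2 \
    --aligned sample1_rrna --other sample1_clean \
    --workdir smr_wd

# 2) Screen the survivors against Kraken2's PlusPFP database and keep
#    only the unclassified read pairs ('#' is replaced by 1/2):
kraken2 --db pluspfp_db --paired \
    --unclassified-out sample1_unclass#.fq \
    --report sample1.k2report \
    sample1_clean_fwd.fq.gz sample1_clean_rev.fq.gz > sample1.k2out
```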


Thanks for the response! Yes, I aim to run gene-level DE analysis. It is a plant genome, and I am not sure what to expect regarding homeologs in plants. Also, I did not plan to do the quantification with bowtie2; I used it as a temporary aligner required by the clustering method I chose (https://github.com/Oshlack/Corset/wiki/Example) to cluster the contigs and eliminate redundancy. Next, I wanted to use e.g. RSEM to remap the reads to the clustered transcriptome and quantify expression levels. I had the impression that Salmon is fast but bowtie2 is more accurate (?). As suggested by you and Dunois, I will now try Salmon, but I am not sure whether I would still need Corset for clustering in that case. Also, I expected BUSCO and Transrate to give similar information, but I will try Transrate now.


I would not cluster at all, and retain all isoforms (~ contigs). The idea is to "accumulate" the isoform level counts back up to the gene level, and then use those for downstream analysis. I suppose you'd lose some reads that would have gotten aggregated under a particular Trinity "gene" otherwise if you were to cluster them away prematurely. If it is duplicate sequences you are worried about, Salmon takes care of these automatically.
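For completeness, the isoform-to-gene aggregation itself is just a sum over Trinity's gene_trans_map; a runnable sketch on tiny fake data (tximport in R, or Salmon's `--geneMap`, does this properly; note that a real quant.sf keeps NumReads in column 5, not column 2 as in this toy table):

```shell
# Sum transcript-level counts up to Trinity "genes" using the
# gene_trans_map that Trinity writes next to the assembly (fake data here).
printf 'GENE1\tGENE1_i1\nGENE1\tGENE1_i2\nGENE2\tGENE2_i1\n' > gene_trans_map
printf 'Name\tNumReads\nGENE1_i1\t10\nGENE1_i2\t5\nGENE2_i1\t7\n' > quant.tsv

# First file: transcript -> gene map; second file: add counts per gene.
awk 'NR==FNR {g[$2]=$1; next} FNR>1 {sum[g[$1]] += $2}
     END {for (k in sum) print k "\t" sum[k]}' gene_trans_map quant.tsv \
    | sort > gene_counts.tsv

cat gene_counts.tsv
```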
