De novo transcriptome from PE reads with high multi-mapping rate (Bowtie2)
1
0
Entering edit mode
2.5 years ago
Dunois ★ 2.3k

So I have the results from mapping the PE reads back to the transcriptome that was assembled from them. The mapping was performed with Bowtie2 with the following parameters: --no-unal --threads 16 -q -k 20 (everything else at default values).

19048560 reads; of these:
19048560 (100.00%) were paired; of these:
742212 (3.90%) aligned concordantly 0 times
4705703 (24.70%) aligned concordantly exactly 1 time
13600645 (71.40%) aligned concordantly >1 times
----
742212 pairs aligned concordantly 0 times; of these:
36458 (4.91%) aligned discordantly 1 time
----
705754 pairs aligned 0 times concordantly or discordantly; of these:
1411508 mates make up the pairs; of these:
1022517 (72.44%) aligned 0 times
67036 (4.75%) aligned exactly 1 time
321955 (22.81%) aligned >1 times
97.32% overall alignment rate
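As a sanity check, Bowtie2's "overall alignment rate" counts individual mates, not pairs: it is the fraction of all mates that aligned at least once. The 97.32% above can be reproduced from the counts in the summary:

```python
# Reproduce bowtie2's "overall alignment rate" from the summary counts above.
# The rate is the fraction of individual mates (not pairs) aligned >= 1 time.

total_pairs = 19048560
unaligned_mates = 1022517        # mates that aligned 0 times

total_mates = 2 * total_pairs    # each pair contributes two mates
rate = 1 - unaligned_mates / total_mates
print(f"{rate:.2%}")             # -> 97.32%
```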


The report indicates that a majority of the read pairs aligned concordantly > 1 time(s). Should I be concerned about this? Is this a sign that the assembled transcriptome is quite redundant? (BUSCO scores indicate high duplication levels even after clustering.)

RNA-Seq read support Assembly bowtie2
2
Entering edit mode

Is this a sign that the assembled transcriptome is quite redundant?

Well, I would say yes. You can also spot this from the BUSCO result: a high rate of duplication!

It seems you have quite some sequence variability in your transcripts (which can also be explained biologically here: multiple wild individuals pooled), such that they don't get assembled (or clustered) together. However, when you look at the BUSCO result (which works at the protein level, important to remember), many of the proteins seem similar (potentially even identical). The true extent might be even larger than BUSCO's 40%, as BUSCO only reports the number of duplicated BUSCO genes and does not count how many times each one is actually duplicated.

0
Entering edit mode

@lieven.sterck so this would be alluding to the isoform issue that @h.mon has addressed in their answer, then? What do you make of the data I posted in my reply to that answer?

1
Entering edit mode

yes, indeed.

It is a bit of a grey zone, as I don't think they will all be genuine isoforms (you might also pick up allelic variation, from within an individual as well as between individuals), but technically they pose the same issue for your mapping result.

Well, that alignment result certainly makes more sense, but as you pointed out yourself, there are valid transcripts that have now been removed. Still, the overall result is much more reassuring than your initial one.

0
Entering edit mode

@lieven.sterck thank you for your feedback.

I suppose I'll just go ahead with the rest of the analysis. My current objective is to find a specific set of proteins, so the extra baggage shouldn't matter too much. But I presume this might become an issue when I'd like to annotate the transcriptome? (I've never done an annotation before.)

0
Entering edit mode

correct.

I was going to ask how crucial this 'assembly' is. It likely can be improved but that will require serious additional work, which might not be worth the effort, depending on the goal.

0
Entering edit mode

Well, it's very crucial in the sense that it is the first omics data from this organism. In that sense, my objective (apart from finding the genes/proteins of interest) is to deliver a high-quality annotation of the transcriptome if the situation permits.

Edit: off the top of your head, could you perhaps theorize what this additional work might entail?

0
Entering edit mode

Genomic sequencing with a single worm, if possible (or the minimum number you can get away with). Long-read sequencing (nanopore) for both the genome (you are only doing RNA-seq at the moment) and RNA, if possible. Keep in mind there is a reason they say a $1K genome needs $100K in annotation. If you are not a biologist, it would be useful to get one with domain knowledge of this organism involved in the annotation early.

1
Entering edit mode

Thank you @genomax. Hmm, so basically a high-quality annotation would involve obtaining the genome at some point? Getting additional expert help is probably going to be very difficult. I doubt I'll get any additional resources (other than computing power) for the annotation, if I choose to try and annotate the transcriptome.

1
Entering edit mode

Getting and annotating the genome will indeed give you the highest-quality annotation for this species, though this comes with serious additional work!! It is not a route you want to embark on unprepared, unfortunately.

For annotating the current transcriptome, and given that you do not have much experience in the annotation field, you could consider tools/resources such as Trapid (and/or similar ones); those will do a decent job but might not give you the highest quality possible (it remains a rather automated approach).

0
Entering edit mode

given you do not have much experience

I don't have any experience in this whatsoever. This will be my first time doing this. And I suppose I must clarify that by "high quality annotation" (paraphrased) I basically meant annotating what I have as well as I can.

I have not heard of Trapid before, thank you. That's web-only by the looks of it, and proprietary? Do you know of any alternatives that'd run as standalone?

1
Entering edit mode

Hi, any specific reason for asking to return (up to) 20 alignments per pair using -k 20? I would think that either of these combinations would be better: 1) -k 1 --best (a hard threshold to return only one alignment, and the best one at that); or 2) if you don't want to hard-threshold to a single alignment, -a -m 3 --best --strata (as in the Bowtie manual).

0
Entering edit mode

I had that on because of what I read here. I'll give the options you suggested a try. Does that really affect the mapping rate, though?

Edit: just realized you were referring to Bowtie and not Bowtie2?

1
Entering edit mode

Doing this will limit reads to only one alignment, ignoring the other locations where a read multi-maps. Keep that in mind. This can bias the alignment of reads toward certain locations (if one is not chosen randomly among all the possible locations where the read can align). bbmap.sh can control this behaviour for multi-mappers via the option:

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with
multiple top-scoring mapping locations).
best    (use the first best site)
toss    (consider unmapped)
random  (select one top-scoring site randomly)
all     (retain all top-scoring sites)
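The four ambiguous= policies above can be illustrated with a toy resolver. This is a hypothetical sketch of the decision logic only, not bbmap's actual code; the hit list and scores are made up.

```python
import random

def resolve_ambiguous(hits, policy, rng=random.Random(0)):
    """Pick alignment(s) for a multi-mapping read.

    hits: list of (location, score); policy mirrors bbmap's ambiguous= modes.
    Illustrative toy only, not bbmap's implementation.
    """
    best = max(score for _, score in hits)
    top = [loc for loc, score in hits if score == best]
    if policy == "best":
        return [top[0]]                       # first top-scoring site
    if policy == "toss":
        return [] if len(top) > 1 else top    # consider unmapped if ambiguous
    if policy == "random":
        return [rng.choice(top)]              # one top site, chosen randomly
    if policy == "all":
        return top                            # retain all top-scoring sites
    raise ValueError(policy)

hits = [("tx1", 40), ("tx2", 40), ("tx3", 30)]
print(resolve_ambiguous(hits, "all"))         # -> ['tx1', 'tx2']
```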

0
Entering edit mode

By "this" you mean implementing @Amitm's proposal?

0
Entering edit mode

Is this a sign that the assembled transcriptome is quite redundant?

The quality of the assembled transcriptome is going to depend on the data that went into making it. Did you assemble the transcriptome from just 20 M reads? What is the genome size supposed to be?

0
Entering edit mode

Hi genomax,

I had already sorted out rRNA with SortMeRNA before assembly. But I didn't check for contamination after the fact. Should I screen for rRNAs again?

0
Entering edit mode

I saw that you had de novo assembled the data after I originally made the rRNA comment, and I have since edited that comment. If you are sure the rRNA is gone, can you address the questions above? Have you done a CD-HIT analysis to remove redundancy from your assembly?

0
Entering edit mode

Sorry for the late follow up (I hit the 6-hour post limit).

• Yes, the transcriptome was assembled from only 20 M reads. I don't know how big the genome is supposed to be: the organism has no sequenced genome available (nor a transcriptome, for that matter).
• I can't guarantee that the rRNA is gone. I guess there's still some weird mitochondrial stuff hanging in there? (I caught some sequences of that sort from blasting random sequences from the assembly.)
• I've clustered with MMseqs2. I ran two consecutive clustering steps in which the program clustered away all sequences that were 100% identical irrespective of length, leaving each cluster represented by its longest sequence. I stopped after two clustering steps because the resulting transcriptome's size (in number of sequences) had more or less plateaued.
1
Entering edit mode

So you could still have transcripts that are completely contained inside a longer transcript. How many transcripts do you currently have? There must be a related species available in GenBank by now, so you should have some idea of what the genome size should be.
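The containment idea can be sketched in a few lines. This is a hypothetical, quadratic toy using exact-substring containment only; real tools such as CD-HIT-EST use k-mer indexing and alignment-based coverage thresholds to do this at scale and tolerate mismatches.

```python
def drop_contained(seqs):
    """Remove sequences that are exact substrings of a longer sequence.

    seqs: dict of {id: sequence}. Toy version of the containment removal
    that CD-HIT-EST with -c 1.0 -aS 1.0 approximates on real data.
    """
    ordered = sorted(seqs, key=lambda k: len(seqs[k]), reverse=True)
    kept = []
    for sid in ordered:
        if any(seqs[sid] in seqs[k] for k in kept):
            continue  # fully contained in an already-kept transcript
        kept.append(sid)
    return {k: seqs[k] for k in kept}

demo = {"t1": "ATGCCGTA", "t2": "GCCG", "t3": "TTTT"}
print(sorted(drop_contained(demo)))   # -> ['t1', 't3']  (t2 is inside t1)
```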

0
Entering edit mode

So you could still have transcripts that are completely contained inside a longer transcript.

Hmm, so you suspect there are still (fragmented) duplicates even after two rounds of clustering with the longest sequence taken as the representative? Trinity originally produced 151282 transcripts. I had 150765 transcripts after the first round of clustering, and 150760 after the second round. Edit: I chose to cluster at 100% identity as I didn't want to accidentally cluster away potential isoforms.

I've checked related species with available genomes. The closest one I could see was in the same class, but that was it. Would that actually suffice?

0
Entering edit mode

150K transcripts would mean your genome could be similar in size to the human genome. Is that correct? If not, I suggest you try CD-HIT-EST and see what it does to your dataset.

0
Entering edit mode

As you've suggested, I clustered with CD-HIT-EST, and that yielded 149977 transcripts. I called CD-HIT-EST with the following parameters: -c 1.0 -aS 1.0 -uL 0 -uS 1.0 -aL 0.0 -r. That's definitely more sequences clustered away, but it's still probably not going to account for the ca. 75% multi-mapping rate, right?

And hmm I have no clue how big its genome could be: all I know is that it is a cnidarian polychaete.

0
Entering edit mode

polychaete-hosting cnidarians

Does that mean there are two genomes in there, or just one? Looks like your initial clustering was OK.

I am now thinking aloud, so consider the following as such. Perhaps your raw data is skewed in some way to begin with and does not represent the entire genome? Have you tried BUSCO (link) to see how complete your transcriptome is?

0
Entering edit mode

"Polychaete-hosting cnidarians"? No, it's just a polychaete (where did you even get that phrase from?). The RNA-seq is of--at least in theory--a single species. What happened was a bunch of these poor things were caught in the wild, identified anatomically, sacrificed, then pooled and sequenced as a single sample. I guess that's the gist of it anyway, I wasn't involved in the fieldwork or the sequencing itself.

As I mentioned in the OP, I did run BUSCO on this (with the eukaryota reference gene set). Here are the values: C:82.7%[S:42.7%,D:40.0%],F:9.8%,M:7.5%,n:255. It seems quite complete? This does not, of course, negate your hypothesis that the entirety of the mRNA space was perhaps not captured. (Is this what you meant by "not represent the entire genome"?)
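For the record, that one-line BUSCO summary can be split into named fields with a small regex, which makes it easy to track single-copy (S) versus duplicated (D) fractions across clustering rounds. A sketch:

```python
import re

# Parse the one-line BUSCO summary quoted above into named numeric fields.
summary = "C:82.7%[S:42.7%,D:40.0%],F:9.8%,M:7.5%,n:255"

fields = dict(re.findall(r"([CSDFMn]):([\d.]+)%?", summary))
scores = {k: float(v) for k, v in fields.items()}
print(scores["D"])   # -> 40.0 (duplicated BUSCOs, the worrying number here)
```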

Could it be possible that the multi-mapping is because the transcriptome here is actually two or more closely related organisms? I mean, the individuals that contributed the RNA were identified on the basis of morphology, so it is possible that somebody mixed up closely related species?

1
Entering edit mode

When I searched for "cnidarian polychaete", all the Google hits seemed to indicate that these two organisms occur together, with polychaetes being symbionts of cnidarians.

I am puzzled by the multi-mapping, so that was one wild thought: perhaps the entire genome is not captured. If you had individuals from several related species, there may indeed be more than one genome represented in your set. Perhaps that is the reason for the multi-mapping.

All good questions to consider.

0
Entering edit mode

I went back and checked. Annelids. Annelids. Not cnidarians. Polychaetes are annelids. I am terribly sorry for confusing you there. I somehow had it stuck in my mind that the polychaetes are cnidarians for a while now. I apologize.

Hmm but even with partial genome capture, two different state-of-the-art clustering tools have indicated that most of the assembly is not redundant. Maybe it's Bowtie2 that's messing up? Does bbmap produce similar output? (I've never used it before, but I am looking at its in-built help section now.) I could try running bbmap and look at what that says?

If we were to chase the mix-of-species hypothesis on the side, how would I even test this?

(Thank you for all the valuable discussion here and elsewhere, by the way.)

Edit: if the data is a mix from more than one species, would it show up as an observable/notable effect in one of the FastQC plots?

0
Entering edit mode

if the data is a mix from more than one species, would it show up as an observable/notable effect in one of the FastQC plots?

If they had appreciably different GC content, then yes: you should see two peaks (or a prominent second hump).

I don't think bowtie2 is messing up; the reads must genuinely be multi-mapping. You will likely see a similar result with bbmap. What bbmap additionally allows you to do is choose what happens to reads that multi-map. See the ambiguous= option. If your aim is differential expression analysis, using bbmap may be an option.

Since you have a transcriptome, you could also use salmon, which uses a different way of assigning reads to generate counts against transcripts (LINK).

Edit: @Rob Patro (author of Salmon) suggested https://github.com/Oshlack/Corset/wiki .

0
Entering edit mode

Here's a link to the sequence GC content plot from FastQC. And here's a link to the per base sequence content plot. There's a bit of a hump in the GC plot but it doesn't look bimodal?

I have already quantified with salmon (and I had a couple of exchanges with @Rob regarding salmon outputs, see here and here). Salmon indicates a similar overall mapping rate for this assembly as Bowtie2 did: 96.83%. But I couldn't find detailed mapping statistics for salmon like the ones from Bowtie2, so I didn't follow up on the salmon values for this purpose (I ended up using them for something else).

And thank you for pointing me to Corset @genomax (and @Rob indirectly). I guess I could use Corset to get an estimate of the sample's gene content then? I suppose this is what @Rob had in mind when they suggested this?

1
Entering edit mode
2.5 years ago
h.mon 34k

Trinity assemblies may produce a lot of duplicates, and it doesn't necessarily mean there is something wrong with your data or with your assembly (see, for example, the supplementary materials from De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers).

The Trinity wiki has some suggestions to deal with this, depending on the downstream analyses one wishes to perform, see There are too many transcripts! What do I do?

two different state-of-the-art clustering tools have indicated that most of the assembly is not redundant

I don't know how MMseqs2 clustering would work for a transcriptome assembly, but CD-HIT won't work well for isoforms containing different exons (rather, it does work well, but it doesn't do what you are expecting), and the results you observed are what I would expect.

0
Entering edit mode

So you're suggesting the high multi-mapping rate is just the isoforms + Trinity variants? That could very well be the case. I suppose I'd be able to rule out whether it's just the isoforms by checking the mapping rates for the "longest isoforms only" subset of transcripts?

I counted the number of single-isoform transcripts and multi-isoform transcripts (as Trinity labels them), and it appears that for the set of 151282 transcripts there are 77391 single-isoform transcripts, and 20870 multi-isoform ones.
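For reference, Trinity's ID scheme encodes the gene as everything before the trailing _iN isoform suffix, so a longest-isoform-per-gene subset can be derived directly from the headers. A minimal sketch with made-up lengths:

```python
def longest_isoforms(lengths):
    """lengths: {trinity_id: sequence_length} -> {gene: longest isoform id}.

    The gene ID is the Trinity accession with the "_iN" suffix stripped.
    """
    best = {}
    for tid, n in lengths.items():
        gene = tid.rsplit("_i", 1)[0]   # e.g. TRINITY_DN1_c0_g1
        if gene not in best or n > lengths[best[gene]]:
            best[gene] = tid
    return best

demo = {
    "TRINITY_DN1_c0_g1_i1": 500,
    "TRINITY_DN1_c0_g1_i2": 750,   # longest isoform of this gene
    "TRINITY_DN2_c0_g1_i1": 300,
}
print(sorted(longest_isoforms(demo).values()))
```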

I had Bowtie2 map against the longest-isoforms-only subset, and this is what I got:

19048560 reads; of these:
19048560 (100.00%) were paired; of these:
1775828 (9.32%) aligned concordantly 0 times
15247210 (80.04%) aligned concordantly exactly 1 time
2025522 (10.63%) aligned concordantly >1 times
----
1775828 pairs aligned concordantly 0 times; of these:
142366 (8.02%) aligned discordantly 1 time
----
1633462 pairs aligned 0 times concordantly or discordantly; of these:
3266924 mates make up the pairs; of these:
3025733 (92.62%) aligned 0 times
181348 (5.55%) aligned exactly 1 time
59843 (1.83%) aligned >1 times
92.06% overall alignment rate


Looks like you're right? But it also appears that while the multi-mapping rate has plummeted, the proportion of completely unmapped reads has shot up? Here are the original values (copied from the OP) for comparison:

19048560 reads; of these:
19048560 (100.00%) were paired; of these:
742212 (3.90%) aligned concordantly 0 times
4705703 (24.70%) aligned concordantly exactly 1 time
13600645 (71.40%) aligned concordantly >1 times
----
742212 pairs aligned concordantly 0 times; of these:
36458 (4.91%) aligned discordantly 1 time
----
705754 pairs aligned 0 times concordantly or discordantly; of these:
1411508 mates make up the pairs; of these:
1022517 (72.44%) aligned 0 times
67036 (4.75%) aligned exactly 1 time
321955 (22.81%) aligned >1 times
97.32% overall alignment rate


Is that just all those reads that were representing the now missing isoforms?
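Rough arithmetic on the two summaries puts numbers on that trade-off: about a million pairs became unmapped, against roughly 11.6 million pairs that stopped multi-mapping (most of which presumably collapsed to unique alignments).

```python
# Compare the two bowtie2 runs quoted above: how much of the drop in
# multi-mapping reappears as newly unmapped reads?

full = {"multi_pairs": 13600645, "unmapped_mates": 1022517}
longest = {"multi_pairs": 2025522, "unmapped_mates": 3025733}

lost_multi_pairs = full["multi_pairs"] - longest["multi_pairs"]
new_unmapped_pairs = (longest["unmapped_mates"] - full["unmapped_mates"]) // 2

print(lost_multi_pairs, new_unmapped_pairs)   # -> 11575123 1001608
```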

Edit: also, just as a side note, MMseqs2 can cluster in a similar fashion to CD-HIT if the right submodules and parameters are used. It just cannot do global alignment.

0
Entering edit mode

The longest isoform isn't necessarily the "best" isoform; this point is made in the Trinity wiki and repeatedly on the Trinity mailing lists. For example, the longest isoform could just be a transcript retaining an intron rather than the most expressed transcript, or it may lack the "most expressed" exon, so selecting only the longest isoform may decrease the mapping rate.

So you could select the most expressed transcripts, or you could map to the SuperTranscripts, to assess whether those subsets substantially decrease the multi-mapping rate without increasing the unmapped rate too much. Then, to be thorough, you could also assess the quality of the selected subset with BUSCO, as an independent metric to check that those subsets are not discarding too many "important" genes.
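Selecting the most expressed transcript per gene follows the same group-by-gene, keep-the-maximum pattern as the longest-isoform subset, just keyed on expression instead of length. A sketch assuming a salmon-style TPM lookup and a Trinity-style gene-to-transcript map; all names and values here are made up:

```python
def top_isoform_per_gene(gene_of, tpm):
    """gene_of: {transcript: gene}; tpm: {transcript: TPM}.

    Returns the sorted list of the highest-TPM transcript per gene.
    """
    best = {}
    for tx, gene in gene_of.items():
        if gene not in best or tpm[tx] > tpm[best[gene]]:
            best[gene] = tx
    return sorted(best.values())

gene_of = {"t1_i1": "g1", "t1_i2": "g1", "t2_i1": "g2"}
tpm = {"t1_i1": 1.5, "t1_i2": 88.0, "t2_i1": 12.0}
print(top_isoform_per_gene(gene_of, tpm))   # -> ['t1_i2', 't2_i1']
```

In practice the two inputs would come from salmon's quant.sf and Trinity's gene_trans_map file.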

0
Entering edit mode

The longest isoform isn't necessarily the "best" isoform; this point is made in the Trinity wiki and repeatedly on the Trinity mailing lists. For example, the longest isoform could just be a transcript retaining an intron rather than the most expressed transcript, or it may lack the "most expressed" exon, so selecting only the longest isoform may decrease the mapping rate.

This goes without saying I suppose. I only used the longest isoforms as a convenient representative set to see if the high multimapping rate was indeed caused by the multitude of isoforms themselves.

I'll look into both solutions you suggested. Is it correct that one shouldn't put too much stock in the transcript-level expression metrics? (Which would make the SuperTranscripts route the safer path to quantification.)

1
Entering edit mode

Is it correct that one mustn't put too much stock into the transcript-level expression metrics?

In this case, yes: the transcriptome is new and not well characterized, and it also does not have the support of a genomic sequence.

0
Entering edit mode

Although it is always nice (and recommended) to explore the data and check if everything seems to be in the right place, personally, I think your transcriptome seems fine, and I would be happy to continue with downstream analyses.

Read the Trinity wiki; there are wrapper scripts and suggestions for downstream analyses covering a wide variety of cases. Their recommended use for SuperTranscripts is variant calling. Trinity outputs a gene-to-transcript mapping, so it is easy to perform both gene- and transcript-level analyses; there are wrapper scripts for these, too.

Transcript-level differential expression depends more on the number of biological replicates and the sequencing depth per sample: one needs more samples and deeper sequencing to have good power at the transcript level. I have never dealt with such deeply sequenced datasets, so I have only performed gene-level analyses.