So I have the results from mapping the PE reads back to the transcriptome that was assembled from those reads. The mapping was performed with Bowtie2 with the following parameters (everything else at default values):
`--no-unal --threads 16 -q -k 20`
```
19048560 reads; of these:
  19048560 (100.00%) were paired; of these:
    742212 (3.90%) aligned concordantly 0 times
    4705703 (24.70%) aligned concordantly exactly 1 time
    13600645 (71.40%) aligned concordantly >1 times
    ----
    742212 pairs aligned concordantly 0 times; of these:
      36458 (4.91%) aligned discordantly 1 time
    ----
    705754 pairs aligned 0 times concordantly or discordantly; of these:
      1411508 mates make up the pairs; of these:
        1022517 (72.44%) aligned 0 times
        67036 (4.75%) aligned exactly 1 time
        321955 (22.81%) aligned >1 times
97.32% overall alignment rate
```
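For my own bookkeeping, here is how the headline percentages fall out of the raw counts (a minimal Python sketch; all numbers are copied from the report above):

```python
# Recompute the headline figures from the bowtie2 summary counts.
total_pairs = 19048560
concordant_multi = 13600645   # pairs aligned concordantly >1 times
unaligned_mates = 1022517     # mates that aligned 0 times
total_mates = 2 * total_pairs

multi_rate = concordant_multi / total_pairs
overall_rate = 1 - unaligned_mates / total_mates
print(f"{multi_rate:.2%} concordant multi-mapping")   # -> 71.40%
print(f"{overall_rate:.2%} overall alignment rate")   # -> 97.32%
```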
The report indicates that a majority of the read pairs aligned concordantly > 1 time(s). Should I be concerned about this? Is this a sign that the assembled transcriptome is quite redundant? (BUSCO scores indicate high duplication levels even after clustering.)
Well, I would say yes. You can also spot this from the BUSCO result: the high rate of duplication!
It seems you have quite a lot of sequence variability in your transcripts (which can also be explained biologically here: multiple wild individuals were pooled), such that they don't get assembled (or clustered) together. However, when you look at the BUSCO result (which works at the protein level, important to remember), many of the proteins seem similar (even potentially identical). The number might be, or will be, even larger than the 40% reported by BUSCO, as it only reports the number of duplicated BUSCO genes and does not count how many times each is actually duplicated.
@lieven.sterck so this would be alluding to the isoform issue that @h.mon addressed in their answer, then? What do you make of the data I posted in my reply to that answer?
It is a bit of a grey zone, as I don't think they will all be genuine isoforms (you might also pick up allelic variation, from within an individual as well as between individuals), but technically they pose the same issue for your mapping result.
Well, that alignment result certainly makes more sense, but as you pointed out yourself, there are valid transcripts that now have been removed. Still the overall result is much more reassuring than your initial one.
@lieven.sterck thank you for your feedback.
I suppose I'll just go ahead with the rest of the analysis. My current objective is to find a specific set of proteins, so the extra baggage shouldn't matter too much. But I presume this might become an issue when I'd like to annotate the transcriptome? (I've never done an annotation before.)
I was going to ask how crucial this 'assembly' is. It likely can be improved but that will require serious additional work, which might not be worth the effort, depending on the goal.
Well, it's very crucial in the sense that it is the first omics data from this organism. In that sense, my objective (apart from finding the genes/proteins of interest) is to deliver a high-quality annotation of the transcriptome, if the situation permits.
Edit: off the top of your head, could you perhaps theorize what this additional work might entail?
Genomic sequencing with a single worm, if possible (or the minimum number you can get away with). Long-read sequencing (nanopore) for both the genome (you are only doing RNA-seq at the moment) and RNA, if possible. Keep in mind there is a reason they say a $1K genome needs $100K in annotation. If you are not a biologist, it would be useful to get one with domain knowledge of this organism involved in the annotation early.
Thank you @genomax. Hmm so basically a high quality annotation would involve having to obtain the genome at some point? Getting additional expert help is probably going to be very difficult. I doubt I'll get any additional resources (other than computing power) for the annotation, if I choose to try and annotate the transcriptome.
Getting and annotating the genome will indeed give you the highest-quality annotation for this species, though it comes with serious additional work!! It is not a route you want to embark on unprepared, unfortunately.
For annotating the current transcriptome, and given that you do not have much experience in the annotation field, you could consider tools/resources such as Trapid (and/or others like it); those will do a decent job but might not give you the highest quality possible (it remains a fairly automated approach).
I don't have any experience in this whatsoever. This will be my first time doing this. And I suppose I must clarify that by "high quality annotation" (paraphrased) I basically meant annotating what I have as well as I can.
I have not heard of Trapid before, thank you. That's web-only by the looks of it, and proprietary? Do you know of any alternatives that'd run as standalone?
Hi, any specific reason for asking to return (up to) 20 alignments per pair using `-k 20`? I would think that using either of these combinations would be better:
1) `-k 1 --best` (hard thresholding to return only 1 alignment, and the best one at that), or
2) if not wanting to hard-threshold to return only 1 alignment, `-a -m 3 --best --strata` (as in the Bowtie manual).
I had that on because of what I read here. I'll give the options you suggested a try. Does that really affect the mapping rate though?
Edit: just realized you were referring to Bowtie and not Bowtie2?
Doing this will limit reads to only one alignment, ignoring other locations where the read multi-maps. Keep that in mind. This can bias alignment of reads to certain locations (if one is not chosen randomly among all possible locations where the read can align).
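To illustrate the point about random placement (a toy sketch, not how any actual aligner is implemented), the idea is simply to pick among equally good hits at random rather than always reporting the first:

```python
import random

# Toy illustration: when a read has several equally good candidate
# alignments, choosing one at random avoids systematically favouring
# whichever location the aligner happens to report first.
candidates = ["transcript_A", "transcript_B", "transcript_C"]  # hypothetical equal-score hits
chosen = random.choice(candidates)
print(chosen)
```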
`bbmap.sh` can do this for multi-mappers via an option.
By "this" you mean implementing @Amitm's proposal?
Quality of transcriptome assembled is going to be dependent on the data that went into making it. Did you assemble the transcriptome from just 20 M reads? What is the genome size supposed to be?
I had already sorted out rRNA with `SortMeRNA` before assembly. But I didn't check for contamination after the fact. Should I screen for rRNAs again?
I saw that you had de novo assembled the data after I originally made the rRNA comment. I have edited the comment above since. If you are sure the rRNA is gone, can you address the questions above? Have you done a `CD-HIT` analysis to remove redundancy from your assembly?
Sorry for the late follow up (I hit the 6-hour post limit).
So, to answer your questions:
blasting random sequences from the assembly.)
So you could still have transcripts that are completely contained inside a longer transcript. How many transcripts do you currently have? There has to be a related species available in GenBank by now so you should have some idea of what the genome size should be.
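As a toy illustration of what "contained" means here (exact substrings only; real tools like CD-HIT-EST handle this at scale and allow mismatches; the sequences below are made up):

```python
# Naive check for transcripts fully contained inside a longer one.
transcripts = {
    "t1": "ATGGCCATTGTAATGGGCCGC",  # made-up sequence
    "t2": "GCCATTGTA",              # an exact fragment of t1
    "t3": "TTTTTTTTTT",
}
contained = [
    name for name, seq in transcripts.items()
    for other, longer in transcripts.items()
    if name != other and len(seq) < len(longer) and seq in longer
]
print(contained)  # -> ['t2']
```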
Hmm, so you suspect there are still (fragmented) duplicates even after two rounds of clustering, taking the longest sequence as the representative? Trinity originally produced `151282` transcripts. I had `150765` transcripts after the first round of clustering, and `150760` transcripts after the second round. Edit: I chose to cluster at 100% identity, as I didn't want to cluster away potential isoforms by accident.
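For what it's worth, the two rounds barely touched the assembly; the arithmetic (numbers from the counts above):

```python
# How much did the two clustering rounds actually remove?
trinity = 151282        # transcripts from Trinity
after_round2 = 150760   # transcripts after the second clustering round

removed = trinity - after_round2
print(removed, f"({removed / trinity:.2%} of the assembly)")  # -> 522 (0.35% of the assembly)
```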
I've checked related species with available genomes. The closest one I could see was in the same class, but that was it. Would that actually suffice?
150K transcripts would mean your genome could be similar in size to the human genome. Is that correct? If not, I suggest you try CD-HIT-EST and see what that does to your dataset.
As you've suggested, I clustered with `CD-HIT-EST`, and that yielded `149977` transcripts. I called `CD-HIT-EST` with the following parameters: `-c 1.0 -aS 1.0 -uL 0 -uS 1.0 -aL 0.0 -r`. That's definitely more sequences clustered away, but it's still probably not going to account for the ca. 75% multi-mapping rate, right?
And hmm I have no clue how big its genome could be: all I know is that it is a cnidarian polychaete.
Does it mean there are two genomes in there or just one? Looks like your initial clustering was ok.
I am now thinking aloud, so consider the following as such. Perhaps your raw data is skewed in some way to begin with and does not represent the entire genome? Have you tried using BUSCO (link) to see how complete your transcriptome is?
"Polychaete-hosting cnidarians"? No, it's just a polychaete (where did you even get that phrase from?). The RNA-seq is of--at least in theory--a single species. What happened was a bunch of these poor things were caught in the wild, identified anatomically, sacrificed, then pooled and sequenced as a single sample. I guess that's the gist of it anyway, I wasn't involved in the fieldwork or the sequencing itself.
As I had mentioned in the OP, I did run BUSCO on this (with the eukaryota reference gene set). Here are the values: `C:82.7%[S:42.7%,D:40.0%],F:9.8%,M:7.5%,n:255`. It seems quite complete? This does not, of course, negate your hypothesis that the entirety of the mRNA space was perhaps not captured. (Is this what you meant by "not represent the entire genome"?)
Could it be possible that the multi-mapping is because the transcriptome here is actually two or more closely related organisms? I mean, the individuals that contributed the RNA were identified on the basis of morphology, so it is possible that somebody mixed up closely related species?
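For convenience, the BUSCO summary line quoted above can be pulled apart programmatically (a small hypothetical helper; `parse_busco` is my own name, not part of BUSCO):

```python
import re

# Hypothetical helper: extract the percentages from a BUSCO short-summary line.
def parse_busco(summary):
    scores = {k: float(v) for k, v in re.findall(r"([CSDFM]):([\d.]+)%", summary)}
    scores["n"] = int(re.search(r"n:(\d+)", summary).group(1))
    return scores

s = parse_busco("C:82.7%[S:42.7%,D:40.0%],F:9.8%,M:7.5%,n:255")
print(s["D"])  # -> 40.0, the duplication figure under discussion
# Sanity check: Complete = Single-copy + Duplicated
assert abs(s["C"] - (s["S"] + s["D"])) < 0.1
```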
When I searched for `cnidarian polychaete`, all Google hits seem to indicate that these two organisms occur together, with polychaetes being symbionts of cnidarians.
I am puzzled by the multi-mapping, so that was one wild thought: that perhaps the entire genome is not captured. If you had individuals from several related species, there may indeed be more than one genome represented in your set. Perhaps that is the reason for the multi-mapping.
All good questions to consider.
I went back and checked. Annelids. Annelids. Not cnidarians. Polychaetes are annelids. I am terribly sorry for confusing you there. I somehow had it stuck in my mind that the polychaetes are cnidarians for a while now. I apologize.
Hmm, but even with partial genome capture, two different state-of-the-art clustering tools have indicated that most of the assembly is not redundant. Maybe it's `Bowtie2` that's messing up? Does `bbmap` produce similar output? (I've never used it before, but I am looking at its built-in help section now.) I could try running `bbmap` and see what it says?
If we were to chase the mix-of-species hypothesis on the side, how would I even test this?
(Thank you for all the valuable discussion here and elsewhere, by the way.)
Edit: if the data is a mix from more than one species, would it show up as an observable/notable effect in one of the FastQC plots?
If they had appreciable GC content differences then yes. You should see two peaks (or prominent hump).
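A crude way to look for two GC populations directly from the reads (a toy sketch with made-up sequences; FastQC's per-sequence GC content plot does this properly):

```python
from collections import Counter

def gc_fraction(seq):
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

# Made-up reads; a mix of two species with different GC content
# would show up as two modes in this binned distribution.
reads = ["ATATATAT", "GCGCGCGC", "ATGCATGC", "GGCCAATT"]
hist = Counter(round(gc_fraction(r), 1) for r in reads)
print(sorted(hist.items()))  # -> [(0.0, 1), (0.5, 2), (1.0, 1)]
```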
I don't think `bowtie2` is messing up. The reads must be multi-mapping. You will likely see a similar result with `bbmap`. What `bbmap` allows you to do in addition is to choose what to do with reads that multi-map: see the `ambiguous=` option. If your aim is to do differential expression analysis, `bbmap` may be an option.
Since you have a transcriptome, you could also use `salmon`, which uses a different way of assigning reads to generate counts against transcripts (LINK).
Edit: @Rob Patro (author of Salmon) suggested https://github.com/Oshlack/Corset/wiki .
Here's a link to the sequence GC content plot from `FastQC`. And here's a link to the per-base sequence content plot. There's a bit of a hump in the GC plot, but it doesn't look bimodal?
I have already quantified with `salmon` (and I had a couple of exchanges with @Rob regarding the `salmon` outputs; see here and here). `Salmon` indicates a similar overall mapping rate for this assembly: `96.83%`. But I couldn't find detailed mapping statistics like the ones from `Bowtie2`, so I didn't follow up on the `salmon` values for this purpose (I ended up using them for something else).
And thank you for pointing me to `Corset`, @genomax (and @Rob, indirectly). I guess I could use `Corset` to get an estimate of the sample's gene content, then? I suppose this is what @Rob had in mind when suggesting it?