We are using Nanopore sequencing for transcriptomics analysis, and I am building a QC pipeline. A common default is to remove reads below 500 bp in length, but I could not find arguments for or against this in the literature.
So my question is: assuming the read quality is good, is there an argument for removing reads below a certain length threshold (except, of course, for very short reads of ~10 bp or so)?
Thanks for the information. I am actually using isONcorrect, which can function as a wrapper for pychopper, but I was thinking about how to process the raw input to the pipeline. I will follow your advice, do quality filtering beforehand, and try to use full-length transcripts for downstream analysis.
Thank you!
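The pre-filtering step discussed above can be sketched in a few lines. This is a minimal, illustrative example (thresholds and record format are assumptions, not from the thread); in practice a tool like NanoFilt or chopper does the same job on gzipped FASTQ streams.

```python
# Minimal sketch of pre-filtering: drop reads shorter than a length
# cutoff or below a mean-quality cutoff. Thresholds are illustrative.

def mean_qual(qual_str, offset=33):
    """Mean Phred quality of a FASTQ quality string (Sanger encoding)."""
    return sum(ord(c) - offset for c in qual_str) / len(qual_str)

def filter_fastq(records, min_len=500, min_q=10):
    """Yield (header, seq, qual) records passing both cutoffs.

    `records` is an iterable of (header, seq, qual) tuples.
    """
    for header, seq, qual in records:
        if len(seq) >= min_len and mean_qual(qual) >= min_q:
            yield header, seq, qual
```

Whether 500 bp is the right cutoff depends on your library; the point is only that length and quality filtering are cheap to apply before the isON steps.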
Did you notice any substantial quality improvement by using isONcorrect? If yes, how did you measure transcript quality? And how does isONcorrect act as a "wrapper" around pychopper; isn't it downstream of de novo clustering?
Cool to see that the isON pipeline is actually being used now! I found isONclust to be extremely reliable.
Definitely yes, although my use case was (and is) to quantify highly polymorphic regions. It helped quite a bit: before correction we had trouble mapping reads to the correct transcripts. It is not yet perfect, but it has gotten better. We also used R9 flow cells, so it might be that results will improve significantly with the R10s.
In our case, we had prior knowledge from diagnostic genotyping and PCR, so we had a good idea of which transcripts should be there and, more importantly, what they "should" look like. Before correction I ran some mapping tests against all available known transcripts of that locus (of which there are quite a lot) and found that the reads mapped all over the place. After correction this got better. I am currently considering a roughly 100 bp head trim with NanoFilt, since there is a lot of noise in these regions and it might improve things a little more. This suffers from a similar "I heard it through the grapevine" syndrome, although in this case I at least have data showing that it might help.
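The head trim mentioned above (NanoFilt exposes this as `--headcrop`) is just a prefix slice on sequence and quality together. A plain-Python equivalent, with the ~100 bp figure taken from the thread:

```python
# Equivalent of NanoFilt's --headcrop: drop the first n bases of a read
# and the matching quality values so the two stay in sync.

def headcrop(seq, qual, n=100):
    """Return (seq, qual) with the first n positions removed."""
    return seq[n:], qual[n:]
```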
I was a little unclear: isONcorrect ships a wrapper script that includes pychopper. In this setup, pychopper precedes isONclust.
I really like using the isON pipelines; one does not have time to re-invent the wheel ;)
Cool! So you are in principle conducting reference-based clustering, followed by self-error-correction with isONcorrect?
Is your goal quantification? If yes, did you use dRNA-Seq or the old cDNA-based protocol (SQK-PCS108, I believe)? I found the latter to produce a considerable number of fragmented transcripts due to problems with the ssPCR, so be warned: some "full-length" transcripts may have been truncated prior to adapter ligation (this is old knowledge, though, from 2019/2020).
I'd be happy if you'd report back once you have a preprint/new results. I think your approach is promising.
Yes, this is what I am currently trying to implement and somewhat generalize for our particular use cases.
The goal is transcript-specific quantification, yes. We are currently working with the PCR-cDNA kit (SQK-PCS109) because we had trouble reliably establishing the dRNA-Seq kit.
In some samples we saw very strange transcripts being generated, with long stretches of homopolymers in the middle of the transcripts. I have not yet found a reason for this, and because of (of course) time pressure we have worked with the PCR-cDNA kit for now.
I know the PCR and RT steps are not optimal, but we could generate reliable data with this kit, so this is what we have.
I will absolutely report back; this thread has been very illuminating and helpful!
a) If I understood correctly, you trim raw reads, cluster them (using isONclust, which is part of the pipeline), and then correct errors using isONcorrect?
b) Once you have generated error-corrected transcripts, do you BLAST them directly against the NCBI non-redundant protein database to see their function? I suppose there might be many transcripts.
As far as I understood, they basically do alignment-based transcript quantification. For this, you do not need isONclust (which is used for de novo clustering, the exact opposite!).
Instead, they align ONT transcripts to reference sequences and cluster them based on their position. This information is then used with isONcorrect to conduct error correction. After that, you can do whatever you want, e.g. BLASTing, DEG analysis, and so on.
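The "cluster by position" idea above can be sketched as interval grouping: reads whose reference intervals overlap on the same reference sequence land in one cluster. This is a hedged illustration; the input format and overlap criterion here are my assumptions, not isONcorrect's actual API.

```python
# Sketch of position-based clustering: group (read_id, ref, start, end)
# alignments into clusters of mutually overlapping intervals per reference.

def cluster_by_position(alignments):
    """Return a list of read_id lists; one list per overlap cluster."""
    clusters = []  # each entry: [ref, member_ids, rightmost_end]
    # sort by reference then start so overlapping intervals are adjacent
    for read_id, ref, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        if clusters and clusters[-1][0] == ref and start <= clusters[-1][2]:
            clusters[-1][1].append(read_id)
            clusters[-1][2] = max(clusters[-1][2], end)
        else:
            clusters.append([ref, [read_id], end])
    return [members for _, members, _ in clusters]
```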
Regarding transcript variants, this is an interesting question. OP, can you elaborate on your strategy for alignment? If I may ask, do you use genome or transcriptome as reference? How do you conduct splice aware alignment?
Hi, sorry for the delayed response.
I am happy to elaborate. Regarding transcript variants, I use a multi-pass alignment strategy which I more or less borrowed from STAR's two-pass mapping.
I first align the long reads splice-aware with minimap2 against genomic references (note that this is decoupled from our efforts to quantify highly polymorphic transcript expression, so isONclust is not necessary here).
This first step should help to catch "everything". Afterwards, we use a transcript assembler like stringtie to perform guided transcript assembly. I then construct a reference transcriptome and re-align with minimap2 in long-read mode. Quantification is done with salmon on the minimap2-mapped reads. This will yield every transcript variant that has some sequencing depth.
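As a dry-run sketch, the two-pass flow described above maps to roughly the following commands. The tool names and flags come from the tools' public CLIs (minimap2's `-ax splice`/`-ax map-ont` presets, stringtie's `-L` long-read mode, gffread's FASTA extraction, salmon's alignment-based mode); all file names are placeholders, and nothing is executed here.

```python
# Build (but do not run) the shell commands for the two-pass sketch.
# File names are hypothetical placeholders.

def two_pass_commands(genome="genome.fa", reads="reads.fq"):
    """Return the shell commands for the two-pass quantification sketch."""
    return [
        # pass 1: splice-aware alignment of long reads to the genome
        f"minimap2 -ax splice {genome} {reads} | samtools sort -o pass1.bam",
        # guided transcript assembly (-L = stringtie long-read mode)
        "stringtie -L -o assembled.gtf pass1.bam",
        # extract a transcriptome FASTA from the assembly
        f"gffread -w transcripts.fa -g {genome} assembled.gtf",
        # pass 2: re-align reads against the assembled transcriptome
        f"minimap2 -ax map-ont transcripts.fa {reads} | samtools sort -o pass2.bam",
        # alignment-based quantification with salmon
        "salmon quant -t transcripts.fa -l A -a pass2.bam -o quant",
    ]
```

This is only an outline of the stages named in the post; consult the linked repository for the author's actual implementation.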
Note that this pipeline is still under development. You can check it out here: https://github.com/liscruk/two-pass-nanopore-transcriptomics
One more note: I noticed that you should not trim your reads before using isONclust; doing so has very weird effects on clustering.