Nanopore: Should you remove reads below a certain length?
1
1
18 months ago
chrys ▴ 60

Hi there,

We are using Nanopore sequencing for transcriptomics analysis, and I am building a QC pipeline. A commonly repeated default recommendation is to remove reads shorter than 500 bp, but I could not find arguments for or against this in the literature.

So my question is: assuming read quality is good, is there an argument for removing reads below a certain length threshold (except, of course, extremely short reads of 10 bp or so)?

Thanks!

QC Nanopore • 1.5k views
3
18 months ago
ponganta ▴ 540

As always, one needs to know which downstream analysis you have in mind.

It is recommended, yes, but it doesn't make much sense for transcriptomic data. Many transcripts are well below 500 nt in length, so you tend to lose information. I recently noticed this in BUSCO analyses. My suggestion would be to try length filtering and compare it to other measures (while watching out for transcriptome completeness). In my opinion, filtering for quality with NanoFilt and selecting full-length transcripts with something like pychopper would be more appropriate.
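To make the suggested comparison concrete, here is a minimal sketch in plain Python that counts how many reads survive a quality filter alone versus a quality filter plus a length cutoff. The function names, the Q10 threshold and the 500 bp cutoff are illustrative assumptions, not anything prescribed by NanoFilt or pychopper. Mean quality is computed by averaging error probabilities rather than raw Phred scores.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged over error probabilities (not raw Phred scores)."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def compare_filters(records, min_quality=10.0, min_length=500):
    """Count reads kept by quality alone vs. quality plus a length cutoff.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    kept_quality = kept_both = 0
    for _name, seq, qual in records:
        if mean_phred(qual) >= min_quality:
            kept_quality += 1
            if len(seq) >= min_length:
                kept_both += 1
    return {"quality_only": kept_quality, "quality_and_length": kept_both}
```

With a real file this could be called as `compare_filters(read_fastq(open("reads.fastq")))`. The gap between the two counts shows how many good-quality short reads a length cutoff would discard.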

0

Thanks for the information. I am actually using isONcorrect, which can function as a wrapper for pychopper, but I was thinking about how to process the raw input to the pipeline. I will follow your advice: do quality filtering beforehand and try to use full-length transcripts for downstream analysis.

Thank you !

0

Did you notice any substantial quality improvement by using isONcorrect? If yes, how did you measure transcript quality? And how does isONcorrect act as "wrapper" around pychopper, isn't it downstream from de novo clustering?

Cool to see that the isON-pipeline is actually being used now! I found isONclust to be extremely reliable.

1

Did you notice any substantial quality improvement by using isONcorrect?

Definitely yes - although my use case was / is to quantify highly polymorphic regions. It helped quite a bit. Before correction we had some trouble mapping to the correct transcripts. It is not yet perfect but it has gotten better. We also used R9 flow-cells so it might be that with the R10s it will get significantly better.

If yes, how did you measure transcript quality?

In our case, we had prior knowledge from diagnostic genotyping and PCR. So we had a good idea of what transcripts should be there and, more importantly, what they "should" look like. Before the correction I did some mapping tests against all available known transcripts of that locus (of which there are quite a lot) and found that the reads went all over the place. After correction this got better. I am currently considering cropping roughly 100 bp from the start of each read (head trimming) with NanoFilt, since there is a lot of noise in these regions and this might improve things a little more. This suffers from a similar "I heard it through the grapevine" syndrome, although in this case I at least have data showing that it might help.
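For readers unfamiliar with head trimming, the operation itself is simple: drop the first N bases and the matching quality values from each read. The sketch below shows this in plain Python. The function name, the 100 bp default and the `min_remaining` guard are assumptions for illustration, and in practice NanoFilt's own head-cropping option would be used instead.

```python
def headcrop(seq, qual, crop=100, min_remaining=1):
    """Remove the first `crop` bases and quality values from a read.

    Returns (sequence, quality), or None if fewer than `min_remaining`
    bases would be left. Both defaults are hypothetical.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) - crop < min_remaining:
        return None  # read is too short to survive cropping
    return seq[crop:], qual[crop:]
```

Note that reads shorter than the crop length are lost entirely, so head cropping also acts as an implicit length filter.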

And how does isONcorrect act as "wrapper" around pychopper, isn't it downstream from de novo clustering?

I was a little unclear. Basically, isONcorrect ships with a wrapper script that includes pychopper. In this case pychopper precedes isONclust.

I actually really like using the isON-pipelines - one does not have time to reinvent the wheel ;)

0

Cool! So you are in principle conducting reference-based clustering, followed by self-error-correction with isONcorrect?

Is your goal quantification? If yes, did you use dRNA-Seq, or the old cDNA-based protocol (SQK-PCS108 I believe)? I found the latter to produce a considerable number of fragmented transcripts due to problems with the ssPCR, so be warned. Some "full-length" transcripts may have been truncated prior to adapter ligation (this is older knowledge, though, from 2019/2020).

I'd be happy if you'd report back once you have a preprint/new results. I think your approach is promising.

0

Cool! So you are in principle conducting reference-based clustering, followed by self-error-correction with isONcorrect?

Yes, this is what I am currently trying to implement and somewhat generalize for our particular use cases.

The goal is transcript-specific quantification, yes. We are currently working with the PCR-cDNA kit SQK-PCS109 because we had some trouble establishing the dRNA-Seq kit reliably.

In some samples, very strange transcripts were generated, with long stretches of homopolymers in the middle. I have not yet found a reason for this, and because of (of course) time pressure we have worked with the PCR-cDNA kit for now.

I know that performing RT and PCR is not optimal, but we could generate reliable data with it, so this is what we have.

I will absolutely report back. This thread has been very illuminating and helpful!

0

Hello,

a) If I understood correctly, you trim the raw reads, cluster them (using isONclust, which is part of the pipeline) and then correct errors using isONcorrect?

b) Once you have generated error-corrected transcripts, do you BLAST them directly against the NCBI non-redundant protein database to determine their function? I suppose there might be many transcripts?

0

As far as I understood, they basically do alignment-based transcript quantification. For this, you do not need isONclust (which is used for de novo clustering, the exact opposite!).

Instead they align ONT transcripts to reference sequences and cluster them based on their position. This information is then used with isONcorrect to conduct error correction. After that, you can do what you want, e.g. BLAST searches, differential expression analysis, or whatever.

Regarding transcript variants, this is an interesting question. OP, can you elaborate on your strategy for alignment? If I may ask, do you use a genome or a transcriptome as reference? How do you conduct splice-aware alignment?
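As a rough illustration of what clustering "based on position" could mean, the sketch below groups reads whose alignments overlap on the same reference sequence. This is only one plausible reading of the approach, not the OP's actual implementation. The function name and the `(read_id, reference, start, end)` tuple layout are assumptions.

```python
from collections import defaultdict


def cluster_by_position(alignments):
    """Group reads whose alignments overlap on the same reference.

    `alignments` is an iterable of (read_id, reference, start, end) tuples
    with 0-based, half-open coordinates (a hypothetical input layout).
    Returns a list of (reference, [read_ids]) clusters.
    """
    by_ref = defaultdict(list)
    for read_id, ref, start, end in alignments:
        by_ref[ref].append((start, end, read_id))

    clusters = []
    for ref in sorted(by_ref):
        current, current_end = [], None
        for start, end, read_id in sorted(by_ref[ref]):
            # A gap between this read and the running cluster closes the cluster.
            if current and start >= current_end:
                clusters.append((ref, current))
                current, current_end = [], None
            current.append(read_id)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append((ref, current))
    return clusters
```

Each resulting cluster could then be written out as its own FASTQ file and handed to an error-correction step.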

0

Hi, sorry for the delayed response.

I am happy to elaborate. Regarding transcript variants, I use a multi-pass alignment strategy that I more or less borrowed from STAR's two-pass mapping.

I first align the long reads in splice-aware mode with minimap2 against genomic references (note that this is decoupled from our efforts to quantify expression of highly polymorphic transcripts, so isONclust is not necessary for this).

This first step should help in catching "everything". Afterwards we use a transcript assembler such as StringTie for guided transcript assembly; then I construct a reference transcriptome and re-align using minimap2 in long-read mode. Quantification is done with salmon on the reads mapped with minimap2. This will yield every transcript variant for which you have some sequencing depth.

Note that this pipeline is still under development. You can check it out here: https://github.com/liscruk/two-pass-nanopore-transcriptomics
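For orientation, the steps described above could be strung together roughly as follows. This is a hedged sketch, not the code from the linked repository: the function name, file names and output layout are assumptions, and the exact tool options (in particular salmon's `--ont` mode, which needs a reasonably recent salmon release) should be checked against the versions you have installed. The function only builds the shell commands and does not run anything.

```python
import shlex


def two_pass_commands(reads, genome, annotation, outdir="out", threads=4):
    """Build (step_name, shell_command) pairs for a two-pass long-read workflow.

    All output file names are hypothetical; nothing is executed here.
    """
    q = shlex.quote
    genome_bam = q(f"{outdir}/genome.sorted.bam")
    assembled_gtf = q(f"{outdir}/assembled.gtf")
    transcripts_fa = q(f"{outdir}/transcripts.fa")
    transcriptome_bam = q(f"{outdir}/transcriptome.bam")
    t = int(threads)
    return [
        # Pass 1: splice-aware alignment against the genome.
        ("align_genome",
         f"minimap2 -ax splice -t {t} {q(genome)} {q(reads)} "
         f"| samtools sort -@ {t} -o {genome_bam} -"),
        # Annotation-guided assembly in StringTie's long-read mode (-L).
        ("assemble",
         f"stringtie -L -p {t} -G {q(annotation)} -o {assembled_gtf} {genome_bam}"),
        # Extract transcript sequences to build the reference transcriptome.
        ("extract_transcripts",
         f"gffread -w {transcripts_fa} -g {q(genome)} {assembled_gtf}"),
        # Pass 2: re-align to the transcriptome; the BAM is left unsorted so
        # that all alignments of a read stay adjacent for quantification.
        ("align_transcriptome",
         f"minimap2 -ax map-ont -t {t} {transcripts_fa} {q(reads)} "
         f"| samtools view -b -o {transcriptome_bam} -"),
        # Alignment-based quantification with salmon.
        ("quantify",
         f"salmon quant --ont -l A -p {t} -t {transcripts_fa} "
         f"-a {transcriptome_bam} -o {q(outdir + '/salmon')}"),
    ]
```

Each command could be executed in order with `subprocess.run(cmd, shell=True, check=True)` once the output directory exists. Printing the commands first is a cheap way to review them before committing compute time.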

Another note: I noticed that you should not trim your reads before using isONclust, as this will have very weird effects on clustering.