Quality filtering prior to pseudoalignment
5 months ago
Roberto ▴ 20

Hi,

I have often read (and anecdotally confirmed) that adapter removal, quality trimming and such are not necessary for simple estimation of transcript relative abundance in a pseudoalignment framework. My tool of choice is Kallisto, and I am doing bulk RNAseq on a NextSeq, for context.

If low-quality matches are going to be discarded anyway, is there any point in actually quality-filtering reads? It would certainly reduce the number of mismatches, but assuming one is willing to accept some background noise in exchange for more depth/coverage, the tradeoff could be worth it (e.g., when quality is biased towards a few samples).

A middle ground could be disabling quality filtering but performing trimming instead, but I imagine that might end up with a lot of very short reads that you might have to reject to avoid them matching all over the place...
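For concreteness, the "trim instead of filter" idea looks roughly like this; a minimal Python sketch with made-up window size and thresholds, not what fastp actually does internally:

```python
# Hypothetical sketch of sliding-window 3' quality trimming (in the spirit
# of fastp's --cut_right) followed by a minimum-length filter. The window,
# quality, and length thresholds are illustrative, not fastp defaults.

def quality_trim(seq, quals, window=4, min_mean_q=20):
    """Scan 5'->3'; cut the read at the first window whose mean Phred
    quality drops below min_mean_q."""
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean_q:
            return seq[:i]
    return seq

def keep_read(seq, min_len=30):
    """Reject reads that became too short after trimming, so they don't
    match ambiguously all over the transcriptome."""
    return len(seq) >= min_len

read = "ACGTACGTACGTACGT"
quals = [35] * 8 + [5] * 8          # good first half, poor second half
trimmed = quality_trim(read, quals)
print(trimmed, keep_read(trimmed))  # the short survivor gets rejected
```

This is exactly the concern above: aggressive trimming plus a length filter can throw away the whole read, so the two thresholds have to be tuned together.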

Anyhow, I would appreciate any advice, explanations or general thoughts.

fastp rna-seq pseudoalignment kallisto

If adapters are present then they should be removed. Pseudoalignment does not do soft-clipping the way traditional aligners do, so adapters can reduce the mapping rate or lead to wrong alignments. Maybe I am personally paranoid about this, but I always remove the dirt (here, adapters) from any data before doing any additional processing step. The same goes for alignment, even when it may not be strictly necessary; I just don't like guessing what contamination in the data might or might not do. Just remove it, it's not a major computational burden.
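For what it's worth, the core of 3' adapter removal is conceptually simple. A hedged Python sketch (the adapter sequence and overlap threshold here are illustrative, not any kit's real values; real pipelines use the actual kit adapters or fastp's auto-detection):

```python
# Toy 3' adapter clipping: look for the full adapter anywhere in the read,
# or a prefix of it hanging off the 3' end (reads that run only partway
# into the adapter). Overlap threshold is illustrative.

def trim_adapter(read, adapter, min_overlap=5):
    # Full adapter present: clip at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise check for a partial adapter at the 3' end, longest first.
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

adapter = "AGATCGGA"                               # illustrative sequence
print(trim_adapter("ACGTACGTAGATCGGA", adapter))   # full adapter read-through
print(trim_adapter("ACGTACGTACGTAGATC", adapter))  # partial 5 bp run-in
```

Real trimmers also tolerate mismatches in the overlap and handle paired-end overlap analysis, which this sketch ignores.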


I totally second this. Nowadays, removing adapters is so straightforward that I don't give it much thought. I was asking about the general quality filters, e.g. rejecting reads on the basis of too many bases not passing the quality threshold.


To expand on what ATpoint already articulated: why would we skip something, even something easy to do, unless we are 100%, absolutely, positively certain it is not needed? For one, others (colleagues, reviewers) may not share our opinion.

How can anyone be sure that adapter trimming is not required in this case without doing a large-scale comparative analysis, say 100 or so datasets, with and without trimming? Even if we did this, someone could come along who has done the comparison on 200 datasets and reached a different conclusion. It seems much easier to trim the adapters than to worry about something we can't easily prove with 100% certainty.

5 months ago
dsull ★ 5.9k

Pseudoalignment with large k-mer sizes is pretty robust to adapter contamination (since adapters are constructed specifically to have high sequence divergence from the reference genome/transcriptome; and also, not every k-mer in a read is used in mapping). For standard assays and analyses, I usually don't trim because it doesn't really make a difference. But for some of the more custom things I do, I will always trim. In any case, trimming doesn't hurt (unless you're doing a project that involves mining thousands of public RNAseq datasets, in which case trimming could be a hurdle; though, with the new kallisto release, you can pipe paired-end trimming output directly into kallisto).
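The robustness argument can be seen in a toy example: k-mers that span the adapter simply find nothing in the index and contribute nothing, while k-mers from the genuine insert still hit. A small Python sketch (sequences, adapter, and k are made up to keep the toy short; kallisto's actual default is k=31):

```python
# Why large-k pseudoalignment tolerates adapter read-through: adapter-
# derived k-mers are absent from the transcriptome index, so they are
# silently ignored rather than causing mismapping.

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 9                                # toy value; kallisto defaults to 31
transcript = "ACGTGCTAGCTAGGCTAACCGGTTACGATCGATCG"
index = kmers(transcript, k)

insert = transcript[5:25]            # 20 bp of genuine sequence
adapter = "AGATCGGAAGAG"             # illustrative adapter-like tail
clean_read = insert
dirty_read = insert + adapter        # read-through into the adapter

for name, read in [("clean", clean_read), ("dirty", dirty_read)]:
    hits = sum(km in index for km in kmers(read, k))
    print(name, hits, "of", len(kmers(read, k)), "k-mers match")
```

The dirty read still recovers every k-mer from its genuine portion; only the junction- and adapter-spanning k-mers drop out, which is why untrimmed data often pseudoaligns fine (at a somewhat reduced per-read k-mer yield).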

Quality trimming is fine: even if a trimmed read maps less specifically to targets, it's still information that can be used in the abundance estimation (even a single k-mer won't "map all over the place" for large k). I'd prefer it over filtering: if the first 75 bp of your read look amazing and the second 75 bp look like crap, just use the first 75 bp.


Totally understand your point. If I have to "merge" your position with the other comments, I would say the case for trimming is that, while not strictly necessary, it is sufficiently "inexpensive" to be worth keeping in the routine...

And I agree with your second point. I am surprised that's not the default.


Trimming is cheap, but I prefer to use non-quality-trimmed reads for alignment-free k-mer matching. Unless you have recalibrated them, the quality scores are not going to be accurate anyway, and any genuinely low-quality region simply won't match anything and will thus be ignored.

If you're using some threshold (like "at least 80% of k-mers must match the reference") which would be impacted by low quality, you can also run k-mer matching first, then trim only the non-matching sequence and run again.
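That "match, then trim, then rematch" idea could be sketched like this (helper names and the 80% threshold are illustrative, not any tool's actual API):

```python
# Hypothetical two-pass scheme: score a read by the fraction of its k-mers
# found in a reference k-mer set; reads under the threshold get clipped
# back to the last position whose k-mer still matched, then rescored.

def kmers_at(seq, k):
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def match_fraction(seq, index, k):
    kms = kmers_at(seq, k)
    return sum(km in index for _, km in kms) / len(kms)

def trim_to_matching(seq, index, k):
    """Clip the read just past the last k-mer that matches the index."""
    last = max((i for i, km in kmers_at(seq, k) if km in index), default=None)
    return seq if last is None else seq[:last + k]

k = 9
ref = "ACGTGCTAGCTAGGCTAACCGGTTACG"
index = {ref[i:i + k] for i in range(len(ref) - k + 1)}
read = ref[2:20] + "AGATCGGAAG"      # genuine prefix + adapter-like tail

if match_fraction(read, index, k) < 0.8:   # illustrative 80% threshold
    read = trim_to_matching(read, index, k)
print(read, match_fraction(read, index, k))
```

The untrimmed read fails the threshold because its tail contributes only non-matching k-mers; after clipping to the matching prefix, every remaining k-mer hits.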
