Question

Should I exclude unmapped reads and proceed with the analysis for low mapped samples, or omit them altogether?

9 months ago

Sib ▴ 60

I am conducting RNA-seq analysis on raw reads of Solanum lycopersicum (tomato). I am aligning the raw reads of 24 samples to the reference genome obtained from Ensembl using STAR. Most samples have mapping rates above 90%, and all are above 86% except for three samples with mapping rates of 25.7%, 7.5%, and 11.3%. In those three, STAR reports most of the unmapped reads as "too short", which, based on my research, seems to be related to rRNA contamination.

This is peculiar because, as far as I know, samples with rRNA contamination typically exhibit more than one peak in the Per Sequence GC content plot. However, my samples only show a single peak and pass this test!

Regardless, my primary goal is to conduct a differential expression analysis. It's not possible for me to redo sequencing. I am uncertain whether I can exclude unmapped reads from the BAM file and proceed with the analysis for these three samples, or if I should omit them from the analysis altogether.

STAR mapping RNAseq • 631 views

ADD COMMENT • link updated 9 months ago by dthorbur ★ 2.5k • written 9 months ago by Sib ▴ 60

The samples with low mapping rates can be discarded as they are likely to be contaminated.

ADD REPLY • link 9 months ago by Huiyang ▴ 190

score 2 · Accepted Answer · 2024-02-12

9 months ago

dthorbur ★ 2.5k

There are a few steps I normally take before deciding to remove a sample, but these three with low mapping rates are certainly good candidates.

I would look at the distribution of samples using some form of ordination analysis (e.g., PCA, NMDS), try to find out what the unmapped reads are using tools like Kraken2 or BLAST, and check how many reads are still mapping in comparison to the other samples.

If the low-mapping samples cluster away from their replicates, removal is pretty easy to justify. The same applies if they have only a small proportion of the mapped reads compared to the rest of your data, as normalisation steps would then distort your other samples. Identifying what the unmapped reads likely are is more to help understand what went wrong (failed ribo-depletion, contamination, etc.). All good things to know for future sample processing.
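For the read-count check, a minimal Python sketch along these lines might help. It assumes you paste in the text of each sample's STAR `Log.final.out`. The function names, the toy log text, the sample names, and the 50%-of-median cutoff are all made up for illustration and are not a recommended threshold.

```python
from statistics import median

def parse_star_log(text):
    """Pull a few fields out of the text of a STAR Log.final.out file."""
    fields = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" have no value
        key, value = (part.strip() for part in line.split("|", 1))
        fields[key] = value
    return {
        "input_reads": int(fields["Number of input reads"]),
        "unique_reads": int(fields["Uniquely mapped reads number"]),
        "unique_pct": float(fields["Uniquely mapped reads %"].rstrip("%")),
        "too_short_pct": float(fields["% of reads unmapped: too short"].rstrip("%")),
    }

def flag_low_depth(unique_reads_by_sample, min_fraction=0.5):
    """Return samples whose uniquely mapped read count is below
    min_fraction of the median across samples (arbitrary example cutoff)."""
    med = median(unique_reads_by_sample.values())
    return sorted(
        sample for sample, n in unique_reads_by_sample.items()
        if n < min_fraction * med
    )

# Toy log excerpt with invented numbers, only to show the expected layout.
EXAMPLE_LOG = (
    "                          Number of input reads |\t23400000\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t6013800\n"
    "                        Uniquely mapped reads % |\t25.70%\n"
    "                 % of reads unmapped: too short |\t72.10%\n"
)

stats = parse_star_log(EXAMPLE_LOG)
flagged = flag_low_depth(
    {"rep1": 20_000_000, "rep2": 21_000_000, "low": stats["unique_reads"]}
)
print(stats["unique_pct"], flagged)
```

This only tells you which samples have far fewer usable reads than the rest; whether that is a problem still depends on how they behave in the ordination.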

ADD COMMENT • link 9 months ago by dthorbur ★ 2.5k

Thank you for your response. However, PCA is typically plotted using normalized data. The low-mapping sample has about 6 million uniquely mapped reads, while its replicates have about 20 million. Could normalizing samples with such different depths give misleading results? If I normalize anyway and then generate a PCA plot, and the low-mapping samples do not cluster apart from their replicates, can I rely on this result and retain them?

ADD REPLY • link 9 months ago by Sib ▴ 60

You can just use presence/absence data on which transcripts have reads mapped, or even run PCA on the raw data, though in that case the first few PCs may be explaining abundance differences more than anything else.
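As a rough illustration of the presence/absence idea, here is a small Python sketch. Note that it is not a PCA: it swaps in pairwise Jaccard distances between samples, which could then be fed into an ordination such as NMDS. The gene names, counts, and the `min_count` threshold are invented for the example.

```python
from itertools import combinations

def presence_absence(counts, min_count=1):
    """Convert {sample: {gene: count}} into {sample: set of detected genes}."""
    return {
        sample: {gene for gene, n in genes.items() if n >= min_count}
        for sample, genes in counts.items()
    }

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def pairwise_jaccard(detected):
    """Jaccard distance for every pair of samples, keyed by sorted name pair."""
    return {
        (s1, s2): jaccard_distance(detected[s1], detected[s2])
        for s1, s2 in combinations(sorted(detected), 2)
    }

# Toy raw counts, made up purely to show the mechanics.
counts = {
    "rep1": {"geneA": 120, "geneB": 30, "geneC": 0, "geneD": 8},
    "rep2": {"geneA": 95, "geneB": 41, "geneC": 0, "geneD": 5},
    "low": {"geneA": 20, "geneB": 0, "geneC": 0, "geneD": 0},
}

distances = pairwise_jaccard(presence_absence(counts))
for pair, d in distances.items():
    print(pair, round(d, 3))
```

Keep in mind that a shallower library will tend to detect fewer lowly expressed genes, so presence/absence is not completely independent of depth either.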

ADD REPLY • link 9 months ago by dthorbur ★ 2.5k