Question

Should I include multi-mapped reads for analysis that does not require significant differential expression of genes?

1

Entering edit mode

6.0 years ago

k.woolf ▴ 10

I am in the process of analyzing and visualizing my RNA-seq data from a time course experiment (5 time points, no treatments).

I first mapped and aligned reads to my organism's reference genome using Tophat, then filtered the output bam files to maintain high quality/likelihood of correct alignment and to remove multi-mapped reads, then I used Cufflinks to assemble transcripts, and finally used HTseq-count to generate count data for DE analysis with an edgeR/limma based package.

I also want to perform a type of targeted heatmap/clustering analysis to look at larger lists of GOIs (that are not necessarily significantly DE) to determine which of these genes appear to change similarly, change the most over time, or have the highest expression levels vs. which ones do not appear to be induced or very highly expressed during this time course. I plan to use this information to weed out genes that are not likely involved in the process I’m studying, and to keep those demonstrating certain patterns and levels of temporal change in mind as candidates for later analysis. For example, some of the genes in my lists share functional names and have high sequence similarity and I would like to determine which specific genes within these groups of similarly annotated genes are most responsive throughout this time course.

Is it valid to use data derived from HTseq-count where multi-mapped reads have been discarded, even though this is not a DEG-based analysis? From my understanding, using these filtered data would help to avoid false positives, but I also realize that HTseq-count generates data for DE comparison between treatments (or in my case, time points) rather than for comparison between 2 or more genes. I want to be sure that I’m not drawing poor conclusions by using an inappropriately filtered version of my data that is useful for DEG analysis but not for other types of analyses.

Thanks in advance for any insights.

RNA-Seq clustering heatmap HTseq multi-mapped • 4.2k views

ADD COMMENT • link updated 5.8 years ago by LauferVA 4.2k • written 6.0 years ago by k.woolf ▴ 10

score 2 · Answer 1 · 2018-08-04

@k.woolf -

Because your goals still rely on quantitative comparisons, I would recommend you use either unique reads only, or that you use a published method of partial transcript estimation. To explore this, let's think about two simple approaches, and try to identify problems.

Consider a single read that does not map uniquely, or rather, that maps to multiple different genomic locations equally well.

Approach 1: You map a single read to two or more places.

Potential problem 1: you are in some way asserting there were actually two or more reads, not one, artificially increasing the count of one or more homologs that in reality did not have that read.

Approach 2: You map each unique read to only one location.

Potential problem 2: There is a possibility that you map it to the wrong homolog. This could incorrectly increase the score of one and decrease the score of another, introducing error into your data.

Other researchers have devised alternative approaches that attempt to mitigate problems with 1 and 2. Here, I think the notion of a partial transcript is likely to be of use to you. As described in this free text publication titled "ORMAN: Optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms." several authors have proposed various ways of dividing a read amongst possible mapping locations. For example, EM algorithms to assign reads to one of several homologs have been developed, some of them simply divide the read (the read maps half here and half here, so assign a score of 1/2 to both).

For your specific problem, because there are likely to be a larger number of irrelevant genes compared to relevant genes. Thus, to avoid issues, you need to take a conservative approach, which is neither 1 nor 2 exactly. In particular, I'd select an approach that weights partial transcript estimates using other info from the seq experiment.

One final thought. There are new algorithms being developed specifically for scRNA seq data that iteratively derive estimates for gene transcription levels in individual cells. This methods could give you additional insight even if they do not address your data specifically.