BULK mRNA-seq with UMIs. Do I need to normalize by gene length?
Entering edit mode
9 weeks ago


I am analyzing some BULK mRNA-seq data that included UMIs during the sequencing.

I don't have much experience analyzing bulk RNA-seq, and it is my first time dealing with UMIs there. I have more experience in single-cell RNA-seq, and I thought that the concept of UMIs will translate directly between technologies (1 UMI = 1 mRNA molecule). But it clearly cannot. In bulk RNA-seq it seems that UMIs are used only for PCR-amplification correction. But even then, you still get much more counts in longer genes, leading to a gene-length bias even after removing PCR duplicates.

I would appreciate if you could help me correct some misconceptions and mistakes that I might be doing during my analyses.

My samples are treatment vs control across 4 timepoints. They were sequenced using:

mRNA sequencing (polyA enriched) using Illumina NovaSeq6000 sequencing, Paired-End, 150 bp.

My pipeline was:

  1. fastp to trim the reads
  2. umitools to add the UMI to the header of the fastq
  3. alignment with STAR
  4. umitools (dedup) to remove duplicates
  5. (when needed) merge multiple bams from a single sample into a single bam file
  6. htseq-count to generate a counts matrix
  7. DESeq2 analysis

When running the Differential Gene Expression analysis in DESeq2, it used its default normalization. As far as I have read, such normalization is enough, as deduplicating UMIs removed the PCR amplification bias. On top of that, as DESeq2 compares the same gene across samples, whatever gene-length bias present would be identical in all samples and "cancel out". Up to here, I think my analysis is good.

I also need to investigate an effect on gene length by treatment/length. DESeq2's normalization is not intended for comparisons WITHIN-sample, as it only applies a single normalization factor for the whole sample. I've also been reading about using CQN normalization. However, I am afraid that using it would be a mistake, as it calculates sample-specific gene length biases, which are exactly the thing that I am trying to measure.

My analysis so far has been:

  1. Divide the genes (after filtering by minimum expression & such) into 10 equal-sized bins by gene length.
  2. Calculate the log1p ( DESeq2 normalized counts ) per gene / sample. I'll call this log1p_norm.
  3. Calculate the "fraction of total transcripts" that the genes of each gene-length-bin represent among the total of the cell ( sum( log1p_norm[ genes in quantile ] ) / sum( log1p_norm[all genes] ) ) per each sample.
  4. Finally, compare these ratios between treatments and timepoints in a myriad of ways.

The mean expression level of the genes in each bin (DESeq2 normalized counts, before log1p trasnform) looks very similar, except for the two bins with the shortest genes.

Mean expression of genes in each bin (log scale)

edit: each of these groups/bins represent the mean expression across 24 samples of ~1,300 genes per group of increasing gene length. The samples are the same in all groups, but each gene is accounted for exclusively in a single group. This is not expression across the gene body.

However, the analyses I perform do never compare genes from 2 different bins. I simply compare the averages of bin X in condition 1 to the averages of the same bin in condition 2. Although I am not comparing exclusively a single gene between samples, can I apply the same logic as for DESeq2 normalization of comparing things (genes) with a bias with the same things with the same biases in a different sample, thus canceling out?

I think I am in the clear and this analysis could be good for publishing. Do you think so? Do you see any failure in the way to analyze this data? Or any bias that I am forgetting to correct?


mRNA UMI normalization illumina DESeq2 • 580 views
Entering edit mode

The original reason for seeking to normalize by gene length is that the same molecule could give rise to multiple fragments therefore leading to a bias in longer genes. Because you are using UMI, then (in a perfect world) each molecule can give rise to only a single count, obviating this concern. Therefore you are safe in not normalizing by gene length.

One way to check would be to quantify your bam files using raw read counts (rather than UMI) and verifying that this introduces a gene length bias.

Entering edit mode

Maybe there are different sorts of kits, but isn't the usual workflow that you still have mRNA/cDNA fragmentation first, and then ligate UMI-tagged adaptors to the fragmentated products, followed by PCR. So longer transcripts will still produce more fragments to be sequenced, and the UMI "just" removes the redundancy from PCR amplification?

@OP, basically what you're looking for is differential transcript usage and differential transcript expression, right? If not, please describe how the mode of action of the gene length bias would be introduced by the drug. Isn't it that just more longer or shorter isoforms are being expressed? Or is this some different mechanism? Does it require assembly of the transcriptome to pick up novel isoforms? I feel like the analysis you do is rather custom. Maybe there are existing approaches to use here if you describe the actual sci. question a bit more.

Entering edit mode

Yes, it depends whether you have performed fragmentation before or after amplification and the specific method of fragmentation and UMI adding. I'm used to seeing fragmentation occur after amplification.

Entering edit mode

I am not testing for isoforms, but a decline on mRNA abundance.

We are testing a drug that is supposed to slow down transcription. We hypothesize that longer genes should show a decline in abundance, as they take longer to transcribe, while shorter genes should be relatively impervious to its effect.

I am not looking at the expression through the gene body of each gene, but the total expression level (htseq-count , normalized by DESeq2) of each gene as a whole, grouping genes into 10 bins with the 0-10% shortest genes, 10-20%, ...90-100% longest (~1,300 genes per group). Thus, what we want to compare is changes in the relative abundance of short/long genes in the treated vs control samples at each timepoint. We don't care (at this point) about transcripts, but about #counts genes length X / #counts all genes per sample.

We did mRNA sequencing (polyA enriched) using Illumina NovaSeq6000 sequencing, Paired-End, 150 bp.

Apart of all this, I am indeed testing differential transcript usage with DEXSeq, but that analysis is beyond the scope of the question in this post.

Entering edit mode
9 weeks ago

In bulk RNA-seq, duplicated cDNA fragments may reflect true biological duplication arising from highly expressed genes. Adding a UMI to the adapter creates an individual fragment/adapter pair, such that any duplicates are clearly attributable to technical artefacts and can safely be removed. Technical artefacts may be PCR duplicates, but can also be exclusion amplification duplicates in patterned flow cells or optical duplicates in unpatterned flow cells.

However, the adapters may be added after the cDNA is created and sometimes also fragmented, and you should consult the manual of the kit used for library preparation regarding details of the method. For example, with QuantSeq from Lexogen, which produces just one fragment per transcript, no length normalization is required. However, this was clearly not used, because it has an extreme 3' (or 5') bias that is evidently absent from your data and many other kits require length normalization.

I suppose, the RNA was reverse-transcribed into cDNAs using a poly(T) primer, which would explain the moderate 3' prime positional bias in coverage? In that case, you need to normalize for length, considering the fragment length distribution. Maybe the easiest is to use salmon for quantification, because it employs a quite fancy error correction termed fragment–transcript agreement model and then use the relative abundances determined by salmon for the whole transcripts to normalize your binned counts?

Entering edit mode

Sorry if it was not clear in the figure I posted.

I am not analyzing expression across the gene body of individual genes. I am using the htseq-count results of the whole gene (1 value per whole gene per sample). After (basic) normalization I am grouping genes into 10 groups (~1,300 genes per group) by increasing gene length (shortest 0-10% genes,... longest 90-100% genes). And then comparing the relative abundance of each group in each sample / treatment / timepoint.

I honestly don't know anything about the UMI protocol used by the external company who did the sequencing. The only information I have is:

mRNA sequencing (polyA enriched) of 115 totalRNA samples (human), using Illumina NovaSeq6000 sequencing, Paired-End, 150 bp.

I have already sent an email to the sequencing company asking for details on the UMI protocol used.

Thanks for the link to Salmon. I'll give it a look!

Entering edit mode

Yes, the company needs to provide you with more information on the library preparation - also for the methods section of your future publication. With 150bp reads, your inner distance might even be negative, because fragments overlap.

I think using Salmon instead of htseq-count will save you a lot of headache here. You can also run RSeQC and QualiMap on your data to get some more insight into your the read distribution of your data. Both tools are supported by MultiQC, so it is quite convenient to get an overview over your 115 samples.

Regarding your scientific question, I propose quantifying the transcripts exclusively by their introns. This might lead nowhere, but restricting yourself to unspliced pre-mRNA will probably give you a better idea how well your drug actually works. Additionally, I would revise your binning strategy. Rather than comparing thousands of short genes, I would look at smaller groups that are normally co-regulated and uniformly change their expression. You probably have this information or can devise it from your single-cell RNA-seq. If you repeatedly see a "failure" of co-regulation in those groups for the longer genes, this might be an effect of your drug - however it will not be a simple uniform correlation with length, since transcriptional elongation doesn't happen with constant speed.

Both of my proposals are, however, a wild guess. So rather look at papers that have elaborated on transcriptional pausing, how they have analysed their data, e.g. Gajos et al. and Neuman et al. If you have the funds and time, looking specifically how and if your drug affects pausing with TV-PRO-seq might be interesting as well.


Login before adding your answer.

Traffic: 1075 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6