I am analyzing some BULK mRNA-seq data that included UMIs during the sequencing.
I don't have much experience analyzing bulk RNA-seq, and it is my first time dealing with UMIs there. I have more experience with single-cell RNA-seq, and I assumed the concept of UMIs would translate directly between technologies (1 UMI = 1 mRNA molecule). But it clearly cannot. In bulk RNA-seq it seems that UMIs are used only to correct for PCR amplification. Even then, you still get many more counts for longer genes, leaving a gene-length bias even after removing PCR duplicates.
I would appreciate it if you could help me correct some misconceptions and mistakes I might be making in my analyses.
My samples are treatment vs control across 4 timepoints. They were sequenced using:
mRNA sequencing (polyA enriched) using Illumina NovaSeq6000 sequencing, Paired-End, 150 bp.
My pipeline was:
- fastp to trim the reads
- umi_tools extract to move the UMI from the read into the FASTQ header
- alignment with STAR
- umi_tools dedup to remove PCR duplicates
- (when needed) merge multiple BAMs from a single sample into a single BAM file
- htseq-count to generate a counts matrix
- DESeq2 analysis
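For concreteness, the per-sample steps above could look like the following command templates. This is only a sketch: all file names, the 8-nt UMI pattern, which read carries the UMI, and the htseq-count strandedness setting are placeholders that depend on the library kit.

```python
# Command templates for the per-sample pipeline described above.
# File names, the UMI pattern, and "-s reverse" are hypothetical placeholders.
def pipeline_commands(sample: str, umi_pattern: str = "NNNNNNNN") -> list[str]:
    return [
        # 1) adapter/quality trimming of the paired-end reads
        f"fastp -i {sample}_R1.fq.gz -I {sample}_R2.fq.gz "
        f"-o {sample}_R1.trim.fq.gz -O {sample}_R2.trim.fq.gz",
        # 2) move the UMI from the read sequence into the read name
        #    (assumes the UMI sits at the 5' end of read 1)
        f"umi_tools extract --extract-method=string --bc-pattern={umi_pattern} "
        f"-I {sample}_R1.trim.fq.gz --read2-in={sample}_R2.trim.fq.gz "
        f"-S {sample}_R1.umi.fq.gz --read2-out={sample}_R2.umi.fq.gz",
        # 3) splice-aware alignment, coordinate-sorted BAM out
        f"STAR --genomeDir star_index --readFilesCommand zcat "
        f"--readFilesIn {sample}_R1.umi.fq.gz {sample}_R2.umi.fq.gz "
        f"--outSAMtype BAM SortedByCoordinate --outFileNamePrefix {sample}.",
        # dedup needs an indexed BAM
        f"samtools index {sample}.Aligned.sortedByCoord.out.bam",
        # 4) collapse PCR duplicates sharing mapping position + UMI
        f"umi_tools dedup --paired -I {sample}.Aligned.sortedByCoord.out.bam "
        f"-S {sample}.dedup.bam",
        # 5) gene-level counting of the deduplicated BAM
        f"htseq-count -f bam -r pos -s reverse {sample}.dedup.bam genes.gtf "
        f"> {sample}.counts.txt",
    ]
```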
When running the differential gene expression analysis in DESeq2, I used its default (median-of-ratios) normalization. From what I have read, this should be enough, as deduplicating by UMI already removed the PCR-amplification bias. On top of that, since DESeq2 compares the same gene across samples, whatever gene-length bias is present would be identical in all samples and "cancel out". Up to here, I think my analysis is sound.
I also need to investigate an effect of the treatment on expression as a function of gene length. DESeq2's normalization is not intended for within-sample comparisons, as it applies a single normalization factor to the whole sample. I have also been reading about CQN normalization. However, I am afraid that using it would be a mistake, as it estimates sample-specific gene-length biases, which are exactly what I am trying to measure.
My analysis so far has been:
- Divide the genes (after filtering by minimum expression & such) into 10 equal-sized bins by gene length.
- Calculate log1p(DESeq2-normalized counts) per gene and sample. I'll call this log1p_norm.
- Calculate the "fraction of total transcripts" that the genes of each gene-length bin represent out of the sample's total: sum(log1p_norm[genes in bin]) / sum(log1p_norm[all genes]), per sample.
- Finally, compare these ratios between treatments and timepoints in a myriad of ways.
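In case it helps to make the procedure concrete, the binning and per-sample fraction computation might be sketched as below (pandas/numpy; the function and variable names are mine, not from any established package):

```python
import numpy as np
import pandas as pd

def bin_fractions(norm_counts: pd.DataFrame, gene_lengths: pd.Series,
                  n_bins: int = 10) -> pd.DataFrame:
    """Fraction of total log1p(normalized counts) per gene-length bin, per sample.

    norm_counts  : genes x samples matrix of DESeq2-normalized counts
    gene_lengths : length (bp) per gene, indexed like norm_counts
    """
    log1p_norm = np.log1p(norm_counts)
    # 10 equal-sized bins by gene length (length deciles)
    bins = pd.qcut(gene_lengths.loc[norm_counts.index], q=n_bins,
                   labels=False, duplicates="drop")
    per_bin = log1p_norm.groupby(bins).sum()   # bins x samples
    # divide each sample's per-bin sums by that sample's total
    return per_bin / log1p_norm.sum(axis=0)    # fractions sum to 1 per sample
```

The fractions in each column sum to 1, so they can be compared across samples directly.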
The mean expression level of the genes in each bin (DESeq2-normalized counts, before the log1p transform) looks very similar, except for the two bins with the shortest genes.
edit: each of these groups/bins represents the mean expression, across 24 samples, of ~1,300 genes per group of increasing gene length. The samples are the same in all groups, but each gene is counted in exactly one group. This is not expression across the gene body.
However, my analyses never compare genes from two different bins. I simply compare the averages of bin X in condition 1 to the averages of the same bin in condition 2. Although I am not comparing a single gene between samples, can I apply the same logic as in DESeq2 normalization: comparing things (genes) that carry a bias against the same things carrying the same bias in a different sample, so that the bias cancels out?
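A minimal sketch of that per-bin, between-condition comparison (Mann-Whitney U is just one possible choice of test here; the input arrays are hypothetical per-sample bin fractions):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_bins(frac_treated: np.ndarray, frac_control: np.ndarray) -> list[float]:
    """frac_* : bins x samples arrays of per-sample bin fractions.

    Returns one two-sided Mann-Whitney p-value per gene-length bin,
    comparing treated vs control samples within that bin only.
    """
    return [
        mannwhitneyu(t, c, alternative="two-sided").pvalue
        for t, c in zip(frac_treated, frac_control)
    ]
```

With 10 bins tested at each timepoint, the resulting p-values would still need multiple-testing correction.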
I think I am in the clear and this analysis could be good enough for publishing. Do you agree? Do you see any flaw in the way I analyze this data, or any bias that I am forgetting to correct?
The original reason for normalizing by gene length is that the same molecule can give rise to multiple fragments, biasing counts toward longer genes. Because you are using UMIs, (in a perfect world) each molecule can give rise to only a single count, obviating this concern. Therefore you are safe in not normalizing by gene length.
One way to check would be to quantify your BAM files using raw read counts (rather than UMI-deduplicated counts) and verify that this reintroduces a gene-length bias.
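That check could be run as below (a sketch; `raw_counts` and `dedup_counts` would be hypothetical per-gene count vectors obtained by counting the same BAM before and after umi_tools dedup):

```python
import numpy as np
from scipy.stats import spearmanr

def length_bias(counts: np.ndarray, lengths: np.ndarray) -> float:
    """Spearman correlation between per-gene counts and gene length.

    Run once on raw (pre-dedup) counts and once on UMI-deduplicated
    counts: a markedly higher correlation in the raw counts indicates
    that fragmentation of long transcripts inflates read counts but
    not molecule counts.
    """
    rho, _ = spearmanr(counts, lengths)
    return rho
```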
Maybe there are different sorts of kits, but isn't the usual workflow that you still fragment the mRNA/cDNA first and then ligate UMI-tagged adapters to the fragmented products, followed by PCR? In that case longer transcripts will still produce more fragments to be sequenced, and the UMI "just" removes the redundancy from PCR amplification.
@OP, basically what you're looking for is differential transcript usage and differential transcript expression, right? If not, please describe how the drug would introduce the gene-length bias. Isn't it just that more longer or shorter isoforms are being expressed? Or is this some different mechanism? Does it require assembling the transcriptome to pick up novel isoforms? The analysis you describe is rather custom; maybe existing approaches apply here if you describe the actual scientific question a bit more.
Yes, it depends on whether you performed fragmentation before or after amplification, and on the specific method of fragmentation and UMI addition. I'm used to seeing fragmentation occur after amplification.
I am not testing for isoforms, but for a decline in mRNA abundance.
We are testing a drug that is supposed to slow down transcription. We hypothesize that longer genes should show a decline in abundance, as they take longer to transcribe, while shorter genes should be relatively impervious to its effect.
I am not looking at expression across the gene body of each gene, but at the total expression level (htseq-count, normalized by DESeq2) of each gene as a whole, grouping genes into 10 bins: the 0-10% shortest genes, 10-20%, ... 90-100% longest (~1,300 genes per group). Thus, what we want to compare is changes in the relative abundance of short/long genes in treated vs control samples at each timepoint. We don't care (at this point) about transcripts, but about
(counts of genes in length bin X) / (counts of all genes), per sample.
We did mRNA sequencing (polyA enriched) using Illumina NovaSeq6000 sequencing, Paired-End, 150 bp.
Apart from all this, I am indeed testing differential transcript usage with DEXSeq, but that analysis is beyond the scope of this post.