I am analyzing some BULK mRNA-seq data that included UMIs during the sequencing.
I don't have much experience analyzing bulk RNA-seq, and it is my first time dealing with UMIs there. I have more experience with single-cell RNA-seq, and I thought the concept of UMIs would translate directly between technologies (1 UMI = 1 mRNA molecule), but it clearly cannot. In bulk RNA-seq it seems that UMIs are used only to correct for PCR amplification. Even then, you still get many more counts for longer genes, leading to a gene-length bias even after removing PCR duplicates.
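To convince myself of this, I wrote a toy simulation (all numbers invented): two genes with the same number of mRNA molecules, where the longer gene simply yields more fragments, and each fragment carries its own UMI+position identity. Dedup removes the PCR copies but not the length effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: genes A and B are expressed at the SAME number of molecules,
# but B is 10x longer, so fragmentation yields ~10x more fragments per molecule.
molecules = 1000
fragments_A = np.arange(molecules * 1)    # 1 fragment per molecule
fragments_B = np.arange(molecules * 10)   # 10 fragments per molecule

# PCR amplification: each unique fragment is sequenced a random number of times.
reads_A = np.repeat(fragments_A, rng.integers(1, 6, size=fragments_A.size))
reads_B = np.repeat(fragments_B, rng.integers(1, 6, size=fragments_B.size))

# UMI dedup collapses reads back to unique fragments...
dedup_A = np.unique(reads_A).size   # 1000
dedup_B = np.unique(reads_B).size   # 10000

# ...so the PCR bias is gone, but gene B still gets 10x the counts.
print(dedup_A, dedup_B)
```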
I would appreciate it if you could help me correct some misconceptions and mistakes that I might be making in my analyses.
My samples are treatment vs control across 4 timepoints. They were sequenced using:
mRNA sequencing (poly(A)-enriched) on an Illumina NovaSeq 6000, paired-end, 150 bp reads.
My pipeline was:
- fastp to trim the reads
- umi_tools extract to move the UMI into the FASTQ read header
- alignment with STAR
- umi_tools dedup to remove PCR duplicates
- (when needed) merge multiple BAMs from a single sample into a single BAM file
- htseq-count to generate a counts matrix
- DESeq2 analysis
When running the differential gene expression analysis in DESeq2, I used its default (median-of-ratios) normalization. As far as I have read, this is enough, since deduplicating UMIs already removed the PCR amplification bias. On top of that, as DESeq2 compares the same gene across samples, any gene-length bias would be identical in all samples and "cancel out". Up to here, I think my analysis is sound.
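For reference, my understanding of the median-of-ratios normalization that DESeq2 applies by default, sketched in numpy on toy counts (the per-sample depth differences are invented; in R this is what `counts(dds, normalized=TRUE)` returns):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy counts: 500 genes x 6 samples, with injected depth factors 1,1,2,2,4,4.
counts = rng.poisson(100, (500, 6)) * np.array([1, 1, 2, 2, 4, 4])

# Reference = per-gene geometric mean across samples (genes with zeros excluded).
keep = (counts > 0).all(axis=1)
log_counts = np.log(counts[keep])
log_geo_mean = log_counts.mean(axis=1)

# Size factor per sample = median over genes of the ratio to the reference.
ratios = log_counts - log_geo_mean[:, None]
size_factors = np.exp(np.median(ratios, axis=0))

# Normalized counts: depth differences are divided out.
norm_counts = counts / size_factors
```

The recovered size factors come out roughly proportional to the injected depths (about 0.5, 0.5, 1, 1, 2, 2 here, since they are scaled to the geometric mean), and the normalized column means end up comparable across samples.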
I also need to investigate whether treatment affects expression as a function of gene length. DESeq2's normalization is not intended for WITHIN-sample comparisons, as it applies only a single normalization factor to the whole sample. I have also been reading about CQN normalization. However, I am afraid that using it would be a mistake: it estimates sample-specific gene-length biases, which are exactly the thing I am trying to measure.
My analysis so far has been:
- Divide the genes (after filtering by minimum expression and such) into 10 equal-sized bins by gene length.
- Calculate log1p( DESeq2-normalized counts ) per gene/sample. I'll call this log1p_norm.
- Calculate the "fraction of total" that the genes of each gene-length bin represent within each sample: sum( log1p_norm[genes in bin] ) / sum( log1p_norm[all genes] ).
- Finally, compare these ratios between treatments and timepoints in a myriad of ways.
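For concreteness, the binning and fraction computation above can be sketched in Python with pandas (the counts matrix and gene lengths here are simulated stand-ins for my real inputs):

```python
import numpy as np
import pandas as pd

# Stand-ins for the real data (names are mine):
# norm_counts: genes x samples DataFrame of DESeq2-normalized counts,
#              already filtered by minimum expression.
# gene_lengths: Series of gene lengths in bp, same gene index.
rng = np.random.default_rng(1)
genes = [f"g{i}" for i in range(13000)]
samples = [f"s{i}" for i in range(24)]
norm_counts = pd.DataFrame(rng.poisson(50, (13000, 24)),
                           index=genes, columns=samples)
gene_lengths = pd.Series(rng.integers(200, 20000, 13000), index=genes)

# 10 equal-sized bins by gene length (~1,300 genes each).
length_bin = pd.qcut(gene_lengths, q=10, labels=False)

# log1p of the normalized counts.
log1p_norm = np.log1p(norm_counts)

# Fraction of the sample-wide log1p total contributed by each bin.
bin_sums = log1p_norm.groupby(length_bin).sum()   # bins x samples
fractions = bin_sums / log1p_norm.sum(axis=0)     # each column sums to 1
```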
The mean expression level of the genes in each bin (DESeq2-normalized counts, before the log1p transform) looks very similar across bins, except for the two bins with the shortest genes.
edit: each of these groups/bins represents the mean expression, across 24 samples, of ~1,300 genes per group of increasing gene length. The samples are the same in all groups, but each gene is assigned to exactly one group. This is not expression across the gene body.
However, the analyses I perform never compare genes from two different bins. I simply compare the averages of bin X in condition 1 to the averages of the same bin in condition 2. Although I am not comparing a single gene between samples, can I apply the same logic as in DESeq2 normalization: comparing things (genes) that carry a bias against the same things with the same bias in a different sample, so that the bias cancels out?
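The per-bin comparison I have in mind can be sketched as a simple permutation test on the bin fractions (the 12-vs-12 sample split and the values are invented for illustration; my real fractions come from the binning step above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-bin fractions: rows = 10 length bins, columns = samples.
frac_treated = rng.uniform(0.08, 0.12, size=(10, 12))
frac_control = rng.uniform(0.08, 0.12, size=(10, 12))

def perm_pvalue(a, b, n_perm=2000, rng=rng):
    """Two-sided permutation test on the difference of group means."""
    obs = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        null = perm[:a.size].mean() - perm[a.size:].mean()
        hits += abs(null) >= abs(obs)
    return (hits + 1) / (n_perm + 1)  # +1 avoids p == 0

# One test per bin: bin b under treatment vs the SAME bin under control.
# Genes never cross bins, so any within-bin length bias is shared.
pvals = np.array([perm_pvalue(frac_treated[b], frac_control[b])
                  for b in range(10)])
```

Two caveats I am aware of: the 10 tests still need multiple-testing correction (e.g. Benjamini-Hochberg), and the bins are not independent, since the fractions sum to 1 within each sample, so a real increase in one bin forces the others down.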
I think I am in the clear and this analysis could be good enough for publishing. Do you agree? Do you see any flaw in the way I am analyzing this data, or any bias I am forgetting to correct?