I am using BULK rna-seq which includes UMI added after cDNA fragmentation, before PCR amplification (see my previous post BULK mRNA-seq with UMIs. Do I need to normalize by gene length? ).
I am trying to analyze the effects of a drug which might affect genes differently depending on their gene length, so I am trying to correct as much as possible the effects of gene length bias.
(I am totally aware of the existence of transcript quantification methods like salmon and kallisto. I am already using/planning to use them in parallel)
My input is: UMI-deduplicated splicing-aware mapped bam files. This should have a number of counts per gene proportional to real mRNA-molecule count gene/transcript length. After UMI-deduplication the gene-lenght bias caused by PCR amplification should* be gone. However, there still exists a gene-length bias driven by the number of fragments in longer genes.
Would it make sense to do the following:
bedtools genomecov -bg ....on each bam file. This should give me the number of UMI-deduplicated reads that overlap in the sequence. Intuitively, if 2 or more (UMI-deduplicated) reads overlap in the same position (for more than X bp), that would suggest that these 2 reads originated from fragments from 2 different mRNA molecules. (Because a single molecule's fragments shouldn't overlap, right?)
- Get the highest number of overlapping reads/max coverage of each gene.
Using a combination of
bedtools intersect ... -wa -wbof the previous result with a .gtf annotation file, followed by
bedtools merge ... -o max. This would give me, the highest value of the calculated coverage in each gene --> maximum overlap of reads in a single position within the gene --> X fragments originating from X actual real mRNA molecules detected in that gene.
I understand that this approach will underestimate the actual expression. You could have 10 molecules but happen to sequence only like 7 fragments overlapping in a given position.
Does this sound stupid? Do you have any idea on how to improve it?
PS: Yeah, I am planning to compare these results vs other quantification metrics like whole-gene htseq-count, htseq-count divided by gene length, kallisto quantification aggregating per gene.