Question

BULK mRNA-seq with UMIs. Does it make sense to run bedtools genomecov?

0

Entering edit mode

2.3 years ago

txema.heredia ▴ 250

Hi,

I am using BULK rna-seq which includes UMI added after cDNA fragmentation, before PCR amplification (see my previous post BULK mRNA-seq with UMIs. Do I need to normalize by gene length? ).

I am trying to analyze the effects of a drug which might affect genes differently depending on their gene length, so I am trying to correct as much as possible the effects of gene length bias.

(I am totally aware of the existence of transcript quantification methods like salmon and kallisto. I am already using/planning to use them in parallel)

My input is: UMI-deduplicated splicing-aware mapped bam files. This should have a number of counts per gene proportional to real mRNA-molecule count gene/transcript length. After UMI-deduplication the gene-lenght bias caused by PCR amplification should* be gone. However, there still exists a gene-length bias driven by the number of fragments in longer genes.

Would it make sense to do the following:

bedtools genomecov -bg .... on each bam file. This should give me the number of UMI-deduplicated reads that overlap in the sequence. Intuitively, if 2 or more (UMI-deduplicated) reads overlap in the same position (for more than X bp), that would suggest that these 2 reads originated from fragments from 2 different mRNA molecules. (Because a single molecule's fragments shouldn't overlap, right?)

Get the highest number of overlapping reads/max coverage of each gene. Using a combination of bedtools intersect ... -wa -wb of the previous result with a .gtf annotation file, followed by bedtools merge ... -o max. This would give me, the highest value of the calculated coverage in each gene --> maximum overlap of reads in a single position within the gene --> X fragments originating from X actual real mRNA molecules detected in that gene.

I understand that this approach will underestimate the actual expression. You could have 10 molecules but happen to sequence only like 7 fragments overlapping in a given position.

Does this sound stupid? Do you have any idea on how to improve it?

PS: Yeah, I am planning to compare these results vs other quantification metrics like whole-gene htseq-count, htseq-count divided by gene length, kallisto quantification aggregating per gene.

RNAseq gene-length normalization • 704 views

ADD COMMENT • link 2.3 years ago by txema.heredia ▴ 250