My idea is to estimate what fraction of transcripts terminate at position x on the chromosome versus reading through.
I have two matched datasets: conventional RNA-Seq and Term-Seq (an adapter is ligated to the RNA 3' end, so the sequencing library is enriched for 3' termini only).
I aligned both datasets with Bowtie2
I quantified Term-Seq depth at all genomic positions with samtools depth
I counted reads aligning to each gene feature with HTSeq
I've tried to normalise the Term-Seq depth data to reads per million (RPM) at each genomic position using this formula:
RPM = depth / ((sum(depth) / read_length) / 1,000,000)
I divide sum(depth) by the read length because each read contributes to the depth at multiple positions, so sum(depth)/read_length approximates the total number of aligned reads.
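A minimal sketch of that normalisation, assuming `depth` is a mapping of position to samtools-depth value and that a single mean read length is a reasonable stand-in for the library (both are simplifications of my actual data):

```python
def rpm_normalise(depth, read_len):
    """Convert per-position depths to reads per million.

    sum(depth) counts every read once per covered base, so dividing
    by the read length approximates the total number of aligned reads.
    """
    total_reads = sum(depth.values()) / read_len  # approx. library size
    scale = total_reads / 1_000_000               # per-million factor
    return {pos: d / scale for pos, d in depth.items()}

# toy example: 4 positions, read length 2 -> ~5 "reads" in total
depths = {100: 4, 101: 4, 102: 1, 103: 1}
rpm = rpm_normalise(depths, read_len=2)
```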
I normalised the conventional RNA-Seq counts using the RPKM and TPM methods (I tried both).
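For reference, the TPM version of that normalisation, sketched with hypothetical gene counts and lengths (real lengths would come from the annotation used with HTSeq):

```python
def tpm(counts, lengths):
    """counts: reads per gene from htseq-count; lengths: gene length in bp."""
    # rate = reads per kilobase of gene model
    rate = {g: counts[g] / (lengths[g] / 1000) for g in counts}
    # scale so the values sum to one million across all genes
    scale = sum(rate.values()) / 1_000_000
    return {g: r / scale for g, r in rate.items()}

# hypothetical two-gene example
counts = {"geneA": 200, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 1500}
vals = tpm(counts, lengths)
```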
I then compared the normalised Term-Seq depth at a given 3' end to the normalised RNA-Seq value for the closest upstream gene (allowing at most 50 bp between the gene end and the 3' end).
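The pairing step looks roughly like this; it's a simplified sketch that only handles the plus strand, with hypothetical gene 3'-end coordinates:

```python
MAX_DIST = 50  # bp, the cutoff described above

def match_upstream_gene(peak_pos, gene_ends):
    """gene_ends: {gene: 3'-end coordinate}. Returns (gene, dist) or None.

    The peak must lie downstream of the gene end, within MAX_DIST bp;
    the closest qualifying gene wins.
    """
    best = None
    for gene, end in gene_ends.items():
        dist = peak_pos - end
        if 0 <= dist <= MAX_DIST and (best is None or dist < best[1]):
            best = (gene, dist)
    return best

gene_ends = {"geneA": 1000, "geneB": 2000}
match = match_upstream_gene(1030, gene_ends)  # geneA, 30 bp downstream
```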
My logic is that after normalisation both measures are essentially proportions of their total libraries. E.g. if
0.2% of the total RNA-Seq signal comes from gene A, and
0.1% of total transcript 3' ends are located immediately downstream of gene A,
then 50% of gene A transcripts have their 3' termini at that position.
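The arithmetic behind that worked example, as I'm computing it (both inputs are fractions of their respective libraries, so their ratio is the termination estimate):

```python
# fractions from the toy example above
rnaseq_fraction = 0.002    # 0.2% of RNA-Seq signal is gene A
termseq_fraction = 0.001   # 0.1% of 3' ends sit just downstream of gene A

# percentage of gene A transcripts terminating at that position
pct_terminating = 100 * termseq_fraction / rnaseq_fraction  # 50.0
```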
However, the numbers I'm getting for % termination are too high (mean 150%, max 1115%), so I have gone wrong somewhere.
Any thoughts on this? Is my normalisation method wrong? Is this not possible because RNA-Seq gives only relative quantification? Or is it not possible to account for the difference in read distribution (full-length transcripts vs 3' termini only)?