Question: apply Negative Binomial Distribution (NBD) to ribosome profiling data.
4.9 years ago
xiangwulu80 wrote:


I want to apply the Negative Binomial Distribution in my ribo-seq data simulation in order to mimic real data.

I am doing this because I want to compare the simulated data with the analysis and results from real human ribo-seq data, for another part of my work.

I have: 

- a set of human RefSeq transcripts (e.g. the NM_ entries) as the source for the simulation

- a read length distribution of 26-32 bp (derived from real ribo-seq data)

Real ribo-seq data have the characteristic that the footprint on a transcript differs between the sub-codon positions and reflects the correct Open Reading Frame.

I thought the distribution would mainly reflect this. 

But I am very confused about where to start, e.g. how to map the distribution model onto my case. I would appreciate any hints or advice on this, thanks.


The NBD is usually applied to the total read count found at a gene. You can define that gene however you like, but we shouldn't be talking about codons or read lengths or even the number of transcripts. I think you're confusing several issues. Which of these numbers did you mean to simulate?
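To make the gene-level usage concrete, here is a minimal sketch in Python (standard library only) that draws NBD read counts per gene as a gamma-Poisson mixture. The gene names, means, dispersion value and function names are all hypothetical and chosen for illustration; nothing here comes from the original data.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method, done in chunks of
    at most 500 so that exp(-step) stays well away from underflow."""
    total = 0
    while lam > 0:
        step = min(lam, 500.0)
        threshold = math.exp(-step)
        prod = rng.random()
        while prod > threshold:
            total += 1
            prod *= rng.random()
        lam -= step
    return total

def sample_nb(mean, dispersion, rng):
    """Negative binomial draw with variance mean + mean**2 / dispersion,
    obtained as a Poisson whose rate is gamma distributed."""
    if mean <= 0:
        return 0
    lam = rng.gammavariate(dispersion, mean / dispersion)  # shape, scale
    return sample_poisson(lam, rng)

rng = random.Random(0)
# Hypothetical expected total read counts per gene (assumed values).
gene_means = {"geneA": 50.0, "geneB": 200.0, "geneC": 5.0}
# Three simulated replicates per gene, assumed dispersion of 2.0.
counts = {gene: [sample_nb(mu, 2.0, rng) for _ in range(3)]
          for gene, mu in gene_means.items()}
print(counts)
```

A smaller dispersion value gives more replicate-to-replicate variability for the same mean; as the dispersion grows, the counts approach a plain Poisson.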


@karl hi, thanks for your reply. Maybe I didn't explain my problem clearly, sorry about that.

The read length and number of transcripts are secondary; there is no need to apply the NBD there.

I think the codons, or "the number of reads that fall in the different reading frames", is the question I am thinking about.

If the reads are randomly sampled then, after aligning to the reference, the read footprint could look like this (from the SAM file, count the number of alignments at each position; the 3 colors mark the different reading frames (+1, +2, +3)):

But ideally they do not just fall randomly across the transcript; they have high counts at some positions and low or zero counts at others, e.g.

(from the SAM file, count the number of alignments at each position; the 3 reading frames are in 3 separate plots)
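The per-position, per-frame counting described above could be sketched roughly as follows. This is only an illustration under several assumptions: reads are aligned to the transcripts themselves, the 5' end of each alignment is used as the position (no P-site offset is applied), reverse-strand and unmapped records are skipped, and `cds_start` is a hypothetical lookup of the 1-based CDS start for each transcript. Frames are numbered 0, 1, 2 here rather than +1, +2, +3.

```python
from collections import defaultdict

def frame_profiles(sam_lines, cds_start):
    """Count alignment start positions per transcript, split by frame.

    sam_lines: iterable of SAM text lines (header lines are ignored).
    cds_start: dict of transcript name -> 1-based position of the first
               CDS base (hypothetical input, e.g. from the annotation).
    Returns {transcript: [counts_frame0, counts_frame1, counts_frame2]},
    where each counts_frameN maps position -> number of alignments.
    """
    profiles = defaultdict(lambda: [defaultdict(int) for _ in range(3)])
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4 or flag & 0x10 or rname not in cds_start:
            continue  # unmapped, reverse strand, or unknown transcript
        frame = (pos - cds_start[rname]) % 3
        profiles[rname][frame][pos] += 1
    return profiles

# Tiny made-up example: one hypothetical transcript with its CDS at 101.
seq, qual = "A" * 28, "I" * 28
example = [
    "@SQ\tSN:NM_X\tLN:1000",
    "\t".join(["r1", "0", "NM_X", "101", "255", "28M", "*", "0", "0", seq, qual]),
    "\t".join(["r2", "0", "NM_X", "102", "255", "28M", "*", "0", "0", seq, qual]),
    "\t".join(["r3", "0", "NM_X", "104", "255", "28M", "*", "0", "0", seq, qual]),
]
profiles = frame_profiles(example, {"NM_X": 101})
for frame, per_position in enumerate(profiles["NM_X"]):
    print(frame, dict(per_position))
```

Each of the three dictionaries can then be plotted on its own axis, or overlaid in three colors, to see whether one frame dominates.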




4.7 years ago
xiangwulu80 wrote:

Sorry for the confusion in my question, I was confused for a while too. Now I have figured it out.

Look at the plot:

Comparison of profiles from real human ribo-seq data and NBD-sampled variates. Typically, the footprint of real ribo-seq data (top plot) is 0 at many positions, with peaks and explicit (or implicit) triplet periodicity.

I want to do some tests with simulated ribo-seq data, and I want the profile of the simulated data to look like the real data (middle plot).

Not like this (data simulated with another RNA-seq simulator):  http://

When I talked about the ORFs and codons, I meant that the profiles of the 3 separate frames in an ORF differ depending on whether it is translated (top plot: red, green, blue), so in the simulation the data should be simulated separately for each individual frame (bottom plot), to reflect the real data (ideally).
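One possible way to simulate each frame separately is sketched below in Python (standard library only). It is not the code behind the plots: the frame weights, per-codon mean, dispersion and all function names are assumptions made up for illustration. Each frame gets its own NBD mean, and a small dispersion value produces many zeros with occasional peaks.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw (Knuth's method), chunked to avoid exp() underflow."""
    total = 0
    while lam > 0:
        step = min(lam, 500.0)
        threshold = math.exp(-step)
        prod = rng.random()
        while prod > threshold:
            total += 1
            prod *= rng.random()
        lam -= step
    return total

def sample_nb(mean, dispersion, rng):
    """Negative binomial draw as a gamma-Poisson mixture;
    variance is mean + mean**2 / dispersion."""
    if mean <= 0:
        return 0
    return sample_poisson(rng.gammavariate(dispersion, mean / dispersion), rng)

def simulate_orf_profile(n_codons, codon_mean, frame_weights, dispersion, rng):
    """Return three lists (one per frame) with one simulated count per codon.

    frame_weights: share of the per-codon mean assigned to each frame,
                   e.g. (0.7, 0.2, 0.1) for a translated ORF (assumed values).
    """
    frames = [[], [], []]
    for _ in range(n_codons):
        for frame, weight in enumerate(frame_weights):
            frames[frame].append(sample_nb(codon_mean * weight, dispersion, rng))
    return frames

rng = random.Random(1)
# Hypothetical settings: 100 codons, mean of 10 reads per codon,
# most reads in frame 0, strong overdispersion (0.2).
frames = simulate_orf_profile(100, 10.0, (0.7, 0.2, 0.1), 0.2, rng)
for frame, profile in enumerate(frames):
    zeros = sum(1 for c in profile if c == 0)
    print(frame, "total:", sum(profile), "zero positions:", zeros)
```

For an untranslated region, roughly equal frame weights would remove the periodicity. The read positions implied by these counts would still need to be turned into reads with lengths drawn from the 26-32 bp distribution mentioned in the question.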

