Applying the negative binomial distribution (NBD) to ribosome profiling data
9.0 years ago
xiangwulu ▴ 120


I want to use the negative binomial distribution in my ribo-seq data simulation so that the simulated data mimic real data.

I am doing this because another part of my work analyses real human ribo-seq data, and I want to compare the simulated results against that analysis.

I have:

  • a number of RefSeq human transcripts (e.g. NM_ accessions) as the source for the simulation
  • a read-length distribution from 26 bp to 32 bp (derived from real ribo-seq data)

Real ribo-seq data have the characteristic that the footprint coverage of a transcript differs between the three sub-codon positions and reflects the correct open reading frame.

I thought the distribution should mainly capture this.

But I am very confused about where to start, e.g. how to map the distribution model onto my case. I would appreciate any hints or advice on this, thanks.

negative-binomial-distribution ribo-seq • 2.5k views

The NBD is usually applied to the total read count summed over a gene. You can define that gene however you like, but then we shouldn't be talking about codons, read lengths, or even the number of transcripts. I think you're conflating several issues. Which of these numbers do you mean to simulate?
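To make the gene-level usage concrete, here is a minimal sketch, not a definitive implementation. It assumes the common mean/dispersion parameterisation (variance = mean + mean²/size) and draws NB counts as a gamma-Poisson mixture using only the Python standard library. The gene names, mean counts, and the dispersion value 2.0 are made up for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's method, in chunks of at most 500
    so that exp(-chunk) does not underflow for large lam."""
    total = 0
    while lam > 0.0:
        chunk = min(lam, 500.0)
        threshold = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        total += k
        lam -= chunk
    return total


def sample_nb(mean, size, rng):
    """One NB draw as a gamma-Poisson mixture.
    mean: expected count; size: dispersion (variance = mean + mean**2 / size)."""
    lam = rng.gammavariate(size, mean / size)  # shape=size, scale=mean/size
    return sample_poisson(lam, rng)


rng = random.Random(0)
# Hypothetical expected read counts per gene (assumption, not from real data).
gene_means = {"geneA": 20.0, "geneB": 5.0, "geneC": 150.0}
gene_counts = {
    gene: [sample_nb(mean, 2.0, rng) for _ in range(4)]  # 4 simulated replicates
    for gene, mean in gene_means.items()
}
for gene, counts in gene_counts.items():
    print(gene, counts)
```

A smaller `size` gives more overdispersion; as `size` grows, the draws approach a plain Poisson with the same mean.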


@karl Hi, thanks for your reply. Maybe I didn't explain my problem clearly, sorry about that.

The read length and the number of transcripts are secondary; there is no need to apply the NBD to them.

What I am thinking about is the codons, or rather the number of reads that fall in each reading frame.

If the reads are sampled at random, then after alignment to the reference the read footprint could look like this (from the SAM file, count the number of alignments at each position; the 3 colors mark the different reading frames (+1, +2, +3)):

Ideally, though, the reads do not fall randomly across the whole transcript: they have high counts at some positions and low or zero counts at others, e.g.

(from the SAM file, count the number of alignments at each position; the 3 reading frames are shown in 3 separate plots)
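The contrast described here can be sketched roughly as follows. This is an illustration under assumed values, not a fitted model: one profile places reads uniformly at random along a transcript, the other draws each position's count from an NB with the same mean but strong overdispersion. The transcript length (300), mean depth (3.0), and dispersion (0.2) are arbitrary choices.

```python
import math
import random


def nb_count(mean, size, rng):
    """One NB draw via a gamma-Poisson mixture (variance = mean + mean**2/size)."""
    lam = rng.gammavariate(size, mean / size)
    count = 0
    while lam > 0.0:                 # Knuth's Poisson sampler, applied in chunks
        chunk = min(lam, 500.0)      # keeps exp(-chunk) away from underflow
        threshold = math.exp(-chunk)
        p = rng.random()
        while p > threshold:
            count += 1
            p *= rng.random()
        lam -= chunk
    return count


def uniform_profile(n_positions, n_reads, rng):
    """Place each read at a uniformly random position and count per position."""
    counts = [0] * n_positions
    for _ in range(n_reads):
        counts[rng.randrange(n_positions)] += 1
    return counts


def nb_profile(n_positions, mean, size, rng):
    """Independent NB count at every position."""
    return [nb_count(mean, size, rng) for _ in range(n_positions)]


def zero_fraction(profile):
    """Fraction of positions with no reads."""
    return sum(1 for c in profile if c == 0) / len(profile)


rng = random.Random(42)
n_positions, mean_depth = 300, 3.0   # hypothetical transcript length and depth
flat = uniform_profile(n_positions, int(n_positions * mean_depth), rng)
peaky = nb_profile(n_positions, mean_depth, 0.2, rng)  # size=0.2 is arbitrary
print("uniform: zero fraction %.2f, max %d" % (zero_fraction(flat), max(flat)))
print("NB:      zero fraction %.2f, max %d" % (zero_fraction(peaky), max(peaky)))
```

With a small `size`, the NB profile should show many empty positions alongside a few tall peaks, whereas the uniform profile stays close to the mean depth everywhere.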

8.9 years ago
xiangwulu ▴ 120

Sorry for the confusion in my question; I was confused for a while too. Now I have figured it out.

Look at the plot:

Comparison of profiles from real human ribo-seq data and NBD-sampled variates. Typically, the footprint of real ribo-seq data (top plot) is 0 at many positions, and there are peaks and explicit (or implicit) triplet periodicity.

I want to run some tests with simulated ribo-seq data, and I want the profile of the simulated data to look like the real data (middle plot).

Not like this (data simulated with another RNA-seq simulator): http://

When I talked about ORFs and codons, I meant that the profiles of the 3 separate frames in an ORF differ depending on whether it is translated (top plot: red, green, blue). So in the simulation, the data should be simulated separately for each individual frame (bottom plot) to reflect the real data (ideally).
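A rough sketch of the per-frame idea, using only the Python standard library. It is one possible reading of the approach, not the exact procedure behind the plots: each nucleotide position in an ORF gets an NB count whose mean depends on its frame (position index modulo 3). Frames are indexed 0, 1, 2 here, corresponding to +1, +2, +3 above. The frame means (6.0, 1.0, 0.5), the dispersion 0.3, and the ORF length of 200 codons are invented for illustration.

```python
import math
import random


def nb_count(mean, size, rng):
    """One NB draw via a gamma-Poisson mixture (variance = mean + mean**2/size)."""
    lam = rng.gammavariate(size, mean / size)
    count = 0
    while lam > 0.0:                 # Knuth's Poisson sampler, applied in chunks
        chunk = min(lam, 500.0)      # keeps exp(-chunk) away from underflow
        threshold = math.exp(-chunk)
        p = rng.random()
        while p > threshold:
            count += 1
            p *= rng.random()
        lam -= chunk
    return count


def simulate_orf_profile(n_codons, frame_means, size, rng):
    """Per-nucleotide counts for an ORF; position i belongs to frame i % 3."""
    return [nb_count(frame_means[i % 3], size, rng) for i in range(3 * n_codons)]


def frame_totals(profile):
    """Sum the counts that fall in each of the three frames."""
    return [sum(profile[frame::3]) for frame in range(3)]


rng = random.Random(7)
frame_means = (6.0, 1.0, 0.5)        # hypothetical: frame 0 is the translated frame
profile = simulate_orf_profile(200, frame_means, 0.3, rng)
totals = frame_totals(profile)
print("first 12 positions:", profile[:12])
print("reads per frame:", totals)
```

Setting all three frame means equal would remove the triplet periodicity, which could serve as a simple stand-in for an untranslated region.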

