Question

Determining gene expression with 2 biological replicates and RPKM data only

1

Entering edit mode

9.2 years ago

Kristin Muench ▴ 620

A friend asked me to help him figure out if a certain group of genes is expressed in some RNA-Seq data he found on GEO. The database only provides RPKM data, so I cannot feed the raw count data or .bam files into DESeq, edgeR, or cuffdiff. Also, there are only two biological replicates.

I know this isn't an ideal situation, but is there anything I can tell him/anything I can do to ballpark with reasonable confidence which genes are expressed?

I was thinking of doing something like: 1. log2 of RPKM data 2. Observe expression values for housekeeping genes 3. Set a 'this counts as expression' threshold below the housekeeping gene values 3. Look for where the expression values for the genes of interest fall relative to the threshold...of course, this doesn't account for noise at all.

How can I try to characterize the noise with so few biological replicates? Should I try to calculate fold changes relative to housekeeping genes? Is all hope lost unless I can get the raw count data?

Thank you for your insight!

RPKM RNA-Seq • 4.8k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Kristin Muench ▴ 620

Ram · Answer 1 · 2015-01-29

2

Entering edit mode

9.2 years ago

TriS ★ 4.7k

I believe you refer to only one condition, right? how many genes is your friend looking at?

You could try to run bootstrapping analysis, that will tell you if your set of genes has higher/lower expression than random sets of genes of the same size of yours (I'd log the data before starting since RPKM data are heavily skewed on the left). you can do that in R, there is even a boot function, or you can write your own using sample(), replicate(), mean() and median():

calculate the mean/median expression of your group of genes
randomly sample 10.000 times a set of genes of the same size of your friend's signature (i.e. if you have 100 genes in two replicates (=200 values) than the sampled data will contain 200 values, or you can get the mean of the two replicates and use the merged data, in this case you will extract sets of 100 values)
at each sampling extract the mean/median expression of the genes (i.e. 10k times = 10k mean/medians)
plot the distribution of the mean/median of the data you bootstrapped (histogram made of 10k observations), those are your "expected" values
add a line that indicates the expression of your initial gene list, those are the observed values
calculate empirical pvalue by: (# observations on the left/right of your observed data)+1/(10.001)

theoretically the results from mean and median should be pretty similar...

you might even want to check the initial distribution of the signature that you have, if it's uniform great, otherwise if you get a bimodal distribution you could divide your bootstrapping in two, for 1) the lowly expressed genes and 2) the highly expressed genes. this will tell you if the expression of 1) is lower than random sets of genes and if 2) is higher than random sets of genes...but I need to double think this, getting late and sleepy...

...anyway...hopefully this helps :)

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by TriS ★ 4.7k

0

Entering edit mode

That's correct - just one condition.

Thank you for the wonderful, detailed suggestion! I'll try it out!

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Kristin Muench ▴ 620

0

Entering edit mode

@Kristin Muench did you do these steps in R? If so, do you mind to share the code?

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 7.7 years ago by Learner ▴ 280

0

Entering edit mode

I have two follow-up questions:

My understanding is that this technique is to see if the whole set of genes (that is, the mean of the set of genes I'm looking at - turns out only 4 genes :O) is, on average, expressed above or below what would be expected by random chance.

What would I do if I wanted to see if only one gene was expressed by random chance? Instead of bootstrapping, would I just use a t-test to see if the expression of my gene was in an extreme tail of expression values for my full dataset?

Also: most (over half!) of my genes have an expression value of 0, so that the distribution of the RPKM data looks like a normal distribution with a gigantic tower built at 0 to the left of it. I should still include all those genes with expression=0 in my bootstrap, right?

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Kristin Muench ▴ 620

0

Entering edit mode

And one final comment:

I found that there were 880 bootstrap-produced means greater than the mean of my four genes of interest (which, in turn, are the averages of two samples), and 9120 bootstrap-produced means less than the mean of my four genes of interest. Using the above formula, I got an empirical p-value of 0.0001096 . This seems so high, given that my observed mean is only in the top 10% of bootstrap-produced means! How should I interpret this number?

Thank you again for any insight!

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Kristin Muench ▴ 620