Question

CCLE miRNA expression data

0

4.9 years ago

baris.blknt • 0

Hi,

Recently, CCLE released miRNA expression data. I was looking for the normalization method used for this data but couldn't find it. When I sum all miRNA expression values for a sample (cell line), the totals vary widely between cell lines, as they would in unnormalized data. Does this data need any further normalization, and do you have any suggestions?

You can find the expression data at the following link: https://portals.broadinstitute.org/ccle/data

Best,

CCLE miRNA Nanostring • 3.6k views

ADD COMMENT • link updated 3.5 years ago by venugopal887 • 0 • written 4.9 years ago by baris.blknt • 0

0

Hi. I am new to CCLE data, and at present it is very valuable information for me. I have been struggling for a week to make sense of the CCLE miRNA data. Could you please share anything that would help me understand and process it? I also don't understand the data type: what kind of values does that file contain (for example, raw read counts)?

Please help me if you see my message. Thank you very much. My email is venugopal887@gmail.com; feel free to write any time.

ADD REPLY • link 3.5 years ago by venugopal887 • 0

score 0 · Answer 1 · 2019-05-20

0

4.9 years ago

shawn.w.foley ★ 1.3k

From Ghandi et al. (2019) Nature, it looks like the miRNAs were measured via NanoString and normalized using the nSolver software. They don't go into much detail, but the Methods section states:

Samples were divided into 14 batches, and two replicates of the K-562 cell line were included in each batch as a control. Internal positive and negative controls were used for normalization as recommended by NanoString using NanoString nSolver software. We excluded samples that failed NanoString nSolver quality control as well as one sample based on low positive control signal (normalization coefficient >6) and another sample based on high background signal (with second ranked negative control value >80). To estimate the background signal, we sorted the values for the negative controls within each sample and picked the second highest value as the background estimate. The median background estimate across all cell lines was 26.1. We used log(50 + N), in which N is the nSolver normalized value to reduce the effect of the background signal in the downstream analyses.
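To make the background and transform steps in that passage concrete, here is a minimal R sketch on made-up numbers. The control counts, sample names, and `N` values are all hypothetical, and the quoted text does not state the base of the logarithm, so the natural log used here is an assumption.

```r
# Toy negative-control counts for three hypothetical samples (made-up numbers).
neg_ctrl <- cbind(
  sampleA = c(12, 30, 8, 22, 15, 19),
  sampleB = c(40, 25, 18, 33, 27, 21),
  sampleC = c(10, 95, 85, 14, 9, 11)
)

# Background estimate per sample: the second-highest negative-control value.
second_highest <- function(x) sort(x, decreasing = TRUE)[2]
background <- apply(neg_ctrl, 2, second_highest)

# The quoted Methods text excludes a sample whose background estimate is above 80.
high_background <- background > 80

# Transform hypothetical nSolver-normalized values N as log(50 + N).
# Natural log is an assumption; the quoted passage does not give the base.
N <- c(0, 26.1, 500, 20000)
log_expr <- log(50 + N)

print(background)
print(high_background)
print(log_expr)
```

The offset of 50 sits above the reported median background of 26.1, so values near background are compressed toward log(50) instead of spreading out on the log scale.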

ADD COMMENT • link 4.9 years ago by shawn.w.foley ★ 1.3k

0

Shawn, thank you for the reply. However, I want to ask this: when I sum all miRNA expression values for each cell line, I observe a 9-fold difference between some cell lines. Do you think that is normal?

ADD REPLY • link 4.9 years ago by baris.blknt • 0

1

That's definitely a bit of a red flag, so I started to dig into the data a bit. In my hands, the extreme data ranges seem to be outliers. I reformatted and read in the miRNA data (each column is a cell line, each row is a miRNA), and I found:

# Read the reformatted GCT file: rows are miRNAs, columns are cell lines
mir <- read.table('CCLE_miRNA_20181103.reformat.gct', header=TRUE, row.names=1, as.is=TRUE, sep='\t')
# Total signal per cell line
cellLines <- colSums(mir)
quantile(cellLines, probs=seq(0,1,0.1))
#         0%        10%        20%        30%        40%        50%        60% 
#   36710.01  137371.74  163233.40  192464.58  214991.96  244239.64  275519.66 
#        70%        80%        90%       100% 
#  311112.12  371830.14  465336.89 1218169.48 
465336.89/137371.74
# [1] 3.387428

So, while there's a large range of expression, there's only a 3.4-fold difference between the 10% and 90% quantiles. Additionally, if you plot the log of these data as plot(density(log10(cellLines))) it'll generate an approximately normal curve. So it does appear that the large variance is occurring at the extreme ends of the spectrum.
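If you want to see which cell lines sit at those extremes, one option is to flag them on the log scale. The sketch below uses simulated totals in place of `colSums(mir)`, so the cell line names and values are hypothetical apart from the minimum and maximum copied from the quantile output above. The 1.5 x IQR rule (Tukey fences) is my own choice of cutoff, not something taken from the paper.

```r
# Simulated per-cell-line totals standing in for colSums(mir); not the real data.
set.seed(1)
cellLines <- 10^rnorm(200, mean = 5.4, sd = 0.2)
names(cellLines) <- paste0("line", seq_along(cellLines))

# Append two extreme totals, using the min and max from the quantile output.
cellLines["line_low"] <- 36710.01
cellLines["line_high"] <- 1218169.48

# Fold difference between the 90% and 10% quantiles, as computed in the post.
fold_10_90 <- unname(quantile(cellLines, 0.9) / quantile(cellLines, 0.1))

# Flag cell lines outside 1.5 * IQR of the log10 totals (Tukey fences).
logTotals <- log10(cellLines)
q <- unname(quantile(logTotals, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- names(logTotals)[logTotals < lower | logTotals > upper]

print(fold_10_90)
print(outliers)
```

On the real data you would drop the simulation lines and start from `cellLines <- colSums(mir)`; the flagged names are then candidates to inspect more closely, not samples to remove automatically.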

The paper also specified that they normalized to the NanoString positive and negative controls. If the miRNA panel is like the gene expression panels I've analyzed, then it includes a set of spike-in and endogenous standards to normalize for RNA input. Everything seems above board to me.
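For intuition about what positive-control normalization does, here is a rough R sketch of the commonly described approach: take the geometric mean of the positive-control counts in each lane, then scale each lane so that its geometric mean matches the average across lanes. This is my understanding of the general idea, not a confirmed description of what nSolver or the CCLE authors did, and the lane names and counts are made up.

```r
# Toy positive-control counts for three hypothetical lanes (made-up numbers).
# laneB has half the signal of laneA, and laneC has double.
pos_ctrl <- cbind(
  laneA = c(20000, 5000, 1200, 300, 80, 20),
  laneB = c(10000, 2500, 600, 150, 40, 10),
  laneC = c(40000, 10000, 2400, 600, 160, 40)
)

# Geometric mean of the positive controls in each lane.
geo_mean <- function(x) exp(mean(log(x)))
lane_geo <- apply(pos_ctrl, 2, geo_mean)

# Per-lane scaling factor: average geometric mean divided by the lane's own.
# (Assumed form of the normalization coefficient; not taken from the paper.)
scale_factor <- mean(lane_geo) / lane_geo

# Multiply each lane (column) by its factor; the same factors would be
# applied to the endogenous miRNA counts from that lane.
scaled <- sweep(pos_ctrl, 2, scale_factor, `*`)

print(round(scale_factor, 3))
print(apply(scaled, 2, geo_mean))
```

Under this assumed form, a lane with weak positive-control signal gets a large factor, which would fit the paper excluding a sample with a normalization coefficient above 6.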