Question

verification of CPM values

0

Entering edit mode

5.5 years ago

maryak ▴ 20

Is there any way to determine that have i calculated the right CPM values from htseq -counts data? i calculated CPM using R

RNA-Seq R • 3.6k views

ADD COMMENT • link 5.5 years ago by maryak ▴ 20

score 0 · Answer 1 · 2018-10-10

0

Entering edit mode

5.5 years ago

Kevin Blighe 87k

Hey, please paste the commands that you have used, right from the word 'go' (i.e. from the beginning of your script) and also what are your experiment goals?

ADD COMMENT • link 4.0 years ago by Kevin Blighe 87k

0

Entering edit mode

  library(edgeR)
    RNAseq2 <-read.delim("C:\\Users\\Com\\Desktop\\D1.tsv",header = TRUE)
    rnames <-RNAseq2[,1]
    MA <- data.matrix(RNAseq2[,2:4])
    dge <- DGEList(MA)
     getCounts(dge)
     cal <- calcNormFactors(dge,method = "TMM")
     RR <- cpm(cal,normalized.lib.sizes = FALSE)
     row.names(RR)<- rnames
      head(RR)
      write.table(RR,"C:\\Users\\hp folio\\Desktop\\TMM3.tsv",sep='\t',row.names=TRUE,col.names = TRUE)

My goal is to identify cancer related genes provided htseq-counts data

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

Okay, cool! The only thing that I would say is if you are aiming to derive data for downstream analyses, like network analysis, machine learning, et cetera, then you may want to add log=TRUE to your cpm() function in order to produce log [base 2] CPM.

Also, you should then plot the CPM data via the hist() function. It should roughly follow a normal distribution but may have 2 'humps'.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

The histogram that i obtained has a long bar at the start and rest all of the bars are very small .The frequency of values ranging between 0 and 1 is higher??? is that what you mean by normal distribution??

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

No, does not sound normal at all. Sounds like a negative binomial. Did you specify log=TRUE?

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Yes.i did added log=true in cpm function

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

Without seeing your histogram, I cannot really comment further. You may increase the value of breaks that you pass to the histogram.

Edit: I also assume that you filtered out low count transcripts prior to normalisation. Is there any reason why your data would have very high low-count transcripts?

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

why are you so much emphasizing on normal distribution ?? why it is not ok if my normalized data have negative values?? i must have uploaded histogram if there is some option to upload image.May be negative values show down regulated and positive values show unregulated.

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

You have asked people to check your data distribution but you have not even shown it...

Take a look here: How to add images to a Biostars post

Some downstream programs require data to be normally-distributed; others do not. What is your goal?, i.e., what is your downstream analysis plan? Stating 'identify cancer related genes' is not sufficient information, unfortunately.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

enter image description here

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

histogram

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

histogram

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

I have obtained the histogram. and i am going to classify (identify) cancer related genes using machine learning algorithm(PSO +SVM)

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

Cool - thank you. It really looks like you have a lot of low count transcripts, which is distorting the data distribution. The current distribution that you have is not suited for machine learning algorithms (assuming that the algorithms that you are aiming to use expect data to be normally-distributed - you, as the analyst, need to check this).

With your raw counts, did you do any pre-filtering for low count transcripts? For example, remove transcripts whose mean raw count < 10?

Going back a bit, you should also use normalized.lib.sizes = TRUE and could scale up your prior.count to be 1:

cpm(cal, normalized.lib.sizes = TRUE, prior.count = 1, log = TRUE)

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

can u please tell me how to pre -filter?

ADD REPLY • link 5.5 years ago by maryak ▴ 20

1

Entering edit mode

He is trying, but you aren't answering his questions.

Knowing the right distribution for your data is absolutely critical for any statistical analyses. If you data looks oddly distributed (which seemingly yours does), that might be indicative of bad data, or bad upstream data processing.

You cannot feed data in to a machine learning algorithm and magically expect results you want.

GARBAGE IN ---> GARBAGE OUT

ADD REPLY • link 5.5 years ago by Joe 21k

0

Entering edit mode

Thanks @jrj.healey!

To filter, you could try this:

# obtain mean raw count per gene
meanvals <- apply(MA, 1, function(x) mean(x, na.rm=TRUE))

# see how many genes have mean < 10
table(meanvals < 10)

# filter your data to only include transcripts with mean raw count >= 10
MA.filt <- MA[which(meanvals >= 10),]

Out of curiosity, what is the result of table(meanvals < 10) when you run the above code?

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Table(meanvals<10 ) returns me

FALSE  TRUE 
20536   39947

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

This is how my code looks like now

library(edgeR)
RNAseq2 <-read.delim("C:\\Users\\hp folio\\Desktop\\DLBC.tsv",header = TRUE)
rnames <-RNAseq2[,1]
#head(rnames)
MA <- data.matrix(RNAseq2[,2:49])
MAA  <- (2^MA)-1
meanvals <- apply(MAA, 1, function(MAA) mean(MAA, na.rm=TRUE))
table(meanvals < 10)
MAA.filt <- MAA.filt[which(meanvals >= 10),]
#hist(MAA)
#head (MAA)
dge <- DGEList(MAA)
dim(dge)
#getCounts(dge)
cal <- calcNormFactors(dge,method = "TMM")
RR <- cpm(cal,normalized.lib.sizes=TRUE ,log = TRUE,prior.count = 1)
hist(RR)
row.names(RR)<- rnames
#head(RR)
write.table(RR,"C:\\Users\\hp folio\\Desktop\\TMM3.tsv",sep='\t',row.names=TRUE,col.names = TRUE)

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

Thank you. There is the problem: You have ~40,000 genes that have low values.

I do not know what is your dataset, but it would make sense if these ~40,000 genes were non-coding RNAs or pseudogenes. The ~20,000 figure of genes with raw count >= 10 corresponds roughly to the number of protein coding genes in the human transcriptome.

You could reduce the count threshold to 5, if you want, or leave it at 10. Either way, these low count genes should be removed.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

1

Entering edit mode

thanks a lot for your generous help

ADD REPLY • link 5.5 years ago by maryak ▴ 20

0

Entering edit mode

@jrj.healey please only comment if you have answer to my question..don,t state that, has already been stated.

ADD REPLY • link 5.5 years ago by maryak ▴ 20

1

Entering edit mode

maryam.akram001 as a moderator, it is part of my job to intervene in threads. I can, will, and do, comment wherever I please.

Your attitude is not conducive to receiving help from Kevin, or anyone else for that matter. Everyone on this site is volunteering their time, and if you hadn't noticed, its also currently the weekend. Your part, as a poster of a question, is to provide all the details needed to help resolve the issue, and make the lives of the people contributing answers as simple as possible.

Think again about how you interact on the forum, if you expect to keep receiving help...

ADD REPLY • link 5.5 years ago by Joe 21k