Question: verification of CPM values
0
gravatar for maryam.akram001
4 days ago by
maryam.akram00110 wrote:

Is there any way to determine that have i calculated the right CPM values from htseq -counts data? i calculated CPM using R

rna-seq R • 181 views
ADD COMMENTlink written 4 days ago by maryam.akram00110
0
gravatar for Kevin Blighe
4 days ago by
Kevin Blighe30k
Republic of Ireland
Kevin Blighe30k wrote:

Paste the commands that you have used, right from the word 'go' (i.e. from the start) and also what your experiment goals are.

ADD COMMENTlink written 4 days ago by Kevin Blighe30k
  library(edgeR)
    RNAseq2 <-read.delim("C:\\Users\\Com\\Desktop\\D1.tsv",header = TRUE)
    rnames <-RNAseq2[,1]
    MA <- data.matrix(RNAseq2[,2:4])
    dge <- DGEList(MA)
     getCounts(dge)
     cal <- calcNormFactors(dge,method = "TMM")
     RR <- cpm(cal,normalized.lib.sizes = FALSE)
     row.names(RR)<- rnames
      head(RR)
      write.table(RR,"C:\\Users\\hp folio\\Desktop\\TMM3.tsv",sep='\t',row.names=TRUE,col.names = TRUE)

My goal is to identify cancer related genes provided htseq-counts data

ADD REPLYlink written 4 days ago by maryam.akram00110

Okay, cool! The only thing that I would say is if you are aiming to derive data for downstream analyses, like network analysis, machine learning, et cetera, then you may want to add log=TRUE to your cpm() function in order to produce log [base 2] CPM.

Also, you should then plot the CPM data via the hist() function. It should roughly follow a normal distribution but may have 2 'humps'.

ADD REPLYlink written 4 days ago by Kevin Blighe30k

The histogram that i obtained has a long bar at the start and rest all of the bars are very small .The frequency of values ranging between 0 and 1 is higher??? is that what you mean by normal distribution??

ADD REPLYlink written 3 days ago by maryam.akram00110

No, does not sound normal at all. Sounds like a negative binomial. Did you specify log=TRUE?

ADD REPLYlink written 3 days ago by Kevin Blighe30k

Yes.i did added log=true in cpm function

ADD REPLYlink written 3 days ago by maryam.akram00110

Without seeing your histogram, I cannot really comment further. You may increase the value of breaks that you pass to the histogram.

Edit: I also assume that you filtered out low count transcripts prior to normalisation. Is there any reason why your data would have very high low-count transcripts?

ADD REPLYlink modified 2 days ago • written 2 days ago by Kevin Blighe30k

why are you so much emphasizing on normal distribution ?? why it is not ok if my normalized data have negative values?? i must have uploaded histogram if there is some option to upload image.May be negative values show down regulated and positive values show unregulated.

ADD REPLYlink modified 2 days ago • written 2 days ago by maryam.akram00110

You have asked people to check your data distribution but you have not even shown it...

Take a look here: How to add images to a Biostars post

Some downstream programs require data to be normally-distributed; others do not. What is your goal?, i.e., what is your downstream analysis plan? Stating 'identify cancer related genes' is not sufficient information, unfortunately.

ADD REPLYlink modified 2 days ago • written 2 days ago by Kevin Blighe30k

enter image description here

ADD REPLYlink written 2 days ago by maryam.akram00110

histogram

ADD REPLYlink written 2 days ago by maryam.akram00110

histogram Capture

ADD REPLYlink written 2 days ago by maryam.akram00110

I have obtained the histogram. and i am going to classify (identify) cancer related genes using machine learning algorithm(PSO +SVM)

ADD REPLYlink written 2 days ago by maryam.akram00110

Cool - thank you. It really looks like you have a lot of low count transcripts, which is distorting the data distribution. The current distribution that you have is not suited for machine learning algorithms (assuming that the algorithms that you are aiming to use expect data to be normally-distributed - you, as the analyst, need to check this).

With your raw counts, did you do any pre-filtering for low count transcripts? For example, remove transcripts whose mean raw count < 10?

Going back a bit, you should also use normalized.lib.sizes = TRUE and could scale up your prior.count to be 1:

cpm(cal, normalized.lib.sizes = TRUE, prior.count = 1, log = TRUE)
ADD REPLYlink written 2 days ago by Kevin Blighe30k

can u please tell me how to pre -filter?

ADD REPLYlink written 1 day ago by maryam.akram00110
1

He is trying, but you aren't answering his questions.

Knowing the right distribution for your data is absolutely critical for any statistical analyses. If you data looks oddly distributed (which seemingly yours does), that might be indicative of bad data, or bad upstream data processing.

You cannot feed data in to a machine learning algorithm and magically expect results you want.

GARBAGE IN ---> GARBAGE OUT
ADD REPLYlink written 1 day ago by jrj.healey7.4k

Thanks @jrj.healey!

To filter, you could try this:

# obtain mean raw count per gene
meanvals <- apply(MA, 1, function(x) mean(x, na.rm=TRUE))

# see how many genes have mean < 10
table(meanvals < 10)

# filter your data to only include transcripts with mean raw count >= 10
MA.filt <- MA[which(meanvals >= 10),]

Out of curiosity, what is the result of table(meanvals < 10) when you run the above code?

ADD REPLYlink modified 1 day ago • written 1 day ago by Kevin Blighe30k

Table(meanvals<10 ) returns me

FALSE  TRUE 
20536   39947
ADD REPLYlink written 1 day ago by maryam.akram00110

This is how my code looks like now

library(edgeR)
RNAseq2 <-read.delim("C:\\Users\\hp folio\\Desktop\\DLBC.tsv",header = TRUE)
rnames <-RNAseq2[,1]
#head(rnames)
MA <- data.matrix(RNAseq2[,2:49])
MAA  <- (2^MA)-1
meanvals <- apply(MAA, 1, function(MAA) mean(MAA, na.rm=TRUE))
table(meanvals < 10)
MAA.filt <- MAA.filt[which(meanvals >= 10),]
#hist(MAA)
#head (MAA)
dge <- DGEList(MAA)
dim(dge)
#getCounts(dge)
cal <- calcNormFactors(dge,method = "TMM")
RR <- cpm(cal,normalized.lib.sizes=TRUE ,log = TRUE,prior.count = 1)
hist(RR)
row.names(RR)<- rnames
#head(RR)
write.table(RR,"C:\\Users\\hp folio\\Desktop\\TMM3.tsv",sep='\t',row.names=TRUE,col.names = TRUE)
ADD REPLYlink written 1 day ago by maryam.akram00110

Thank you. There is the problem: You have ~40,000 genes that have low values.

I do not know what is your dataset, but it would make sense if these ~40,000 genes were non-coding RNAs or pseudogenes. The ~20,000 figure of genes with raw count >= 10 corresponds roughly to the number of protein coding genes in the human transcriptome.

You could reduce the count threshold to 5, if you want, or leave it at 10. Either way, these low count genes should be removed.

ADD REPLYlink written 1 day ago by Kevin Blighe30k
1

thanks a lot for your generous help

ADD REPLYlink written 1 day ago by maryam.akram00110

@jrj.healey please only comment if you have answer to my question..don,t state that, has already been stated.

ADD REPLYlink written 1 day ago by maryam.akram00110
1

maryam.akram001 as a moderator, it is part of my job to intervene in threads. I can, will, and do, comment wherever I please.

Your attitude is not conducive to receiving help from Kevin, or anyone else for that matter. Everyone on this site is volunteering their time, and if you hadn't noticed, its also currently the weekend. Your part, as a poster of a question, is to provide all the details needed to help resolve the issue, and make the lives of the people contributing answers as simple as possible.

Think again about how you interact on the forum, if you expect to keep receiving help...

ADD REPLYlink written 1 day ago by jrj.healey7.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1481 users visited in the last hour