Question

CPAT: error on coding probability cutoff

2.4 years ago

xiaoxia8923 • 0

Dear all,

I am trying to identify lncRNAs in dairy cows using CPAT. To determine the coding probability cutoff, I followed the "How to choose cutoff" section to generate the training dataset.

Here is what I did:

Step 1:
make_hexamer_tab.py -c /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.cds.all.fa -n /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.ncrna.fa > Bos_taurus_Hexamer.tsv

Step 2:
make_logitModel.py -x Bos_taurus_Hexamer.tsv -c /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.cdna.all.fa.gz -n /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.ncrna.fa -o Bos_taurus

I downloaded the CDS file from Ensembl. The known protein-coding sequences (cdna) and the non-coding sequences (ncrna) were also downloaded from Ensembl. From the previous step, I generated the required training dataset, whose header (names(data)) is: "ID" "mRNA" "ORF" "Fickett" "Hexamer" "Label" (the same as shown on the website).

Then I used "10Fold_CrossValidation.r", which I downloaded from the CPAT website, to generate Figure 3 and decide the coding potential cutoff. At the step "pred <- prediction(ROCR_data$predictions, ROCR_data$Labels)", it raised the following error:

Error in prediction(ROCR_data$predictions, ROCR_data$Labels): Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

I opened the generated "test1.xls" and found that the labels are all equal to "1". The originally loaded data (the training dataset) has both "0" and "1". I did not change the code of "10Fold_CrossValidation.r", and I have no idea what is going on. Could anyone please advise what is wrong with my steps and suggest how to fix this problem?

Many thanks.

lncRNA RNAseq CPAT • 897 views

updated 8 months ago by catarinaventura_pt • written 2.4 years ago by xiaoxia8923

score 0 · Answer 1 · 2022-03-07

Update: I got a lot of help from Dr Wang. My training data is not balanced between coding and non-coding (4562 0s and 37988 1s), and all the coding genes are clustered together. Following Dr Wang's suggestion, I should shuffle my coding and non-coding data before running the R script. In addition, I have more than 20,000 genes (the total number of genes assumed in "10Fold_CrossValidation.r"), so before running the R script I need to split my data into 10 equally sized data sets. After this step, the errors in the prediction step were gone.
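The shuffle-and-split step described above can be sketched in Python. This is only an illustration and not part of CPAT or of "10Fold_CrossValidation.r": the function name `stratified_folds` and the toy label layout are assumptions, and only the label counts (4562 0s and 37988 1s) come from my data. Each class is shuffled separately and dealt round-robin into 10 folds, so every fold holds both labels in roughly the original ratio.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split row indices into k folds that keep the class ratio of `labels`.

    Hypothetical helper for illustration; not part of CPAT.
    """
    rng = random.Random(seed)

    # Group row indices by class label (1 = coding, 0 = non-coding).
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # break up the clustering within each class
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal rows round-robin across folds

    for fold in folds:
        rng.shuffle(fold)  # mix coding and non-coding rows inside each fold
    return folds


# Toy labels with the same counts as my data, with each class clustered together.
labels = [1] * 37988 + [0] * 4562
folds = stratified_folds(labels, k=10, seed=42)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

for n, counts in enumerate(fold_counts, start=1):
    print(f"fold {n}: coding={counts[1]} noncoding={counts[0]}")
```

The resulting index lists could then be used to write out 10 files, one per fold, before running the R script. Splitting the rows in their original order would instead give folds that contain only one label, which matches what I saw in "test1.xls".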

However, I am now facing other problems:

1. perf <- performance(pred,"tpr","fpr")
   Error in stats::approxfun(x.values.1, y.values.1, method = "constant", : zero non-NA points

2. d=performance(pred,measure="prec", x.measure="rec")
   Error in stats::approxfun(x.values.1, y.values.1, method = "constant", : zero non-NA points

3. plot(S,lwd=2,avg="vertical",add=TRUE,col="blue")
   Error in stats::approxfun(perf@x.values[[i]], perf@y.values[[i]], ties = mean, : need at least two non-NA values to interpolate

4. plot(P,lwd=2,avg="vertical",add=TRUE,col="red")
   Error in stats::approxfun(perf@x.values[[i]], perf@y.values[[i]], ties = mean, : need at least two non-NA values to interpolate

I am still looking for suggestions on how to solve this. Cheers.
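One sanity check that might help narrow this down is to look at each fold before it reaches ROCR. I have not confirmed the cause of these errors, so the sketch below is a guess at what to inspect, not a diagnosis. It is written in Python rather than R, the helper name `summarize_fold` is hypothetical, and it assumes each fold is available as a list of predicted probabilities and a matching list of 0/1 labels. It reports how many predictions are missing or non-finite and whether both classes remain once those rows are dropped.

```python
import math
from collections import Counter


def summarize_fold(predictions, labels):
    """Summarize one cross-validation fold (hypothetical helper).

    Rows whose prediction is None, NaN or infinite are counted as dropped;
    class counts are taken over the remaining rows only.
    """
    kept = [
        lab
        for pred, lab in zip(predictions, labels)
        if pred is not None and math.isfinite(pred)
    ]
    counts = Counter(kept)
    return {
        "rows": len(labels),
        "dropped": len(labels) - len(kept),
        "coding": counts[1],
        "noncoding": counts[0],
        "both_classes": counts[1] > 0 and counts[0] > 0,
    }


# Toy folds: the first is clean, the second loses all of its non-coding rows.
fold_a = summarize_fold([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
fold_b = summarize_fold([float("nan"), None, 0.8, 0.6], [0, 0, 1, 1])
print(fold_a)
print(fold_b)
```

If any fold reports `both_classes` as False or a large `dropped` count, that fold would be the first place to look.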