Question

miRNA normalization workflow

1


3.7 years ago

lenC_biotecLover ▴ 90

Hi everyone, I'm a biotech student just getting started with bioinformatics and computational biology, and I have some problems I don't know how to deal with. I have downloaded BRCA miRNA quantification data from TCGA, for solid tumor samples and normal samples (NAT, I know). I want to perform a classification analysis with a decision tree model, using the miRNA expression values as features.

What type of data should I use? Normalized, like RPKM or TPM or not?
Do I need to correct for batch effects within each dataset (separately for tumor samples and normal samples)?
How should I prepare the data before applying the decision tree algorithm? I don't fully understand how to start with the normalization, which steps to follow, or how to carry them out.

I am inexperienced with the statistical approaches this requires. This is my very first attempt; I have searched a lot, but I don't know what the best way to start would be. Any advice will be greatly appreciated, thank you so much in advance.

miRNA TCGA machine learning normalization R • 1.4k views

ADD COMMENT • link 3.7 years ago by lenC_biotecLover ▴ 90

score 3 · Answer 1 · 2020-08-18

3


3.7 years ago

ATpoint 81k

What type of data should I use? Normalized, like RPKM or TPM or not?

I would use vst from DESeq2. The vignette explains what it does. miRNAs are not special; the normal rules of RNA-seq normalization apply. Please read previous posts and the DESeq2/edgeR vignettes on normalization.

Do I need to correct for batch effects within each dataset (separately for tumor samples and normal samples)?

If there is a batch effect and it can be corrected, then probably yes. A batch effect can only be corrected if every batch contains replicates from each group, so intra-group removal (from what I understand) is not possible. It is also not desirable, since it would affect one group but not the other and so introduce a new batch effect. I hope your data are not fully confounded, for example tumors from one source and normals from a completely independent source/lab/workflow. In that case you cannot correct for it, and any effect you see could be driven purely by batch rather than biology. Batch correction with these published databases is often difficult because little or no metadata about the exact processing of the samples is available, so you have to explore whether there are batch effects (probably yes) and whether you can identify them. Check the Bioconductor package sva on that matter.
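To make the confounding point concrete, here is a minimal sketch in base R of how one might cross-tabulate group against batch. The `meta` table and its `group`/`batch` columns are hypothetical; with TCGA you would first have to find or derive a batch variable from whatever sample metadata is available.

```r
# Hypothetical sample annotation (not real TCGA metadata).
meta <- data.frame(
  group = c("tumor", "tumor", "tumor", "tumor",
            "normal", "normal", "normal", "normal"),
  batch = c("A", "A", "B", "B", "A", "B", "B", "A")
)

# Rows are groups, columns are batches, cells are sample counts.
design_table <- table(meta$group, meta$batch)
print(design_table)

# Batch is separable from biology only if every batch contains every group.
fully_crossed <- all(design_table > 0)

# Fully confounded: each batch contains samples from a single group only.
confounded <- all(colSums(design_table > 0) == 1)
```

If `confounded` comes out TRUE, no correction method can separate batch from biology. If `fully_crossed` is TRUE, including the batch in the model or correcting for it is at least possible.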

How should I prepare the data before applying the decision tree algorithm? I don't fully understand how to start with the normalization, which steps to follow, or how to carry them out.

If you are referring to normalization, you simply need a count matrix in which rows are genes (here miRNAs) and columns are samples. With vst from DESeq2 it would be vst(your.matrix). More details are in the DESeq2 vignette. Batch effects can be explored by PCA; again, see the vignette.
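As a rough sketch of those steps, assuming DESeq2 is installed: the example below uses DESeq2's own simulated data (`makeExampleDESeqDataSet`) as a stand-in for the real miRNA counts, and the sizes (300 miRNAs, 12 samples) are arbitrary. It calls `varianceStabilizingTransformation` directly rather than `vst`, because `vst` by default subsamples 1000 rows for its fit and miRNA matrices often have fewer usable rows than that.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for a real count matrix: 300 "miRNAs" x 12 samples,
# with a two-level 'condition' column in the sample annotation.
dds <- makeExampleDESeqDataSet(n = 300, m = 12)

# Variance-stabilizing transformation; blind = TRUE ignores the design,
# which is the usual choice for exploratory analysis.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Most classifiers expect samples in rows and features in columns.
features <- t(assay(vsd))
labels <- colData(vsd)$condition

# PCA on the transformed values to look for unexpected sample clusters.
plotPCA(vsd, intgroup = "condition")
```

With real data you would build the object from your own count matrix and sample table (for example with `DESeqDataSetFromMatrix`) and colour the PCA by any candidate batch variable as well as by condition.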

ADD COMMENT • link 3.7 years ago by ATpoint 81k

0


Thank you very much for your valuable advice. So RPKM is not a good choice, right?

ADD REPLY • link 3.7 years ago by lenC_biotecLover ▴ 90

2


Naive RPKM or any other per-million method (which corrects only for library size) might not be enough to correct for composition bias. If you want RPKM, be sure to use the implementations from edgeR or DESeq2, which use size factors that go beyond raw library-size scaling. Again, please check their vignettes. vst has the advantage that it corrects for this issue and also stabilizes the variance, which might be helpful. It is commonly used, so you won't have trouble justifying it.
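A toy illustration of composition bias in base R, with made-up counts: four miRNAs are identical in both samples and one (`mirE`) is strongly up in the second. The median-of-ratios calculation is a simplified sketch of the idea behind DESeq2's size factors, not its actual implementation (which, for instance, also has to handle zero counts).

```r
# Made-up counts: mirA-mirD are unchanged, mirE is 9x higher in s2.
counts <- cbind(
  s1 = c(mirA = 100, mirB = 200, mirC = 300, mirD = 400, mirE = 1000),
  s2 = c(mirA = 100, mirB = 200, mirC = 300, mirD = 400, mirE = 9000)
)

# Naive per-million scaling: divide each column by its total.
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Median-of-ratios size factors: compare each sample to the per-miRNA
# geometric mean and take the median ratio across miRNAs.
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(x) exp(median(log(x) - log_geo_means)))
norm_counts <- t(t(counts) / size_factors)

# mirA is unchanged in the raw data, but per-million scaling makes it look lower in s2.
print(cpm["mirA", ])
print(norm_counts["mirA", ])
```

Because `mirE` takes up most of the reads in `s2`, per-million scaling pushes every other miRNA down in that sample, while the median-based size factors are not affected by the single outlier.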

See here why proper normalization is necessary with regard to size factors:

ADD REPLY • link 3.7 years ago by ATpoint 81k

2


RPKM/FPKM and TPM also correct for gene length, which doesn't make sense for miRNAs, which are all (more or less) the same length.

ADD REPLY • link 3.7 years ago by i.sudbery 19k

0


Ok, thank you again for the clarification. So I should correct for the batch effect only if I find it in both of my datasets (normal and tumor)? I'm sorry for all these questions, really.

ADD REPLY • link 3.7 years ago by lenC_biotecLover ▴ 90