Data transformation/normalization/scaling for machine learning using both RNA-Seq and Microarray data
3.9 years ago

Hello all,

We are trying to feed Microarray and RNA-Seq gene expression datasets of the same type of cancer into a Machine Learning pipeline, and we are looking for a method that could align (or come close to aligning) the medians and interquartile ranges of the different datasets. Each dataset is already well normalized individually, so we want to extend this to a cross-dataset normalization. Ideally, we want to fit some sort of scaler on the training data and use it to transform a new testing dataset into a compatible one (similar range and median) that would be used for classification testing. The final model should be able to classify both transformed Microarray data and transformed RNA-Seq data.

We have already tried several approaches: quantile transformation (sklearn.preprocessing.QuantileTransformer), nonparanormal transformation (with either the shrunken or the truncated ECDF, available in the huge R package), mapping to a normal distribution with the Yeo-Johnson method, which accepts 0 values (sklearn.preprocessing.PowerTransformer), standardization/mean removal (sklearn.preprocessing.StandardScaler), the scikit-learn normalization with the l2 norm (sklearn.preprocessing.normalize), and even a simple min-max scaling, which strangely showed one of the best cross-dataset test performances. However, none of the methods succeeded in aligning the medians when each dataset was transformed independently, and only some of them gave the datasets similar interquartile ranges (judged subjectively from boxplots).
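For reference, a transform that aligns medians and interquartile ranges by construction, even when each dataset is processed independently, is per-gene robust scaling (subtract the median, divide by the IQR). The sketch below uses plain NumPy and made-up data; the function name `robust_scale` and the simulated matrices are illustrative assumptions, not part of the poster's pipeline. Note that forcing this alignment implicitly assumes the datasets have a similar class composition, which may not hold.

```python
import numpy as np

def robust_scale(X):
    """Center each gene (column) on its median and divide by its IQR."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    iqr = q75 - q25
    iqr[iqr == 0] = 1.0  # leave constant genes unscaled instead of dividing by zero
    return (X - median) / iqr

rng = np.random.default_rng(0)
# Hypothetical log-scale data: samples x genes, on different scales per platform.
rnaseq = rng.normal(loc=8.0, scale=3.0, size=(40, 5))
microarray = rng.normal(loc=6.0, scale=1.0, size=(60, 5))

# Each dataset is scaled on its own, with no information shared between them.
rnaseq_scaled = robust_scale(rnaseq)
microarray_scaled = robust_scale(microarray)

print(np.median(rnaseq_scaled, axis=0))      # per-gene medians after scaling
print(np.median(microarray_scaled, axis=0))
```

scikit-learn's sklearn.preprocessing.RobustScaler performs the same median/IQR scaling with its default settings and adds the usual fit/transform interface.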

Some of our current questions are:

  • can we even compare these two technologies like that?
  • which method should we choose for transforming/normalizing the data?
  • and should we expect aligned boxplots for samples with the same class but coming from different datasets? (I mean, is there a method capable of doing so?)
rna-seq microarray normalization transformation • 2.0k views
3.9 years ago
Mensur Dlakic ★ 28k

The answer to this depends on several considerations: 1) are your normalizations for different technologies internally consistent, if not between each other? 2) are you doing classification or regression in your ML pipeline? (assuming classification).

When using tree-based techniques, differences in scale between different columns (different technologies) are unimportant. These methods can simultaneously use categorical and continuous data, which are by definition not on the same scale. As a general rule, tree splits will be adjusted during the learning process and have no problem with scale. Still, if you are doing classification, it may help to reduce the cardinality (number of unique values per feature/column) by discretizing the data (also called binning). There are many ways of doing this: uniform range width, uniform number of elements per range, etc. I like the minimum description length principle, which is entropy-based and easy to understand, but it is slow. It is easy to find more information by Googling, and there are several implementations on GitHub.
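A minimal sketch of the two simplest binning schemes mentioned above (uniform range width and uniform number of elements per range), written with NumPy only. The helper names and the simulated gene are assumptions for illustration; the entropy-based MDL approach is not shown here.

```python
import numpy as np

def equal_width_bins(x, n_bins):
    """Split the observed range of x into n_bins intervals of equal width."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Use interior edges only, so labels run from 0 to n_bins - 1.
    return np.digitize(x, edges[1:-1])

def equal_frequency_bins(x, n_bins):
    """Choose edges at quantiles so each bin holds about the same number of values."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    return np.digitize(x, edges[1:-1])

rng = np.random.default_rng(0)
# Hypothetical, right-skewed expression values for one gene.
expression = rng.lognormal(mean=2.0, sigma=1.0, size=1000)

width_labels = equal_width_bins(expression, 4)
freq_labels = equal_frequency_bins(expression, 4)

print(np.bincount(width_labels, minlength=4))  # uneven counts on skewed data
print(np.bincount(freq_labels, minlength=4))   # roughly equal counts
```

In scikit-learn, sklearn.preprocessing.KBinsDiscretizer offers both schemes through its strategy parameter ("uniform" and "quantile").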

@Mensur Dlakic, if you know of any, would you please recommend a couple of reading/training materials on applying classifiers (like RF) to genomic data? Google can certainly help, but I would like to hear ideas from an experienced fellow. Thanks

I have never built a classifier specifically for RNA-Seq or microarray data, but there should be no major differences here from any other data type. As long as you set up proper cross-validation, random forests usually do not overfit and tend to work out of the box. If you want to squeeze the last bit of performance out of the data, gradient boosting libraries such as XGBoost and LightGBM can do even better, but they require greater care and some hyperparameter optimization.
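As a rough starting point, a cross-validated random forest in scikit-learn can look like the sketch below. The data come from make_classification and merely stand in for an expression matrix; the number of trees, the number of folds and the scoring metric are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for an expression matrix: 200 samples x 50 "genes".
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

print(scores)          # one ROC-AUC value per fold
print(scores.mean())   # average across folds
```

Any preprocessing that learns from the data (scaling, gene selection) should be fitted inside each fold, for example through sklearn.pipeline.Pipeline, so that the held-out fold does not leak into training.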

Hello Professor Dlakic, first of all, thanks for your answer.

I'm afraid I might've miscommunicated our needs. We already have a decent classification model (using a Gradient Boosting Classifier and also an SVM) for each dataset in isolation (validated by cross-validation). The individual data transformations are, I think, carried out well and the data preprocessing looks pretty good. By the way, the Microarray datasets came specifically from a study that published an extensively curated microarray database for ML benchmarking.

However, what we really want is to train a model on, let's say, an RNA-Seq dataset, and use this trained model to classify samples coming from different datasets, which may originate from RNA-Seq as well as Microarray experiments. So the transformation method should be able to "map" the new data to the same IQR/median as the training data (at least, this is what we thought, and we have now started to question it).

As we know, not only do these methods work differently, but there are also variations within one technology (e.g., different microarray platforms). There are already some scientific efforts to understand how we could eliminate those lab differences without interfering with the biological information in the data. We even reviewed a few studies investigating cross-platform normalization techniques, but we were still not able to achieve the similar IQR/median between different datasets that I mentioned above. Maybe, as the preliminary results suggest, we should forget about this and simply MinMax the preprocessed data?

By the way, the Microarray datasets came specifically from a study that published an extensively curated microarray database for ML benchmarking.

I have no first-hand experience with this kind of data. That said, it seems reasonable to apply to future data the same standardization procedure that they used to create the benchmarking dataset.

I would not rely on the fact that MinMax scaling happens to work on some unseen data you have tried. As you know, the sigmoid has a property of being relatively insensitive to small changes in some parts of the plot, but there is a part where it rises steeply. It may be that the datasets you tried so far were on a similar enough scale as your training data that a simple MinMax was sufficient. I would hesitate to extrapolate that to all future datasets. At the very least I would try subtracting the mean and scaling the variance, though you may need to re-train the original classifier using the same approach.
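To make the last suggestion concrete, here is a small sketch of mean subtraction and variance scaling in which the statistics are learned from the training data only and then reused on new data. The `Standardizer` class and the simulated matrices are illustrative assumptions; sklearn.preprocessing.StandardScaler provides the same fit/transform behaviour.

```python
import numpy as np

class Standardizer:
    """Per-gene standardization with statistics learned from training data."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # constant genes are left unscaled
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_

rng = np.random.default_rng(0)
# Hypothetical training data and a new dataset measured on a shifted scale.
train = rng.normal(loc=8.0, scale=2.0, size=(100, 4))
new = rng.normal(loc=4.0, scale=2.0, size=(100, 4))

scaler = Standardizer().fit(train)
train_z = scaler.transform(train)
new_z = scaler.transform(new)

print(train_z.mean(axis=0))  # close to 0 by construction
print(new_z.mean(axis=0))    # far from 0: the platform shift is still there
```

The second printout shows why reusing training statistics alone does not remove a systematic offset between platforms, and why the classifier would have to be re-trained on data processed the same way.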

3.9 years ago
JC 13k

can we even compare these two technologies like that?

No, the methods differ greatly in scale and are not directly comparable; look into how expression measurements differ between microarrays (cDNA and tiling arrays) and RNA-seq. You could compare them at some level using relative expression (DEG) or ranking.
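A sketch of the ranking idea: if every sample's values are replaced by their within-sample ranks, any strictly increasing difference in scale between platforms drops out. The function name and the two simulated "platforms" below are assumptions for illustration; real platforms also differ in ways that are not monotone, so this is an idealized case.

```python
import numpy as np

def rank_transform(X):
    """Replace each sample's (row's) values by within-sample ranks scaled to [0, 1].

    Ties are broken arbitrarily by argsort; at least two genes are required.
    """
    X = np.asarray(X, dtype=float)
    order = X.argsort(axis=1)
    ranks = order.argsort(axis=1)  # 0 = lowest expressed gene in that sample
    return ranks / (X.shape[1] - 1)

rng = np.random.default_rng(0)
# Hypothetical "true" abundances: 3 samples x 20 genes.
true_abundance = rng.gamma(shape=2.0, scale=50.0, size=(3, 20))

# Two made-up platforms that distort the scale in different monotone ways.
platform_a = np.log2(true_abundance + 1.0)
platform_b = 0.5 * np.sqrt(true_abundance) + 3.0

ranks_a = rank_transform(platform_a)
ranks_b = rank_transform(platform_b)

print(np.array_equal(ranks_a, ranks_b))
```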

which method should we choose for transforming/normalizing the data?

I would prefer to normalize each dataset separately as a relative log-ratio, using the matched normal samples as the control.
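One possible reading of this suggestion, sketched with NumPy: within each dataset, subtract the per-gene mean of the normal samples from the log-scale tumor values. The function name, the simulated data and the assumption that platform effects act as a per-gene additive offset on the log scale are all illustrative; under that idealized assumption the offset cancels exactly.

```python
import numpy as np

def log_ratio_vs_normal(tumor_log, normal_log):
    """Subtract the per-gene mean of the normal samples from log-scale tumor values."""
    reference = np.asarray(normal_log, dtype=float).mean(axis=0)
    return np.asarray(tumor_log, dtype=float) - reference

rng = np.random.default_rng(0)
n_genes = 6
# Hypothetical log-scale expression shared by both platforms.
normal_true = rng.normal(7.0, 1.0, size=(10, n_genes))
tumor_true = rng.normal(8.0, 1.0, size=(10, n_genes))

# Assumed platform effect: a different additive offset per gene on the log scale.
offset_a = rng.normal(0.0, 2.0, size=n_genes)
offset_b = rng.normal(3.0, 2.0, size=n_genes)

# Each platform is normalized on its own, against its own normal samples.
ratio_a = log_ratio_vs_normal(tumor_true + offset_a, normal_true + offset_a)
ratio_b = log_ratio_vs_normal(tumor_true + offset_b, normal_true + offset_b)

print(np.allclose(ratio_a, ratio_b))
```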

and should we expect aligned boxplots for samples with the same class but coming from different datasets? (I mean, is there a method capable of doing so?)

Not sure what you are asking here.

Hi JC, sorry for the lack of information in the post.

Our data is already log-transformed, but this and the other preprocessing steps were carried out for each dataset individually. Without performing any cross-dataset data transformation, our classification model was able to achieve a decent ROC-AUC for breast cancer (well, in 2 of the 3 testing datasets). However, using the same ML pipeline, another model trained for lung cancer classification performed no better than a coin flip. This was not a surprise, since we didn't transform the test data to the same scale as our training data (and in fact a MinMax scaling was good enough to achieve more than 0.85 accuracy with only 50 genes on never-before-seen Microarray data, using a classification model trained only on RNA-seq data).
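For context on the MinMax result, the sketch below shows per-gene min-max scaling in NumPy and how a single extreme sample changes where all the other samples land. The function name and the toy values are assumptions for illustration only.

```python
import numpy as np

def min_max_scale(X):
    """Map each gene (column) linearly onto [0, 1] using its own min and max."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant genes map to 0 instead of dividing by zero
    return (X - lo) / span

# Toy values for one gene across five samples.
gene = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])
# The same gene with one extreme sample added.
gene_with_outlier = np.vstack([gene, [[50.0]]])

scaled = min_max_scale(gene).ravel()
scaled_with_outlier = min_max_scale(gene_with_outlier).ravel()

print(scaled)                   # evenly spread over [0, 1]
print(scaled_with_outlier[:5])  # the original five are squeezed near 0
```

Because the result depends entirely on the two most extreme values in each dataset, two datasets scaled this way are only comparable when their extremes happen to correspond.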

Now the question is: which method is most suitable for cross-platform normalization of datasets? Is it mandatory to have a similar IQR/median? As I mentioned in my reply to Professor Dlakic, there are already some publications investigating that, but we have not been able to match their "normalization performance" yet (at least not visually, hahaha).
