Do we still really need to remove sequence (protein or genome) redundancy when using deep learning approaches to construct prediction models?
5.9 years ago by kurdt325

Removing sequence redundancy is a crucial preprocessing step in protein and genome sequence analysis in bioinformatics, especially with traditional machine learning methods such as SVM, random forest, and decision trees. Removing redundant sequences keeps the dataset clean and reliable so the model can capture the primary classification boundaries, shrinks the dataset to reduce training time, and, just as importantly, helps to avoid overfitting. For these reasons, redundancy removal has long been treated as essential in computational sequence analysis.
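
To make the step concrete, here is a minimal sketch of greedy identity-based filtering in the spirit of tools like CD-HIT; the crude similarity measure and the 0.9 threshold are illustrative assumptions, not any tool's actual algorithm:

```python
# Minimal sketch of greedy identity-based redundancy filtering.
# The similarity measure and the 0.9 cutoff are illustrative assumptions;
# real tools (e.g. CD-HIT) use proper alignment-based identity.
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # Crude character-level similarity ratio, standing in for alignment identity.
    return SequenceMatcher(None, a, b).ratio()

def remove_redundancy(seqs, threshold=0.9):
    # Greedily keep a sequence only if it is < threshold similar to all kept ones.
    representatives = []
    for s in sorted(seqs, key=len, reverse=True):  # process longest first
        if all(identity(s, r) < threshold for r in representatives):
            representatives.append(s)
    return representatives

seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GAVLIPFMW"]
print(remove_redundancy(seqs))  # the two near-identical sequences collapse to one
```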

But when this meets deep learning, should we rethink the problem from scratch?

Firstly, deep learning models are more complex than traditional machine learning methods, so large-scale datasets are required for training. For this reason, image-based deep learning typically relies on data augmentation (rotation, shifting, adding noise) to generate more training images. In sequence analysis, then, do we still need to remove naturally occurring redundant sequences?
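
For comparison, this is roughly what image augmentation looks like in practice (sketched here with torchvision; the specific transforms and parameters are illustrative assumptions):

```python
# Minimal sketch of image data augmentation with torchvision.
# Transform choices and parameters are illustrative assumptions.
import torch
from torchvision import transforms

# Applied on the fly to PIL images, so each epoch sees slightly different data.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                       # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # shift
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.05 * torch.randn_like(x)), # additive noise
])
```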

Secondly, deep learning offers many techniques to mitigate overfitting, such as dropout, batch normalisation, and pooling layers.
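
For instance, a small 1D CNN over one-hot encoded protein sequences might place these layers as follows (the architecture, the sequence length of 500, and the 20-letter alphabet are illustrative assumptions):

```python
# Minimal sketch of a 1D CNN for one-hot encoded protein sequences,
# showing dropout, batch normalisation, and pooling as regularisers.
# Input shape (500, 20) and all hyperparameters are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(500, 20)),          # sequence length x amino-acid alphabet
    layers.Conv1D(64, 9, activation="relu"),
    layers.BatchNormalization(),            # stabilises training, mild regulariser
    layers.MaxPooling1D(2),                 # pooling reduces downstream parameters
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                    # dropout against co-adapted features
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```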

So, is it really necessary to remove sequence (protein or genome) redundancy when using deep learning approaches to construct prediction models?

sequence-redundancy