Machine learning features from nucleotide sequences
7.7 years ago

What features can be extracted from nucleotide sequences for machine learning? For example, GC content, dinucleotide frequency, etc.

Machine learning • 4.6k views

Good question, but remember that a tool looking for a job rarely ends up doing that job well.

7.7 years ago

There are multiple ways to compile your features:

1) Knowledge-based approach: here you would use only a limited set of features that have a direct influence on your prediction/classification/learning task. The feature set will be limited, and you won't be able to add new knowledge to the field. See an example where we used a subset of features that we assumed to have a role in 3D domain swapping.

2) Data-driven approach: you can compile all the features that you can gather from your nucleotides (DNA or RNA?) and test them using a rigorous feature selection method. See an example where we used the entire set of features together with hybrid features (combining multiple features) to predict 3D domain swapping.

3) Feature engineering/representation learning: you can take either of the above sets and use deep neural encoding methods; here the algorithm engineers the features (NN, RBM, PCA, LSTM, etc.). This approach is more applicable when you have large dataset(s) and are not primarily interested in which features contribute to your predictive model, as you would be for feature selection or biological inference.

  • Counts of individual bases and short combinations (A, T, G, C: mono-, di-, trinucleotides, ...)
  • k-mer count (See previous answers)
  • physicochemical properties of your sequences
  • evolutionary scores (example here)
  • mutation/substitution scores (GERP, PhyloP, etc.)
  • Annotation-based features (part of gene-structure (exon-intron), coding or non-coding etc.)
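
The sequence-only items in the list above (base composition, GC content, k-mer frequencies) can be computed directly from the sequence string. Here is a minimal sketch in plain Python. The function names `gc_content`, `kmer_frequencies` and `feature_vector` are made up for illustration, and it assumes a DNA alphabet where any window containing something other than A, C, G or T (for example N) is simply skipped:

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counts = Counter(seq.upper())
    acgt = sum(counts[b] for b in "ACGT")
    return (counts["G"] + counts["C"]) / acgt if acgt else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer, in a fixed order.

    Returns a dict with 4**k entries so that every sequence maps to a
    feature vector of the same length. Overlapping windows are counted;
    windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


def feature_vector(seq, ks=(1, 2, 3)):
    """GC content followed by mono-, di- and trinucleotide frequencies."""
    features = [gc_content(seq)]
    for k in ks:
        features.extend(kmer_frequencies(seq, k).values())
    return features
```

Using relative frequencies rather than raw counts keeps sequences of different lengths comparable. Whether that is appropriate depends on the prediction problem.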

PS. As one of the other answers says, it all depends on your prediction problem.

7.7 years ago
ebrahimiet ▴ 50

In the following papers, we used a range of nucleotide features:

Gene Volume 578, Issue 2, 10 March 2016, Pages 194–204 Unravelling evolution of Nanog, the key transcription factor involved in self-renewal of undifferentiated embryonic stem cells, by pattern recognition in nucleotide and tandem repeats characteristics

BMC Research Notes 2014, 7:565. DOI: 10.1186/1756-0500-7-565. Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms

7.7 years ago
  • k-mer, a very important one.
  • secondary structure

It should not be related to structure, only things we can get from the sequence itself. And for k-mers, how should I choose which k is best for me?


You may try different values of k. In some fields, secondary structure may help.
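
One way to "try different values of k" is to score each k with the same held-out evaluation and compare. The sketch below is only an illustration of that idea, not a recommended pipeline: it uses a leave-one-out 1-nearest-neighbour classifier on k-mer frequency profiles, written in plain Python, and all function names (`kmer_profile`, `loo_accuracy`, `compare_k`) are hypothetical. In practice you would more likely plug the k-mer features into your actual model and use a library's cross-validation.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Relative k-mer frequencies as a list of length 4**k."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def loo_accuracy(seqs, labels, k):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    profiles = [kmer_profile(s, k) for s in seqs]
    correct = 0
    for i, query in enumerate(profiles):
        best_label, best_dist = None, math.inf
        for j, other in enumerate(profiles):
            if i == j:
                continue  # never compare a sequence with itself
            dist = math.dist(query, other)  # Euclidean distance
            if dist < best_dist:
                best_label, best_dist = labels[j], dist
        correct += best_label == labels[i]
    return correct / len(seqs)


def compare_k(seqs, labels, ks=(1, 2, 3, 4)):
    """Map each candidate k to its leave-one-out accuracy."""
    return {k: loo_accuracy(seqs, labels, k) for k in ks}


# Made-up toy data: two AT-rich and two GC-rich sequences.
toy_seqs = ["AAAATTTTAA", "ATATATATTA", "GGGGCCCCGG", "GCGCGCGCCG"]
toy_labels = [0, 0, 1, 1]
scores = compare_k(toy_seqs, toy_labels)
```

Keep in mind that the number of features is 4**k, so with short sequences or few samples, larger k quickly gives sparse profiles that are easy to overfit.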

7.7 years ago

I would consider transcription factor binding consensus motifs an interesting feature, along with others such as promoters, polyadenylation signals, and conservation across species. But these meta-features (I just made that term up) need an external annotation, so maybe that's not what you're looking for.
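
Simple consensus motifs can still be scanned for directly in the sequence without an annotation, which gives a rough motif-count feature. Here is a minimal sketch, assuming the motif is written with IUPAC nucleotide codes and that only the given strand is searched. The helper names `consensus_to_regex` and `motif_positions` are made up for illustration, and the motifs in the usage lines (the canonical AATAAA polyadenylation signal and TATAWAW, one commonly quoted TATA-box consensus) are just examples:

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regex.

    The pattern is wrapped in a lookahead so that overlapping
    occurrences are all reported.
    """
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile("(?=" + body + ")")


def motif_positions(seq, consensus):
    """0-based start positions of the motif on the given strand only."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(seq.upper())]


# Example features for one made-up sequence: motif counts.
seq = "GGTATAAAAGGCCAATAAACC"
polya_count = len(motif_positions(seq, "AATAAA"))
tata_count = len(motif_positions(seq, "TATAWAW"))
```

A plain consensus match like this is much cruder than a position weight matrix scan, and it ignores the reverse strand, so treat the counts as a weak signal.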

What is the purpose of your analysis?


I need to do predictions.


Aaaaah predictions. That's oddly specific.

7.7 years ago
O.rka ▴ 710

This one is probably my favorite. It uses k-mer counts with the t-SNE algorithm to cluster contigs into bins of organisms, and is used for binning organisms out of a metagenome: http://claczny.github.io/VizBin/

By machine learning, are you talking about doing predictions or clustering?


Predictions. I have a very interesting problem that I can't discuss right now.
