Question: Machine learning features from nucleotide sequences
4.2 years ago by
india
rajeshkumar_vinod30 wrote:

What possible features can be extracted from nucleotide sequences for machine learning? For example, GC content, dinucleotide frequency, etc.


Good question, but remember that a tool looking for a job rarely ends up doing that job well.

written 4.2 years ago by John12k
4.2 years ago by
Manhattan, NY
Khader Shameer18k wrote:

There are multiple ways to compile your features:

1) Knowledge-based approach: here you would use only a limited set of features that have a direct influence on your prediction/classification/learning task. The feature set will be limited, and you won't be able to add new knowledge to the field. See an example where we used a subset of features that we assumed to have a role in 3D domain swapping.

2) Data-driven approach: you can compile all available features that you can gather from your nucleotides (DNA or RNA?) and test them using a rigorous feature selection method. See an example where we used the entire set of features, together with hybrid features (combining multiple features), to predict 3D domain swapping.

3) Feature engineering/representation learning: you can take either of the above feature sets and use deep neural encoding methods; here the algorithm engineers the features (NN, RBM, PCA, LSTM, etc.). This approach is more applicable when you have large dataset(s) and your primary goal is prediction, rather than finding out which features contribute to your predictive model (feature selection) or biological inference.
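
As a rough illustration of the filter-style feature selection mentioned in 2), here is a minimal Python sketch. The scoring rule (difference of class means over the pooled standard deviation), the function names and the toy values are all assumptions made for illustration, not the method used in the examples referred to above.

```python
from statistics import mean, pvariance


def separation_score(values, labels):
    """Absolute difference of class means divided by the pooled standard deviation.

    labels must contain 1 (positive class) and 0 (negative class).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    pooled = ((pvariance(pos) + pvariance(neg)) / 2) ** 0.5
    diff = abs(mean(pos) - mean(neg))
    if pooled == 0:
        # No within-class spread: perfectly separating or completely flat.
        return float("inf") if diff > 0 else 0.0
    return diff / pooled


def rank_features(rows, labels):
    """Rank feature names from most to least class-separating.

    rows: list of dicts mapping feature name -> numeric value (one dict per sequence).
    Returns (ranked_names, scores).
    """
    names = sorted(rows[0])
    scores = {n: separation_score([r[n] for r in rows], labels) for n in names}
    return sorted(names, key=lambda n: scores[n], reverse=True), scores


# Toy feature table (made-up numbers): two positive and two negative sequences.
rows = [
    {"gc": 0.60, "cpg": 0.05},
    {"gc": 0.62, "cpg": 0.07},
    {"gc": 0.40, "cpg": 0.06},
    {"gc": 0.38, "cpg": 0.06},
]
ranked, scores = rank_features(rows, [1, 1, 0, 0])
print(ranked)
```

Whatever score you use, compute it on the training folds only (inside cross-validation); ranking features on the full dataset and then evaluating on the same data gives optimistic results.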

  • Count(s) of individual bases (ATGC - mono, bi, tri...)
  • k-mer count (See previous answers)
  • physicochemical properties of your sequences
  • evolutionary scores (example here)
  • mutation/substitution scores (GERP, PhyloP, etc.)
  • Annotation-based features (part of gene-structure (exon-intron), coding or non-coding etc.)
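
The first two items in the list above (base counts and k-mer counts) need nothing beyond the sequence itself. A minimal sketch in plain Python, where the function names and the choice to skip windows containing ambiguous bases such as N are my own assumptions:

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over ACGT, in a fixed order.

    Windows containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def feature_vector(seq, ks=(1, 2, 3)):
    """GC content followed by mono-, di- and trinucleotide frequencies."""
    features = {"gc": gc_content(seq)}
    for k in ks:
        features.update(kmer_frequencies(seq, k))
    return features


example = feature_vector("ATGCGCGATATTAGC")
print(len(example))  # 1 + 4 + 16 + 64 = 85 features
```

Using frequencies rather than raw counts keeps sequences of different lengths comparable; if length itself is informative for your problem, add it as a separate feature.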

PS. As one of the other answers says, it all depends on your prediction problem.

4.2 years ago by
ebrahimiet40 wrote:

In the following papers, we used a range of nucleotide features:

Gene, Volume 578, Issue 2, 10 March 2016, Pages 194–204. Unravelling evolution of Nanog, the key transcription factor involved in self-renewal of undifferentiated embryonic stem cells, by pattern recognition in nucleotide and tandem repeats characteristics

BMC Research Notes 2014, 7:565. DOI: 10.1186/1756-0500-7-565. Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms

4.2 years ago by
China
shenwei3565.5k wrote:
  • k-mer, a very important one.
  • secondary structure

It should not be related to structure, only to things we can get from the sequence. And for k-mers, how should I choose which k is best for me?

written 4.2 years ago by rajeshkumar_vinod30

You may try different values of k. In some fields, secondary structure may help.
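
One quick sanity check before trying different values of k is to compare the size of the k-mer space (4**k) with how much sequence you actually have. The sketch below is only an illustration; the function name and the report layout are assumptions.

```python
def kmer_space_report(seqs, ks=range(1, 9)):
    """For each k, report the number of possible k-mers (4**k), the number of
    distinct k-mers observed, and the number of usable windows in the data."""
    alphabet = set("ACGT")
    report = []
    for k in ks:
        observed = set()
        windows = 0
        for seq in seqs:
            seq = seq.upper()
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                if set(kmer) <= alphabet:  # skip windows with N etc.
                    observed.add(kmer)
                    windows += 1
        report.append({
            "k": k,
            "possible": 4 ** k,
            "observed": len(observed),
            "windows": windows,
        })
    return report


for row in kmer_space_report(["ATGCGCGATATTAGCGGCTA", "TTGACCGATAGGCTAACGT"]):
    print(row)
```

Once the number of possible k-mers approaches the number of windows, most features are zero for most sequences, so larger k needs more data. Among the values of k that pass this check, the usual way to pick one is to compare cross-validated performance on your own task.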

written 4.2 years ago by shenwei3565.5k
4.2 years ago by
Belgium
WouterDeCoster44k wrote:

I would consider transcription factor binding consensus motifs an interesting feature, among others such as promoters, polyadenylation signals, conservation across species, etc. But these meta-features (I just made that up) need an external annotation, so maybe that's not what you're looking for.
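
Simple consensus motifs can also be counted directly from the sequence, without external annotation. A minimal sketch, assuming IUPAC-coded consensus strings; the two motifs used here (the TATA box consensus TATAWAW and the polyadenylation signal AATAAA) are common textbook forms chosen for illustration, and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Illustrative consensus strings; replace with motifs relevant to your problem.
MOTIFS = {"tata_box": "TATAWAW", "polya_signal": "AATAAA"}


def motif_positions(seq, consensus):
    """0-based start positions of consensus matches on the given strand.

    Overlapping matches are reported.
    """
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    # A zero-width lookahead lets the regex engine find overlapping hits.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq.upper())]


def motif_features(seq):
    """Number of hits per motif, usable as extra columns in a feature table."""
    return {name: len(motif_positions(seq, c)) for name, c in MOTIFS.items()}


print(motif_features("GGTATAAATGGCCAATAAATAAAGG"))
```

This scans one strand only and treats every consensus position as equally important. For real transcription factor work, position weight matrices from a curated motif database are usually a better fit than exact consensus matching.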

What is the purpose of your analysis?


I need to do predictions.

written 4.2 years ago by rajeshkumar_vinod30

Aaaaah predictions. That's oddly specific.

written 4.2 years ago by WouterDeCoster44k
4.2 years ago by
O.rka210 wrote:

This one is probably my favorite. It uses k-mer counts with a t-SNE algorithm to cluster contigs into bins of organisms. It is used for binning organisms out of a metagenome: http://claczny.github.io/VizBin/
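
To give an idea of the kind of input such a binning approach works on, here is a minimal sketch of per-contig k-mer profiles in which each k-mer is merged with its reverse complement, so a contig and its opposite strand get the same vector. This is not VizBin's actual implementation; the choice of k=4, the normalization and the function names are assumptions, and the embedding step (t-SNE) is left to a dedicated library.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def canonical_profile(seq, k=4):
    """Normalized counts of canonical k-mers, in a fixed (sorted) order."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


def cosine_similarity(u, v):
    """Cosine of the angle between two profiles (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


contig_a = canonical_profile("ATGCGCGATATTAGCGGCTAGCTAGGATCC")
contig_b = canonical_profile("TTGACCGATAGGCTAACGTTAGCATCGGAT")
print(len(contig_a), cosine_similarity(contig_a, contig_b))
```

One row per contig from `canonical_profile` gives a matrix that can be passed to a dimensionality reduction or clustering method of your choice.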

By machine learning, are you talking about doing predictions or clustering?


Predictions. I have a very interesting problem that I can't discuss right now.

written 4.2 years ago by rajeshkumar_vinod30