There are multiple ways to compile your features:
1) Knowledge-based approach: here you use only a limited set of features that have a direct influence on your prediction/classification/learning task. The feature set will be small, and you won't be able to add new knowledge to the field. See an example where we used a subset of features that we assumed to have a role in 3D domain swapping.
2) Data-driven approach: you compile all the features you can gather from your nucleotides (DNA or RNA?) and test them using a rigorous feature selection method. See an example where we used the entire set of features, together with hybrid features (combining multiple features), to predict 3D domain swapping.
3) Feature engineering/representation learning: you can take either of the above sets and use deep neural encoding methods; here the algorithm engineers the features (NN, RBM, PCA, LSTM, etc.). This approach is more applicable when you have large dataset(s) and are not primarily looking for the features that contribute to your predictive model, as you would be for feature selection or biological inference.
- Count(s) of individual bases (ATGC - mono, bi, tri...)
- k-mer count (See previous answers)
- physicochemical properties of your sequences
- evolutionary scores (example here)
- mutation/substitution scores (GERP, PhyloP, etc.)
- Annotation-based features (part of gene-structure (exon-intron), coding or non-coding etc.)
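As a rough illustration of the first two items in the list, here is a minimal sketch of base and k-mer counting in plain Python. The function names (`kmer_counts`, `gc_content`) and the toy sequence are made up for this example, and it assumes a DNA alphabet of ACGT; windows containing other symbols such as N are simply left out of the counts.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k, alphabet="ACGT"):
    """Count overlapping k-mers in seq; returns a count for every possible k-mer."""
    seq = seq.upper()
    # Slide a window of width k over the sequence, one base at a time.
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all len(alphabet)**k k-mers so every sequence gives a vector
    # of the same length; k-mers with other symbols (e.g. N) are dropped.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


def gc_content(seq):
    """Fraction of G and C bases, a simple composition feature."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


seq = "ATGCGCGATA"           # toy sequence, for illustration only
mono = kmer_counts(seq, 1)   # mononucleotide counts
di = kmer_counts(seq, 2)     # dinucleotide counts
print(mono)
print(di["GC"], gc_content(seq))
```

Because every sequence is mapped onto the same fixed set of k-mers, the resulting dictionaries can be turned directly into rows of a feature matrix.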
P.S. As one of the other answers says, it all depends on your prediction problem.
In the following papers, we used a range of nucleotide features:
Unravelling evolution of Nanog, the key transcription factor involved in self-renewal of undifferentiated embryonic stem cells, by pattern recognition in nucleotide and tandem repeats characteristics. Gene, Volume 578, Issue 2, 10 March 2016, Pages 194–204.
Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms. BMC Research Notes 2014, 7:565. DOI: 10.1186/1756-0500-7-565
I would consider transcription factor binding consensus motifs an interesting feature, among others like promoters, polyadenylation signals, conservation across species, and so on. But these meta-features (I just made that term up) need an external annotation, so maybe that's not what you're looking for.
What is the purpose of your analysis?
This one is probably my favorite. It uses k-mer counts with a t-SNE algorithm to cluster contigs into bins of organisms, and is used for binning organisms out of a metagenome: http://claczny.github.io/VizBin/
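To give an idea of what k-mer counts plus t-SNE looks like in practice, here is a rough sketch. It is not VizBin's actual implementation; the function names (`kmer_profile`, `embed_contigs`) and the default parameters are assumptions chosen for illustration, and the embedding step assumes numpy and scikit-learn are installed.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative k-mer frequencies of seq, in fixed lexicographic k-mer order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalise by the number of valid windows so contig length cancels out;
    # windows containing N or other symbols are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def embed_contigs(contigs, k=4, perplexity=30.0, seed=0):
    """Project contig k-mer profiles to 2D with t-SNE (hypothetical helper)."""
    # Imported here so kmer_profile stays usable without these dependencies.
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.array([kmer_profile(c, k) for c in contigs])
    # scikit-learn requires the perplexity to be below the number of samples.
    perplexity = min(perplexity, len(contigs) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(X)
```

Calling `embed_contigs(list_of_contig_sequences)` returns one 2D point per contig; in a scatter plot of those points, contigs with similar composition tend to land close together, and such groups are the candidate bins.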
By machine learning, are you talking about doing predictions or clustering?