How to use list of numpy arrays to train ML algorithm?
0
0
Entering edit mode
4.6 years ago

I'm trying to develop a machine learning algorithm using LinearSVC and another one using Convolutional Neural Networks to classify DNA sequences. I've had to one hot encode the DNA sequences and then I stored the resulting arrays for each sequence in a list. But when I do the train-test split step I wasn't able to use it.

My DNA sequences are like this (not my real dataset, which is way bigger, just to exemplify. All the sequences are in the file 'seqs_for_test.fasta'):

>TE_seq1 CCATAAACTATCTAAATAAGCACTTTTCTGGCTCTCTGGCCCCCCTTCTTCTTTTTGGGAAGGTGACAG AGGGTAAAAGGGCTCTCTGCCGTGCGAGGCTCCTCACAGACACACAGCAAGAAAGAAGCGCCGCGCAGCA

>TE_seq2 GATAGCCCCTCTCCCAGCCCCAGTCTGATCCCTAACCCTAACTCCACGGCTCCTGTCTCTACCCCCGTCT CTTTCTTCTTGTACCCTAGTCCCCCAGATCATTAGCTCCCTGCTCGGGCCCAGGGTTTTAAGAGAAGCCC

>TE_seq3 TGACTCAAGTCATGCTACCCAGCCCCGTCTTCTTAAAAATGAGACATGTTGAGACACCCTGCTTTTCGCC TACAAACACATCCATTCTCTATACTTAGTCTTATTTAAATTCTATCCTCTGTATGTCTAGTCCTGGGGGT

>RD_seq4 TGCTCGCCCCCCAGGAAGTGCAGAGACCGCCTGGGTGTGACTGTTTTTAGGCCTAACAAAGGCACAGAAA CACCCGTGCGGTCTCTGTATCCCCTGGAGGTATTTCTCCCCATTAGTTTGCTTGACACTAAGTTTTTAAA

>RD_seq5 TAAAAAAAGCTTATTAAGTCCCTAGAACCTGGGACCTATCTACCCAAGTTTTAAAACCTTACTTTTAAGG CTACATTTTTTTATTTTGACTGTTTTACCATAAGGTCACATATAGGAAACCCCCACTGTCCTAATAAAAA

>RD_seq6 CTAATCTCCTGTTGGCTGACTTACATCAGTTTGGGAAGTTGTTCATGATGACTCTGCGACGATCAAGAAG GACCAGGACTCTCCCTGGACACCTCAGGGACTTCTTGCTGGAGGGCACCATACATCAGTTTGCCAGCAAA

Here is my code for LinearSVC:

import pandas as pd
import numpy as np
from numpy import array
from numpy import argmax
from Bio import SeqIO
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split


with open('../fasta/seqs_for_test.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    sequences = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.appendseq_record.id)
        sequences.append(seq_record.seq.lower())


s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(sequences, name='sequence')

# Gathering Series into a pandas DataFrame and rename index as ID column
fasta_frame = pd.DataFrame(dict(ID=s1, sequence=s2)).set_index(['ID'])
fasta_frame


label_serie = pd.Series()
fasta_frame.insert(1, "label", label_serie)

# Transposable element (TE) == 0; Random (RD) == 1.
fasta_frame.loc[fasta_frame.index.str.contains(r'TE_'),'label'] = 0
fasta_frame.loc[fasta_frame.index.str.contains(r'RD_'),'label'] = 1
fasta_frame


# empty list to store ohe array sequences
res_arr = []
for index, row in fasta_frame['sequence'].iteritems():
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(row)
    # print(integer_encoded)
    # binary encode
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
#     print(index)
#     print(onehot_encoded)
    # append ohe arrays
    res_arr.append(onehot_encoded)

y = fasta_frame['label']
# y

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(res_arr, 
                                                    y, 
                                                    test_size = 0.20, 
                                                    random_state=42)

# print(x_train)
# print(y_train)
# print(x_test)
# print(y_test)

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
modelo = LinearSVC()
modelo.fit(x_train, y_train)
previsoes = modelo.predict(y_test)
acuracia = accuracy_score(y_test, previsoes) * 100
print("accuracy was %.2f%%" % acuracia)

I've tried to reshape, np.vstack and other ways but got no success. How can I use the list of arrays as my training set?

Error message:

ValueError: Found array with dim 3. Estimator expected <= 2.

python deep-learning machine-learning • 1.1k views
ADD COMMENT
0
Entering edit mode

you should provide which line of code got error, and also print the shape and head of res_arr. It looks like the shape of your array is not correct.

ADD REPLY
0
Entering edit mode

Hi, @shoujun.gu The line I got the error is

modelo.fit(x_train, y_train)

res_arr is a list of Numpy arrays. In this case, the example set, not my real data set, res_arr is composed by six arrays with shape (140, 4) for all of them. The arrays are like:

array([[0., 1., 0., 0.], [0., 0., 0., 1.], [1., 0., 0., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 0., 1., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [1., 0., 0., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [1., 0., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 1., 0., 0.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 0., 1., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [0., 0., 1., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [1., 0., 0., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 0., 1.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 0., 1., 0.], [0., 1., 0., 0.], [1., 0., 0., 0.], [1., 0., 0., 0.], [1., 0., 0., 0.]])

It represents each A, C, G and T in the sequences. A=[1,0,0,0], C=[0,1,0,0], G=[0,0,1,0], T=[0,0,0,1].

ADD REPLY
0
Entering edit mode

You need to check the shape of your real training data, since the error message showed its a 3d array, not 2d.

ADD REPLY

Login before adding your answer.

Traffic: 1764 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6