How to grab a column out of a pandas data frame for encoding
4.5 years ago

Hi everyone, I'm trying to use a substitution matrix to encode peptide sequences with pandas for a PyTorch ML model. The substitution matrix has 20 rows and 20 columns, and I want to replace each letter of a peptide in peps, e.g. "A", with the 20 values in column A of sub_matrix. I tried .to_string(index=False), but it returns extra characters (spaces and newlines), and the values are actually floats, not strings, so it's definitely not ideal.

What can I use instead to get only the values, without spaces and newlines?

Also, I would be very glad if anyone could suggest the best way to process this data in PyTorch. I have previously used some ML packages in R, where all the values would be in one data frame. Is it better to put everything in one list and then convert it to a tensor, or to keep a separate list for each peptide?

my pandas data frame:

sub_matrix = pd.read_csv('blosum62_pd_ori.txt', header = 0, nrows = 20)

sub_matrix

    A       R         ...  V

0   0.2901  0.0310    ...  0.0688
1   0.0446  0.3450    ...  0.0310
2   0.0427  0.0449    ...  0.0270
... ...   ...         ...  ...
17  0.0303  0.0227    ...  0.0303
18  0.0405  0.0280    ...  0.0467
19  0.0700  0.0219    ...  0.2689

peps = ['GARRNDACE', 'QEERGGDPA']

the code:

def encode(pep):
    AAs = list(pep)
    encoded = []
    for aa in AAs:
        if aa in sub_matrix.columns:
            freqs = sub_matrix[aa].to_string(index=False)
            encoded.append(freqs)
    return encoded

for pep in peps:
    print(encode(pep))

I would like the output to be one non-nested list of all values, like:

['0.0783', '0.0329', '0.0652', '0.0466', '0.0325', '0.0412', '0.0350', ..., '0.5101', '0.0382', '0.0206', '0.0213', '0.0432', '0.0281', '0.0254', '0.0233']

but now it is:

[' 0.0783\n 0.0329\n 0.0652\n 0.0161\n 0.0106\n 0.0103\n 0.0175\n 0.0178\n ... ,' 0.0405\n 0.0523\n 0.0494\n 0.0914\n 0.0163\n 0.1029\n 0.2965']

[' 0.0256\n 0.0484\n 0.0337\n 0.0299\n 0.0122\n 0.2147\n 0.0645\n 0.0189\n ... ,' 0.0338\n 0.0568\n 0.1099\n 0.0730\n 0.0303\n 0.0405\n 0.0700']

pandas python encoding arrays pre-processing • 734 views
4.5 years ago
shoujun.gu ▴ 350

Change line 7 of the function (encoded.append(freqs)) to: encoded.extend(freqs.split('\n'))
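Note that the strings shown in the question carry leading spaces (' 0.0783'), so splitting on '\n' alone keeps that padding. Since the values are floats to begin with, it may be simpler to skip the string step entirely. Below is a minimal sketch of that idea, not a tested drop-in. It assumes the columns are numeric, and the small 3x3 sub_matrix is a made-up stand-in for the real 20x20 file.

```python
import pandas as pd

# Toy 3x3 stand-in for the 20x20 matrix (made-up values, for illustration only).
sub_matrix = pd.DataFrame({
    "A": [0.5, 0.25, 0.25],
    "R": [0.125, 0.75, 0.125],
    "N": [0.25, 0.25, 0.5],
})

def encode(pep):
    """Return one flat list of floats: the matrix column of each residue, in order."""
    encoded = []
    for aa in pep:
        if aa in sub_matrix.columns:
            # .tolist() gives plain Python floats, so no string formatting is involved
            encoded.extend(sub_matrix[aa].tolist())
    return encoded

flat = encode("ARX")  # "X" is not a column here, so it is skipped
print(flat)
```

If strings are really what you want, something like [str(v) for v in flat] on the result should work, but floats are usually more convenient for a model.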


Thank you. I have actually realised this simple thing myself! Feels a bit silly, thanks.
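On the second part of the question (one list versus a list per peptide): one possible layout, offered only as a sketch, is a 2D array per peptide, stacked into a single 3D array for the whole set. This assumes all peptides have the same length, as the two in peps do, and that every residue is a column of the matrix. The toy sub_matrix and the helper name encode_2d are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in matrix (made-up values); the real one would be 20x20.
sub_matrix = pd.DataFrame({
    "A": [0.5, 0.25, 0.25],
    "R": [0.125, 0.75, 0.125],
    "N": [0.25, 0.25, 0.5],
})

def encode_2d(pep):
    """Encode one peptide as a (length, n_rows) float32 array, one row per residue."""
    # Selecting with a list of labels returns the columns in peptide order
    # (repeats allowed); an unknown residue raises KeyError.
    # Transposing puts the residues on the first axis.
    return sub_matrix[list(pep)].to_numpy(dtype=np.float32).T

peps = ["ARN", "NNA"]  # equal-length toy peptides
batch = np.stack([encode_2d(p) for p in peps])  # (n_peptides, length, n_rows)
print(batch.shape)
```

With the real matrix and the two 9-residue peptides from the question, the array would have shape (2, 9, 20). It can then be handed to PyTorch with torch.from_numpy(batch), and batch.reshape(len(peps), -1) gives one flat row per peptide if the model expects that. Peptides of different lengths would need padding first, which this sketch does not cover.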
