Question

How to grab a column out of a pandas data frame for encoding

4.5 years ago

pracharmara • 0

Hi everyone, I'm trying to use a substitution matrix to encode peptide sequences with pandas, as input for a PyTorch ML model. The substitution matrix has 20 rows and 20 columns. I want to replace each letter of a peptide in peps, e.g. "A", with the 20 values in column A of sub_matrix. I tried .to_string(index=False), but it returns extra characters (spaces and newlines), and the values are actually floats, not strings, so it's definitely not ideal.

What can I use instead to get only the values, without spaces and newlines?

I would also be very glad if anyone could suggest the best way to process this data in PyTorch. I have previously used some ML packages in R, where all the values would be in one data frame. Is it better to put everything in one list and then convert it to a tensor, or to keep a separate list for each peptide?

my pandas data frame:

import pandas as pd

sub_matrix = pd.read_csv('blosum62_pd_ori.txt', header=0, nrows=20)

sub_matrix

    A       R         ...  V

0   0.2901  0.0310    ...  0.0688
1   0.0446  0.3450    ...  0.0310
2   0.0427  0.0449    ...  0.0270
... ...   ...         ...  ...
17  0.0303  0.0227    ...  0.0303
18  0.0405  0.0280    ...  0.0467
19  0.0700  0.0219    ...  0.2689

peps = ['GARRNDACE', 'QEERGGDPA']

the code:

def encode(pep):
    AAs = list(pep)
    encoded = []
    for aa in AAs:
        if aa in sub_matrix.columns:
            freqs = sub_matrix[aa].to_string(index=False)
            encoded.append(freqs)
    return encoded

for pep in peps:
    print(encode(pep))

I would like the output to be one non-nested list of all values, like:

['0.0783', '0.0329', '0.0652', '0.0466', '0.0325', '0.0412', '0.0350', ..., '0.5101', '0.0382', '0.0206', '0.0213', '0.0432', '0.0281', '0.0254', '0.0233']

but now it is:

[' 0.0783\n 0.0329\n 0.0652\n 0.0161\n 0.0106\n 0.0103\n 0.0175\n 0.0178\n ... ,' 0.0405\n 0.0523\n 0.0494\n 0.0914\n 0.0163\n 0.1029\n 0.2965']

[' 0.0256\n 0.0484\n 0.0337\n 0.0299\n 0.0122\n 0.2147\n 0.0645\n 0.0189\n ... ,' 0.0338\n 0.0568\n 0.1099\n 0.0730\n 0.0303\n 0.0405\n 0.0700']

pandas python encoding arrays pre-processing • 734 views

score 1 · Answer 1 · 2019-10-28

4.5 years ago

shoujun.gu ▴ 350

Modify line 7 of your function (encoded.append(freqs)) to: encoded.extend(freqs.split('\n'))
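The split above still yields strings, and judging from the output in the question each one may keep a leading space. Since the values are floats to begin with, one possible alternative is to skip the string conversion entirely and extend the list with the column's numeric values. The sketch below is only an illustration: the 3x3 sub_matrix is a made-up stand-in for the real 20x20 matrix, and passing the matrix as a second argument to encode is a choice made here, not something from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: 3 residues with made-up values,
# standing in for the 20x20 substitution matrix from the question.
sub_matrix = pd.DataFrame({
    "A": [0.5, 0.25, 0.25],
    "R": [0.125, 0.75, 0.125],
    "G": [0.25, 0.25, 0.5],
})


def encode(pep, matrix):
    """Return one flat list of floats: the matrix column of each residue, in order."""
    encoded = []
    for aa in pep:
        # Residues without a matching column are skipped, as in the question's code.
        if aa in matrix.columns:
            # tolist() gives plain Python floats, so no string clean-up is needed.
            encoded.extend(matrix[aa].tolist())
    return encoded


# Equal-length peptides give equal-length rows, so they stack into a 2-D array.
peps = ["GAR", "RAG"]
features = np.array([encode(pep, sub_matrix) for pep in peps], dtype=np.float32)
print(features.shape)
```

With peptides of equal length, as in the question, this gives one row per peptide, and an array like this can be handed to PyTorch with torch.from_numpy(features). Peptides of different lengths would need padding or some other handling first, which this sketch does not cover.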

Thank you. I have actually realised this simple thing myself! Feels a bit silly, thanks.

Reply • 4.5 years ago by pracharmara • 0