How to create a dataset from a sequence file in Python
8.7 years ago
Jason Lin • 0

I have a protein sequence file that looks like this:

>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL       -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX


The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether any coordinates are missing. In this case, notice the two "X" characters at the end. They mean that the last two residues of the sequence, "NL" here, have missing coordinates.

Using Python, I would like to generate a table with the following columns:

1. name of the sequence
2. total number of missing coordinates (which is the number of X)
3. the range of these missing coordinates (which is the range of the position of those X)
4. the length of the sequence
5. the actual sequence

So the final result should look like this:

>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL


And my code looks like this so far:

total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()

        # Assign the fields by their position in the list
        header = split_list[0]                             # 1
        seq = split_list[1]                                # 5
        disorder = split_list[2]

        # count sequence length and total residues with missing coordinates
        sequence_length = len(seq)                         # 4

        counts = 0                                         # start the count before the loop
        for x in disorder:
            if x == 'X':
                counts = counts + 1

        total_seq.append([header, seq, str(counts)])   # obviously I haven't finished coding 2 & 3

with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol) + '\n')


I'm new to Python. Could anyone help, please? Thank you so much!

python • 4.5k views

It helped. But I still don't understand how to solve items 2 and 3 of my goal: the total number of missing coordinates and the range of those missing coordinates.

8.7 years ago
Zhaorong ★ 1.4k

For question 2):

disorder = '---XX--XXX--'
print(disorder.count('X'))


This uses the string's count() method.
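Applied to the format in the question, a minimal sketch might look like the following. It assumes each record is a single line with three whitespace-separated fields (name, sequence, disorder string), as in the sample; `summarize_counts` is a hypothetical helper name and the record passed to it is a made-up toy line, not real data.

```python
def summarize_counts(line):
    """Return (name, number of 'X', sequence length, sequence) for one record."""
    # Assumption: name, sequence and disorder string are separated by whitespace.
    name, seq, disorder = line.split()
    return name, disorder.count('X'), len(seq), seq

# Toy record for illustration only
name, missing, length, seq = summarize_counts('>1ABC:A MKTAYIAK ------XX')
print(name, missing, length)
```

This covers columns 1, 2, 4 and 5 of the table; the range in column 3 needs the positions of the X characters, not just their count.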

For question 3):

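from itertools import groupby, count

# 0-based positions of every 'X'; use enumerate(disorder, 1) for 1-based positions
indices = [i for i, x in enumerate(disorder) if x == 'X']

def as_range(iterable): # not sure how to do this part elegantly
    l = list(iterable)
    if len(l) > 1:
        return '{0}-{1}'.format(l[0], l[-1])
    else:
        return '{0}'.format(l[0])

# consecutive indices share the same value of n - next(c), so groupby collects each run
print(','.join(as_range(g) for _, g in groupby(indices, key=lambda n, c=count(): n - next(c))))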
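If the groupby idiom feels opaque, one alternative sketch swaps it for `re.finditer`, which finds each run of consecutive X characters directly. It reports 1-based, inclusive positions, which is an assumption based on the "163-164" in the question's expected output; `missing_ranges` is a hypothetical helper name.

```python
import re

def missing_ranges(disorder):
    """Return 1-based ranges of consecutive 'X' characters, comma-separated."""
    ranges = []
    for m in re.finditer(r'X+', disorder):
        # m.start() is 0-based; m.end() is exclusive, so it already equals
        # the 1-based position of the last 'X' in the run.
        start, end = m.start() + 1, m.end()
        ranges.append(str(start) if start == end else '{0}-{1}'.format(start, end))
    return ','.join(ranges)

print(missing_ranges('---XX--XXX--'))  # 4-5,8-10
```

A string with no X gives an empty string, so the table column would simply be blank for sequences with complete coordinates.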


The groupby/count grouping is more complicated than the count() call. You may want to read these: