Question

Python- Function To Compare A Sequnces In Multisequence Fasta

0

Entering edit mode

11.3 years ago

st.ph.n ★ 2.7k

I'm trying to write a python script, given a fasta, with multiple records, to count the number of matches and mismatches, excluding gaps. The fasta file was created using clustalw2 through python. I have the following, which will output the number of matches, mismatches, and gaps, given only two sequences in a list. However, my fasta file contains 7 sequences. I need help formatting this into a function to print out the number for all pairwise alignments.

The following code works if there are only two records in the fasta file. To compare the 7 sequences, I would need it to do 1:1, 1:2, 1:3 ..and so on. and again for 2:2, 2:3, 2:4, and so on. When i replace seqs[0] and seqs[1] with any other number say 2, 4 to compare those two items in the list, the code does not work.

for record in SeqIO.parse("ex.fasta", "fasta"):
rec_seq = record.seq
seqs.append(list(rec_seq))
    matchstr = ''
for (i, base) in enumerate(seqs[0]):
    base = seqs[0][i]
    for j in range(1,len(seqs)):
        if base == '-':
            base = 'G'
        elif seqs[0][i] == '-':
            base = 'G'
        elif seqs[1][i] == '-':
            base = 'G'
        elif base!=seqs[j][i]:
            base = 'X'
        elif base == seqs[j][i]:
            base = 'Y'
    matchstr = matchstr + base
print matchstr
D = matchstr.count('X')
print D
S = matchstr.count('Y')
print S
G = matchstr.count('G')
print G

python fasta function • 4.9k views

ADD COMMENT • link 11.3 years ago by st.ph.n ★ 2.7k

score 1 · Answer 1 · 2014-03-19

1

Entering edit mode

11.3 years ago

Xingyu Yang ▴ 280

Slightly changed your code, I think it works now. However, there're definetely better solution to these question.

   seqs=[]
   for record in SeqIO.parse("ex.fasta", "fasta"):
        rec_seq = record.seq
        seqs.append(str(rec_seq))
   for j in range(1,len(seqs)):
        matchstr=''
        for (i, base) in enumerate(seqs[0]):
            base = seqs[0][i]
            if base == '-' or seqs[j][i]=='-':
                base = 'G'
            elif base!=seqs[j][i]:
                base = 'X'
            elif base == seqs[j][i]:
                base = 'Y'
            matchstr = matchstr + base
        print matchstr   #### ATTENTION here, from here, they are not in the second loop!!!!!!!!!!!!!!!!!!
        D = matchstr.count('X')
        print D
        S = matchstr.count('Y')
        print S
        G = matchstr.count('G')
        print G

ADD COMMENT • link 11.3 years ago by Xingyu Yang ▴ 280

0

Entering edit mode

Can you explain your changes? I know there were some indenting erros when entering the question. Your changes are good, however they do not solve the problem, and print the matchstr out everytime there is a match. I only want to print the matches, mismatches, and gaps one time, with the values for D, S, G printed one time. The code above works, if there are only two records in the fasta. Howeever there 7. I need to get the matches and mismatches between all of them: 1:2, 1:3, 1:4 and so on..as descrbied above. Replacing the values for another item in the list other than 0 and 1 do not work.

This is the output for a fasta containing only two sequences:

Sequence1 GGTACCTTTTAAGTCTC---CTGACTGGTTATATATGAGCCCTGAGTTTTAAACGTAGGCAC--GATGACCCATGCATCCATGTACC---GTCCCACAAATGTGATCACCACACTGTGTAGCGTACGTTGGGTCACTGGTACCCCTGTTATGCATGCCAATGTGCACATTGGACGTGTGACTGACTGTACCTGACTGAAA

Sequence2 GCTTCGTCGCACTTAACCTCCGGACAGGCTATAGAAGAGTCCTAAGTTTTCAACT--GGCACTTGATTCTCCATATATCCCTGTCCCCTTGTGTCAACAATTTGATCACCTCTCAGTGTAGCCTACTTTGAGTCCCTGGTATCCCAGTTATGAACGACTATGAACTCAC--TCCGTGTAACTGAATGTACCTGTCTG-AT

OUTPUT: YXYXYXYXXXYXXYXXYGGGYXYYYXYYXYYYYXYXYYYXYYYXYYYYYYXYYYXGGYYYYYGGYYYXXXYYYYXXYYYYXYYYXYYGGGYYXXYYXXYYYXYYYYYYYYXYXYXYYYYYYYXYYYXYYYXYYYXYYYYYYXYYYXYYYYYYXYXYXYXYYYXXYXYYXGGXXYYYYYXYYYYYXYYYYYYYYXYYYGYX D=54 S=133 G=13

I need to loop through the code above, to do this for many sequences in a fasta.

ADD REPLY • link 11.3 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Did you copy the indent correct? If there're only 7 sequences, it will only print it 7 times. The main loop for j is to compare your base sequence to (j+1)th sequence. For each j, print it out once.

I made some additional comment in the script, make sure you copy it correct.

ADD REPLY • link 11.3 years ago by Xingyu Yang ▴ 280