Python- Function To Compare A Sequnces In Multisequence Fasta
1
0
Entering edit mode
10.1 years ago
st.ph.n ★ 2.7k

I'm trying to write a python script, given a fasta, with multiple records, to count the number of matches and mismatches, excluding gaps. The fasta file was created using clustalw2 through python. I have the following, which will output the number of matches, mismatches, and gaps, given only two sequences in a list. However, my fasta file contains 7 sequences. I need help formatting this into a function to print out the number for all pairwise alignments.

The following code works if there are only two records in the fasta file. To compare the 7 sequences, I would need it to do 1:1, 1:2, 1:3 ..and so on. and again for 2:2, 2:3, 2:4, and so on. When i replace seqs[0] and seqs[1] with any other number say 2, 4 to compare those two items in the list, the code does not work.

for record in SeqIO.parse("ex.fasta", "fasta"):
rec_seq = record.seq
seqs.append(list(rec_seq))
    matchstr = ''
for (i, base) in enumerate(seqs[0]):
    base = seqs[0][i]
    for j in range(1,len(seqs)):
        if base == '-':
            base = 'G'
        elif seqs[0][i] == '-':
            base = 'G'
        elif seqs[1][i] == '-':
            base = 'G'
        elif base!=seqs[j][i]:
            base = 'X'
        elif base == seqs[j][i]:
            base = 'Y'
    matchstr = matchstr + base
print matchstr
D = matchstr.count('X')
print D
S = matchstr.count('Y')
print S
G = matchstr.count('G')
print G
python fasta function • 4.5k views
ADD COMMENT
1
Entering edit mode
10.1 years ago
Xingyu Yang ▴ 280

Slightly changed your code, I think it works now. However, there're definetely better solution to these question.

   seqs=[]
   for record in SeqIO.parse("ex.fasta", "fasta"):
        rec_seq = record.seq
        seqs.append(str(rec_seq))
   for j in range(1,len(seqs)):
        matchstr=''
        for (i, base) in enumerate(seqs[0]):
            base = seqs[0][i]
            if base == '-' or seqs[j][i]=='-':
                base = 'G'
            elif base!=seqs[j][i]:
                base = 'X'
            elif base == seqs[j][i]:
                base = 'Y'
            matchstr = matchstr + base
        print matchstr   #### ATTENTION here, from here, they are not in the second loop!!!!!!!!!!!!!!!!!!
        D = matchstr.count('X')
        print D
        S = matchstr.count('Y')
        print S
        G = matchstr.count('G')
        print G
ADD COMMENT
0
Entering edit mode

Can you explain your changes? I know there were some indenting erros when entering the question. Your changes are good, however they do not solve the problem, and print the matchstr out everytime there is a match. I only want to print the matches, mismatches, and gaps one time, with the values for D, S, G printed one time. The code above works, if there are only two records in the fasta. Howeever there 7. I need to get the matches and mismatches between all of them: 1:2, 1:3, 1:4 and so on..as descrbied above. Replacing the values for another item in the list other than 0 and 1 do not work.

This is the output for a fasta containing only two sequences:

Sequence1 GGTACCTTTTAAGTCTC---CTGACTGGTTATATATGAGCCCTGAGTTTTAAACGTAGGCAC--GATGACCCATGCATCCATGTACC---GTCCCACAAATGTGATCACCACACTGTGTAGCGTACGTTGGGTCACTGGTACCCCTGTTATGCATGCCAATGTGCACATTGGACGTGTGACTGACTGTACCTGACTGAAA

Sequence2 GCTTCGTCGCACTTAACCTCCGGACAGGCTATAGAAGAGTCCTAAGTTTTCAACT--GGCACTTGATTCTCCATATATCCCTGTCCCCTTGTGTCAACAATTTGATCACCTCTCAGTGTAGCCTACTTTGAGTCCCTGGTATCCCAGTTATGAACGACTATGAACTCAC--TCCGTGTAACTGAATGTACCTGTCTG-AT

OUTPUT: YXYXYXYXXXYXXYXXYGGGYXYYYXYYXYYYYXYXYYYXYYYXYYYYYYXYYYXGGYYYYYGGYYYXXXYYYYXXYYYYXYYYXYYGGGYYXXYYXXYYYXYYYYYYYYXYXYXYYYYYYYXYYYXYYYXYYYXYYYYYYXYYYXYYYYYYXYXYXYXYYYXXYXYYXGGXXYYYYYXYYYYYXYYYYYYYYXYYYGYX D=54 S=133 G=13

I need to loop through the code above, to do this for many sequences in a fasta.

ADD REPLY
0
Entering edit mode

Did you copy the indent correct? If there're only 7 sequences, it will only print it 7 times. The main loop for j is to compare your base sequence to (j+1)th sequence. For each j, print it out once.

I made some additional comment in the script, make sure you copy it correct.

ADD REPLY

Login before adding your answer.

Traffic: 2433 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6