Question: Python- Function To Compare A Sequnces In Multisequence Fasta
0
gravatar for st.ph.n
6.0 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

I'm trying to write a python script, given a fasta, with multiple records, to count the number of matches and mismatches, excluding gaps. The fasta file was created using clustalw2 through python. I have the following, which will output the number of matches, mismatches, and gaps, given only two sequences in a list. However, my fasta file contains 7 sequences. I need help formatting this into a function to print out the number for all pairwise alignments.

The following code works if there are only two records in the fasta file. To compare the 7 sequences, I would need it to do 1:1, 1:2, 1:3 ..and so on. and again for 2:2, 2:3, 2:4, and so on. When i replace seqs[0] and seqs[1] with any other number say 2, 4 to compare those two items in the list, the code does not work.

for record in SeqIO.parse("ex.fasta", "fasta"):
rec_seq = record.seq
seqs.append(list(rec_seq))
    matchstr = ''
for (i, base) in enumerate(seqs[0]):
    base = seqs[0][i]
    for j in range(1,len(seqs)):
        if base == '-':
            base = 'G'
        elif seqs[0][i] == '-':
            base = 'G'
        elif seqs[1][i] == '-':
            base = 'G'
        elif base!=seqs[j][i]:
            base = 'X'
        elif base == seqs[j][i]:
            base = 'Y'
    matchstr = matchstr + base
print matchstr
D = matchstr.count('X')
print D
S = matchstr.count('Y')
print S
G = matchstr.count('G')
print G
fasta python function • 3.0k views
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by st.ph.n2.5k
1
gravatar for Xingyu Yang
6.0 years ago by
Xingyu Yang260
Atlanta
Xingyu Yang260 wrote:

Slightly changed your code, I think it works now. However, there're definetely better solution to these question.

   seqs=[]
   for record in SeqIO.parse("ex.fasta", "fasta"):
        rec_seq = record.seq
        seqs.append(str(rec_seq))
   for j in range(1,len(seqs)):
        matchstr=''
        for (i, base) in enumerate(seqs[0]):
            base = seqs[0][i]
            if base == '-' or seqs[j][i]=='-':
                base = 'G'
            elif base!=seqs[j][i]:
                base = 'X'
            elif base == seqs[j][i]:
                base = 'Y'
            matchstr = matchstr + base
        print matchstr   #### ATTENTION here, from here, they are not in the second loop!!!!!!!!!!!!!!!!!!
        D = matchstr.count('X')
        print D
        S = matchstr.count('Y')
        print S
        G = matchstr.count('G')
        print G
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Xingyu Yang260

Can you explain your changes? I know there were some indenting erros when entering the question. Your changes are good, however they do not solve the problem, and print the matchstr out everytime there is a match. I only want to print the matches, mismatches, and gaps one time, with the values for D, S, G printed one time. The code above works, if there are only two records in the fasta. Howeever there 7. I need to get the matches and mismatches between all of them: 1:2, 1:3, 1:4 and so on..as descrbied above. Replacing the values for another item in the list other than 0 and 1 do not work.

This is the output for a fasta containing only two sequences:

Sequence1 GGTACCTTTTAAGTCTC---CTGACTGGTTATATATGAGCCCTGAGTTTTAAACGTAGGCAC--GATGACCCATGCATCCATGTACC---GTCCCACAAATGTGATCACCACACTGTGTAGCGTACGTTGGGTCACTGGTACCCCTGTTATGCATGCCAATGTGCACATTGGACGTGTGACTGACTGTACCTGACTGAAA

Sequence2 GCTTCGTCGCACTTAACCTCCGGACAGGCTATAGAAGAGTCCTAAGTTTTCAACT--GGCACTTGATTCTCCATATATCCCTGTCCCCTTGTGTCAACAATTTGATCACCTCTCAGTGTAGCCTACTTTGAGTCCCTGGTATCCCAGTTATGAACGACTATGAACTCAC--TCCGTGTAACTGAATGTACCTGTCTG-AT

OUTPUT: YXYXYXYXXXYXXYXXYGGGYXYYYXYYXYYYYXYXYYYXYYYXYYYYYYXYYYXGGYYYYYGGYYYXXXYYYYXXYYYYXYYYXYYGGGYYXXYYXXYYYXYYYYYYYYXYXYXYYYYYYYXYYYXYYYXYYYXYYYYYYXYYYXYYYYYYXYXYXYXYYYXXYXYYXGGXXYYYYYXYYYYYXYYYYYYYYXYYYGYX D=54 S=133 G=13

I need to loop through the code above, to do this for many sequences in a fasta.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by st.ph.n2.5k

Did you copy the indent correct? If there're only 7 sequences, it will only print it 7 times. The main loop for j is to compare your base sequence to (j+1)th sequence. For each j, print it out once.

I made some additional comment in the script, make sure you copy it correct.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Xingyu Yang260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1123 users visited in the last hour