I'm trying to write a python script, given a fasta, with multiple records, to count the number of matches and mismatches, excluding gaps. The fasta file was created using clustalw2 through python. I have the following, which will output the number of matches, mismatches, and gaps, given only two sequences in a list. However, my fasta file contains 7 sequences. I need help formatting this into a function to print out the number for all pairwise alignments.
The following code works if there are only two records in the fasta file. To compare the 7 sequences, I would need it to do 1:1, 1:2, 1:3 ..and so on. and again for 2:2, 2:3, 2:4, and so on. When i replace seqs and seqs with any other number say 2, 4 to compare those two items in the list, the code does not work.
for record in SeqIO.parse("ex.fasta", "fasta"): rec_seq = record.seq seqs.append(list(rec_seq)) matchstr = '' for (i, base) in enumerate(seqs): base = seqs[i] for j in range(1,len(seqs)): if base == '-': base = 'G' elif seqs[i] == '-': base = 'G' elif seqs[i] == '-': base = 'G' elif base!=seqs[j][i]: base = 'X' elif base == seqs[j][i]: base = 'Y' matchstr = matchstr + base print matchstr D = matchstr.count('X') print D S = matchstr.count('Y') print S G = matchstr.count('G') print G
Can you explain your changes? I know there were some indenting erros when entering the question. Your changes are good, however they do not solve the problem, and print the matchstr out everytime there is a match. I only want to print the matches, mismatches, and gaps one time, with the values for D, S, G printed one time. The code above works, if there are only two records in the fasta. Howeever there 7. I need to get the matches and mismatches between all of them: 1:2, 1:3, 1:4 and so on..as descrbied above. Replacing the values for another item in the list other than 0 and 1 do not work.
This is the output for a fasta containing only two sequences:
OUTPUT: YXYXYXYXXXYXXYXXYGGGYXYYYXYYXYYYYXYXYYYXYYYXYYYYYYXYYYXGGYYYYYGGYYYXXXYYYYXXYYYYXYYYXYYGGGYYXXYYXXYYYXYYYYYYYYXYXYXYYYYYYYXYYYXYYYXYYYXYYYYYYXYYYXYYYYYYXYXYXYXYYYXXYXYYXGGXXYYYYYXYYYYYXYYYYYYYYXYYYGYX D=54 S=133 G=13
I need to loop through the code above, to do this for many sequences in a fasta.
Did you copy the indent correct? If there're only 7 sequences, it will only print it 7 times. The main loop for j is to compare your base sequence to (j+1)th sequence. For each j, print it out once.
I made some additional comment in the script, make sure you copy it correct.