Question

Return number of mismatches for multiple sequences

0

Entering edit mode

5.3 years ago

genya35 ▴ 40

Hello,

I'm looking for an automated way to align multiple fasta files ~(50,000) to a custom made reference (fasta) and return the number of mismatched for each sequence. Each sequence and the reference are about 295 bp long. I have used bwa and bowtie to align sequences in the past but not sure if they can return the number of mismatches. I'm looking for a simplest way to accomplish this task.

Thanks your advice in advance

alignment • 1.4k views

ADD COMMENT • link updated 5.3 years ago by dariober 14k • written 5.3 years ago by genya35 ▴ 40

0

Entering edit mode

You are looking for an 'all-vs-one' pairwise alignment?

Off the top of my head I can't think of one that I know for sure gives the number of mismatches, but there are plenty of aligners to choose from: https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Pairwise_alignment

It may just be a case of using a heavy duty program to do the aligning, and a small script to count the mismatches after the fact (but I'm sure theres a program that will have what you need.

ADD REPLY • link 5.3 years ago by Joe 21k

score 0 · Answer 1 · 2019-01-07

A while back I had a similar problem and I ended up writing SequenceMatcher, a wrapper around the alignment tools in BioJava:

Basically:

java -jar SequenceMatcher.jar match -a reference.fa -b queries.fa

The output gives you the number of mismatches (edit distance in fact) in column NM plus a number of similarity distances (Levenshtein, Hamming, Jaro-Winkler) and other metrics:

seq_A  seq_B  strand  LD  HD  JWD   len_A  len_B  pos  NM  aln_score  n_ident  n_sim  pct_ident  len_aln  alnA                                     alnB
ref    q01    +       8   8   0.95  25     33     1    8   107        25       25     0.76       33       ATCAACTAACGAGCAAAATAACCAG--------        ATCAACTAACGAGCAAAATAACCAGCTAACATC
ref    q02    +       18  24  0.78  25     33     1    19  0          14       14     0.36       39       -----ATCAACT---------AACGAGCAAAATAACCAG  ACAATATTAACTTTAAATGTAAATGGACTAAAT------
ref    q03    +       17  25  0.78  25     33     1    17  -3         17       17     0.50       34       A--TCAACTAACGAG-CAA-AATAACCAG-----       AATTGGA-TAAAGAGTCAAGACCCATCAGTGTGC
ref    q04    +       16  24  0.81  25     33     1    18  9          17       17     0.49       35       ATCAACTAAC---GA-GCAAAATAACCA------G      AGAGACACACATAGACTCAAAATAA--AAGGATGG
ref    q05    +       18  26  0.00  25     33     1    20  -15        15       15     0.41       37       ATCAACTAACG-------AGCAA----AATAAC-CAG    --CAAAAAAAGTAGGGGTTGCAATCCTAGT--CTCTG
ref    q06    +       17  27  0.77  25     33     1    19  13         17       17     0.46       37       ATCAACTAACGAGCAAAATAA--CCA----------G    -TCAA---AAGAGACAAAGAAGGCCATTACATAATGG
ref    q07    +       8   8   0.95  25     33     1    8   107        25       25     0.76       33       ATCAACTAACGAGCAAAATAACCAG--------        ATCAACTAACGAGCAAAATAACCAGCTAACATC

The advantage over other aligners is that every sequence in query.fa is optimally aligned to every sequence in reference.fa (many to one, in your case) regardless of the number of mismatches. I.e. there is no filtering for alignment score or similar and no drop in alignment quality to favour speed. (This comes at the expense of computational efficiency, of course, but for ~50000 sequence vs 1 of length ~300bp it should be fast enough)

PS: bwa and bowtie should give the number of mismatches/edit distance in the NM tag