Question: Return number of mismatches for multiple sequences
0
gravatar for genya35
13 months ago by
genya3520
genya3520 wrote:

Hello,

I'm looking for an automated way to align multiple fasta files ~(50,000) to a custom made reference (fasta) and return the number of mismatched for each sequence. Each sequence and the reference are about 295 bp long. I have used bwa and bowtie to align sequences in the past but not sure if they can return the number of mismatches. I'm looking for a simplest way to accomplish this task.

Thanks your advice in advance

alignment • 299 views
ADD COMMENTlink modified 13 months ago by dariober11k • written 13 months ago by genya3520

You are looking for an 'all-vs-one' pairwise alignment?

Off the top of my head I can't think of one that I know for sure gives the number of mismatches, but there are plenty of aligners to choose from: https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Pairwise_alignment

It may just be a case of using a heavy duty program to do the aligning, and a small script to count the mismatches after the fact (but I'm sure theres a program that will have what you need.

ADD REPLYlink modified 13 months ago • written 13 months ago by Joe16k
0
gravatar for dariober
13 months ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

A while back I had a similar problem and I ended up writing SequenceMatcher, a wrapper around the alignment tools in BioJava:

Basically:

java -jar SequenceMatcher.jar match -a reference.fa -b queries.fa

The output gives you the number of mismatches (edit distance in fact) in column NM plus a number of similarity distances (Levenshtein, Hamming, Jaro-Winkler) and other metrics:

seq_A  seq_B  strand  LD  HD  JWD   len_A  len_B  pos  NM  aln_score  n_ident  n_sim  pct_ident  len_aln  alnA                                     alnB
ref    q01    +       8   8   0.95  25     33     1    8   107        25       25     0.76       33       ATCAACTAACGAGCAAAATAACCAG--------        ATCAACTAACGAGCAAAATAACCAGCTAACATC
ref    q02    +       18  24  0.78  25     33     1    19  0          14       14     0.36       39       -----ATCAACT---------AACGAGCAAAATAACCAG  ACAATATTAACTTTAAATGTAAATGGACTAAAT------
ref    q03    +       17  25  0.78  25     33     1    17  -3         17       17     0.50       34       A--TCAACTAACGAG-CAA-AATAACCAG-----       AATTGGA-TAAAGAGTCAAGACCCATCAGTGTGC
ref    q04    +       16  24  0.81  25     33     1    18  9          17       17     0.49       35       ATCAACTAAC---GA-GCAAAATAACCA------G      AGAGACACACATAGACTCAAAATAA--AAGGATGG
ref    q05    +       18  26  0.00  25     33     1    20  -15        15       15     0.41       37       ATCAACTAACG-------AGCAA----AATAAC-CAG    --CAAAAAAAGTAGGGGTTGCAATCCTAGT--CTCTG
ref    q06    +       17  27  0.77  25     33     1    19  13         17       17     0.46       37       ATCAACTAACGAGCAAAATAA--CCA----------G    -TCAA---AAGAGACAAAGAAGGCCATTACATAATGG
ref    q07    +       8   8   0.95  25     33     1    8   107        25       25     0.76       33       ATCAACTAACGAGCAAAATAACCAG--------        ATCAACTAACGAGCAAAATAACCAGCTAACATC

The advantage over other aligners is that every sequence in query.fa is optimally aligned to every sequence in reference.fa (many to one, in your case) regardless of the number of mismatches. I.e. there is no filtering for alignment score or similar and no drop in alignment quality to favour speed. (This comes at the expense of computational efficiency, of course, but for ~50000 sequence vs 1 of length ~300bp it should be fast enough)

PS: bwa and bowtie should give the number of mismatches/edit distance in the NM tag

ADD COMMENTlink written 13 months ago by dariober11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1792 users visited in the last hour