Question: Return number of mismatches for multiple sequences
0
gravatar for genya35
4 months ago by
genya3510
genya3510 wrote:

Hello,

I'm looking for an automated way to align multiple fasta files ~(50,000) to a custom made reference (fasta) and return the number of mismatched for each sequence. Each sequence and the reference are about 295 bp long. I have used bwa and bowtie to align sequences in the past but not sure if they can return the number of mismatches. I'm looking for a simplest way to accomplish this task.

Thanks your advice in advance

alignment • 159 views
ADD COMMENTlink modified 4 months ago by dariober10k • written 4 months ago by genya3510

You are looking for an 'all-vs-one' pairwise alignment?

Off the top of my head I can't think of one that I know for sure gives the number of mismatches, but there are plenty of aligners to choose from: https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Pairwise_alignment

It may just be a case of using a heavy duty program to do the aligning, and a small script to count the mismatches after the fact (but I'm sure theres a program that will have what you need.

ADD REPLYlink modified 4 months ago • written 4 months ago by jrj.healey12k
0
gravatar for dariober
4 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

A while back I had a similar problem and I ended up writing SequenceMatcher, a wrapper around the alignment tools in BioJava:

Basically:

java -jar SequenceMatcher.jar match -a reference.fa -b queries.fa

The output gives you the number of mismatches (edit distance in fact) in column NM plus a number of similarity distances (Levenshtein, Hamming, Jaro-Winkler) and other metrics:

seq_A  seq_B  strand  LD  HD  JWD   len_A  len_B  pos  NM  aln_score  n_ident  n_sim  pct_ident  len_aln  alnA                                     alnB
ref    q01    +       8   8   0.95  25     33     1    8   107        25       25     0.76       33       ATCAACTAACGAGCAAAATAACCAG--------        ATCAACTAACGAGCAAAATAACCAGCTAACATC
ref    q02    +       18  24  0.78  25     33     1    19  0          14       14     0.36       39       -----ATCAACT---------AACGAGCAAAATAACCAG  ACAATATTAACTTTAAATGTAAATGGACTAAAT------
ref    q03    +       17  25  0.78  25     33     1    17  -3         17       17     0.50       34       A--TCAACTAACGAG-CAA-AATAACCAG-----       AATTGGA-TAAAGAGTCAAGACCCATCAGTGTGC
ref    q04    +       16  24  0.81  25     33     1    18  9          17       17     0.49       35       ATCAACTAAC---GA-GCAAAATAACCA------G      AGAGACACACATAGACTCAAAATAA--AAGGATGG
ref    q05    +       18  26  0.00  25     33     1    20  -15        15       15     0.41       37       ATCAACTAACG-------AGCAA----AATAAC-CAG    --CAAAAAAAGTAGGGGTTGCAATCCTAGT--CTCTG
ref    q06    +       17  27  0.77  25     33     1    19  13         17       17     0.46       37       ATCAACTAACGAGCAAAATAA--CCA----------G    -TCAA---AAGAGACAAAGAAGGCCATTACATAATGG
ref    q07    +       8   8   0.95  25     33     1    8   107        25       25     0.76       33       ATCAACTAACGAGCAAAATAACCAG--------        ATCAACTAACGAGCAAAATAACCAGCTAACATC

The advantage over other aligners is that every sequence in query.fa is optimally aligned to every sequence in reference.fa (many to one, in your case) regardless of the number of mismatches. I.e. there is no filtering for alignment score or similar and no drop in alignment quality to favour speed. (This comes at the expense of computational efficiency, of course, but for ~50000 sequence vs 1 of length ~300bp it should be fast enough)

PS: bwa and bowtie should give the number of mismatches/edit distance in the NM tag

ADD COMMENTlink written 4 months ago by dariober10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 593 users visited in the last hour