Question

Local and global alignment combined

0

Entering edit mode

4.5 years ago

Mick ▴ 30

Hey guys,

for working with the sequencing data from our NGS-machine I need to look for short sequences within the reads, i.e. Primers, Barcodes etc.

I know there is a million tools for this out there, but for my own understanding I'm trying to write some python code, that performs this alignment for me.

I understand that you need the Needleman-Wunsch or Smith-Waterman Algorithm to do this. However I'm struggling to put the two together. Because for the Pattern I need global alignment because I'm looking for the full sequence of the barcode, but for the read I need local alignment because the pattern can be located at any point in the read.

I tried around with initializing the first row of the matrix to zeros and the first column to -1, -2, -3 etc. and this somewhat gives me the desired results. But when the pattern is very close to the end of the read sometimes it doesn't work out as I want.

As an example: read = "TTAACTGCCTGGTCTGGC" pattern = "TAACTGC"

My matrix looks like this:

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[-1. 0. 0. -1. -1. -1. 0. -1. -1. -1. 0. -1. -1. 0. -1. 0. -1. -1. -1.]

[-2. -1. -1. 0. -1. -2. -1. -1. -2. -2. -1. -1. -2. -1. -1. -1. -1. -2. -2.]

[-3. -2. -2. -1. 0. -1. -2. -2. -2. -3. -2. -2. -2. -2. -2. -2. -2. -2. -3.]

[-4. -3. -3. -2. -1. 0. -1. -2. -2. -2. -3. -3. -3. -3. -2. -3. -3. -3. -2.]

[-5. -4. -3. -3. -2. -1. 0. -1. -2. -3. -2. -3. -4. -3. -3. -2. -3. -4. -3.]

[-6. -5. -4. -4. -3. -2. -1. 0. -1. -2. -3. -2. -3. -4. -4. -3. -2. -3. -4.]

[-7. -6. -5. -5. -4. -3. -2. -1. 0. -1. -2. -3. -3. -4. -4. -4. -3. -3. -3.]]

Then I would backtrack from 8 to 2 and thats my alignment.

My question is how do professional programs tackle this problem, I couldn't find information anywhere on how to combine local and global alignment for finding barcodes in NGS-reads. Any help would be much appreciated.

alignment sequence • 1.9k views

ADD COMMENT • link 4.5 years ago by Mick ▴ 30

0

Entering edit mode

Semi-global sequence alignment: What if there are two equally greatest elements in the last row? - I guess you need this.

ADD REPLY • link 4.5 years ago by German.M.Demidov ★ 2.9k

0

Entering edit mode

Also look at this post: Could someone suggest a fast and efficient implementation of a glocal (semi-global) sequence aligner?

ADD REPLY • link 4.5 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thank you guys for the links. The images in the first link where actually very helpful for my understanding.

I think I have a better understanding now, of what I need to do. I have another question. For the semi-global alignment to work, do i need a positive bonus for matching characters like you do in the smith-waterman-algorithm or can I just use negative penalties for mismatch, inserts and deletions like the needleman-wunsch-algorithm and give a 0 for matching characters.

Thanks. :)

ADD REPLY • link 4.5 years ago by Mick ▴ 30

0

Entering edit mode

It was long time ago since I used these algorithms last time, but I am pretty sure that I've used the same scoring scheme for all of them. Usually score for match is bigger than score for others since you want to find the path with the biggest score, but I believe the difference between them matters, not the actual values themself (I can make parallels with likelihoods ratios, but it will overcomplicate things). So use any score you want, but keep score for match higher. Here is explained better: https://www.cs.cmu.edu/~02710/Lectures/ScoringMatrices2015.pdf

ADD REPLY • link 4.5 years ago by German.M.Demidov ★ 2.9k

0

Entering edit mode

Thank you guys for your replys. Thank you for the recommendation of the emboss tool. As i said, right now I'm trying to figure out how exactly these algorithms work, thats why I'm trying to write the code myself but when I start to work on my own pipeline I'll definitely consider using it.

So I decided to use positive score for matches and negative scores for mismatch and indels. I think for the fitted alignment, where you look for a smaller sequence in a longer sequence it doesn't really make a difference, but when you allow overlapping alignment (i.e. reverse primer at the end of the read, that might be truncated because it spans the end of the read) then positive score for matching characters would lead to a higher score for longer matches as opposed to really short matches at the end of the read.

ADD REPLY • link 4.5 years ago by Mick ▴ 30

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY fields for comments and replies. Answer field is reserved for solutions that can help resolve the original question.

ADD REPLY • link 4.5 years ago by Sej Modha 5.3k

0

Entering edit mode

Whatever works is good =) if you avoid 0-trap - you are fine

ADD REPLY • link 4.5 years ago by German.M.Demidov ★ 2.9k

score 1 · Answer 1 · 2019-10-07

As @kuckunniwid commented, you are looking for semi-global alignment (a.k.a glocal).

I recommend the needle tool from EMBOSS package. By default, needle performs semi-global alignment (i.e., end gap penalties are disabled). You can play with the tool on-line or install it locally (for example, in Ubuntu/Debian: apt-get install emboss). The tool is written in C++ and will be faster than any Python implementation.

Semi-global alignment of two sequences from your post:

########################################
# Program: needle
# Rundate: Mon  7 Oct 2019 18:21:25
# Commandline: needle
#    -auto
#    -stdout
#    -asequence emboss_needle-I20191007-182122-0130-64098088-p2m.asequence
#    -bsequence emboss_needle-I20191007-182122-0130-64098088-p2m.bsequence
#    -datafile EDNAFULL
#    -gapopen 10.0
#    -gapextend 0.5
#    -endopen 10.0
#    -endextend 0.5
#    -aformat3 pair
#    -snucleotide1
#    -snucleotide2
# Align_format: pair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: EMBOSS_001
# 2: EMBOSS_001
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 18
# Identity:       7/18 (38.9%)
# Similarity:     7/18 (38.9%)
# Gaps:          11/18 (61.1%)
# Score: 35.0
# 
#
#=======================================

EMBOSS_001         1 TTAACTGCCTGGTCTGGC     18
                      |||||||          
EMBOSS_001         1 -TAACTGC----------      7


#---------------------------------------
#---------------------------------------

Global alignment:

########################################
# Program: needle
# Rundate: Mon  7 Oct 2019 18:20:30
# Commandline: needle
#    -auto
#    -stdout
#    -asequence emboss_needle-I20191007-182027-0875-76717346-p2m.asequence
#    -bsequence emboss_needle-I20191007-182027-0875-76717346-p2m.bsequence
#    -datafile EDNAFULL
#    -gapopen 10.0
#    -gapextend 0.5
#    -endweight
#    -endopen 10.0
#    -endextend 0.5
#    -aformat3 pair
#    -snucleotide1
#    -snucleotide2
# Align_format: pair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: EMBOSS_001
# 2: EMBOSS_001
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 18
# Identity:       7/18 (38.9%)
# Similarity:     7/18 (38.9%)
# Gaps:          11/18 (61.1%)
# Score: 10.5
# 
#
#=======================================

EMBOSS_001         1 TTAACTGCCTGGTCTGGC     18
                      |||||          ||
EMBOSS_001         1 -TAACT----------GC      7


#---------------------------------------
#---------------------------------------