Question

How To Filter Smith-Waterman Alignment (With Water From Emboss) According To Similarity Or Score?

1

Entering edit mode

11.8 years ago

munch ▴ 310

I make pairwise alignment with ./water form the emboss package. I want all fasta identifier of the sequences which align with a similarity of > 60%. I can find no option to set a threshold for the identity, similarity or the alignment score. I can parse the output, but maybe you have any tips here, or I have overlooked something? Additional, the output does not contain the whole .fasta names of the match, only a part of it like:

output from ./water:

# Aligned_sequences: 2
# 1: 344TS28_contig130193 
# 2: 1648TS28_contig03869 
# Matrix: EDNAFULL
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 732
# Identity:     287/732 (39.2%)
# Similarity:   287/732 (39.2%)
# Gaps:         424/732 (57.9%)
# Score: 765.5

fasta identifier in sequence:

>gene_1|GeneMark.hmm|344TS28_contig130193
...

Is this normal, or a bug (maybe it will be correct, when I replace the |)? Thank you in advance!

Edit: If I increase the costs for gap, the output will contain only sequences with a higher similarity, but I can't detect any fixed score-threshold.

Edit2: The problems with the fasta identifiers will go away when I replace all | with _ (sed -i 's/|/_/g' seq.fasta)

alignment similarity filtering • 5.1k views

ADD COMMENT • link updated 11.8 years ago by Ben ★ 2.0k • written 11.8 years ago by munch ▴ 310

1

Entering edit mode

For 'fasta' format input EMBOSS will detect and parse a range of identifier formats, for example:

GCG with accession: >DB:ID ACC Des
GCG without accession: >DB:ID Des
NCBI fasta with 'gi': >gi|000|db|acc|id Des
NCBI-style fasta without 'gi': >db|acc|id Des

to extract the entry identifier. In these examples 'db' or 'DB' refers to the database name, 'acc' or 'ACC' is the entry accession, 'id' or 'ID' is the entry identifier or name, and 'Des' is the description.

If you do not want the fasta format header to be parsed to extract this information, but instead to use basic fasta format parsing where the first token on the header line is the identifier, in all cases then force the use of the 'pearson' format handling for the input by using '-sformat pearson' in your command line.

For more information about the handling of inputs in EMBOSS see the EMBOSS User Manual.

ADD REPLY • link 11.8 years ago by Hamish ★ 3.3k

score 2 · Answer 1 · 2013-10-02

water is just reporting the mathematically optimal alignment of each query sequence; there's no native thresholding, nor should there be if you think about it.

I used this implementation myself a while ago and calculated my own E-values by fitting an extreme value distribution to simulated raw alignment scores (using HMM generated query sequences, but that was for a special application with known compositional bias). Calculating the significance of a local alignment is surprisingly well-documented on the BLAST wikipedia page, but I wouldn't attempt to use the suggested distribution parameters as they're dependent on your scoring matrix and Gap/Extend penalties.