How To Filter Smith-Waterman Alignment (With Water From Emboss) According To Similarity Or Score?
1
1
Entering edit mode
10.6 years ago
munch ▴ 310

I make pairwise alignment with ./water form the emboss package. I want all fasta identifier of the sequences which align with a similarity of > 60%. I can find no option to set a threshold for the identity, similarity or the alignment score. I can parse the output, but maybe you have any tips here, or I have overlooked something? Additional, the output does not contain the whole .fasta names of the match, only a part of it like:

output from ./water:

# Aligned_sequences: 2
# 1: 344TS28_contig130193 
# 2: 1648TS28_contig03869 
# Matrix: EDNAFULL
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 732
# Identity:     287/732 (39.2%)
# Similarity:   287/732 (39.2%)
# Gaps:         424/732 (57.9%)
# Score: 765.5

fasta identifier in sequence:

>gene_1|GeneMark.hmm|344TS28_contig130193
...

Is this normal, or a bug (maybe it will be correct, when I replace the |)? Thank you in advance!

Edit: If I increase the costs for gap, the output will contain only sequences with a higher similarity, but I can't detect any fixed score-threshold.

Edit2: The problems with the fasta identifiers will go away when I replace all | with _ (sed -i 's/|/_/g' seq.fasta)

alignment similarity filtering • 4.6k views
ADD COMMENT
1
Entering edit mode

For 'fasta' format input EMBOSS will detect and parse a range of identifier formats, for example:

  • GCG with accession: >DB:ID ACC Des
  • GCG without accession: >DB:ID Des
  • NCBI fasta with 'gi': >gi|000|db|acc|id Des
  • NCBI-style fasta without 'gi': >db|acc|id Des

to extract the entry identifier. In these examples 'db' or 'DB' refers to the database name, 'acc' or 'ACC' is the entry accession, 'id' or 'ID' is the entry identifier or name, and 'Des' is the description.

If you do not want the fasta format header to be parsed to extract this information, but instead to use basic fasta format parsing where the first token on the header line is the identifier, in all cases then force the use of the 'pearson' format handling for the input by using '-sformat pearson' in your command line.

For more information about the handling of inputs in EMBOSS see the EMBOSS User Manual.

ADD REPLY
2
Entering edit mode
10.6 years ago
Ben ★ 2.0k

water is just reporting the mathematically optimal alignment of each query sequence; there's no native thresholding, nor should there be if you think about it.

I used this implementation myself a while ago and calculated my own E-values by fitting an extreme value distribution to simulated raw alignment scores (using HMM generated query sequences, but that was for a special application with known compositional bias). Calculating the significance of a local alignment is surprisingly well-documented on the BLAST wikipedia page, but I wouldn't attempt to use the suggested distribution parameters as they're dependent on your scoring matrix and Gap/Extend penalties.

ADD COMMENT

Login before adding your answer.

Traffic: 2846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6