Question: How To Filter Smith-Waterman Alignment (With Water From Emboss) According To Similarity Or Score?
1
gravatar for munch
5.6 years ago by
munch300
Munich
munch300 wrote:

I make pairwise alignment with ./water form the emboss package. I want all fasta identifier of the sequences which align with a similarity of > 60%. I can find no option to set a threshold for the identity, similarity or the alignment score. I can parse the output, but maybe you have any tips here, or I have overlooked something? Additional, the output does not contain the whole .fasta names of the match, only a part of it like:

output from ./water:

# Aligned_sequences: 2
# 1: 344TS28_contig130193 
# 2: 1648TS28_contig03869 
# Matrix: EDNAFULL
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 732
# Identity:     287/732 (39.2%)
# Similarity:   287/732 (39.2%)
# Gaps:         424/732 (57.9%)
# Score: 765.5

fasta identifier in sequence:

>gene_1|GeneMark.hmm|344TS28_contig130193
...

Is this normal, or a bug (maybe it will be correct, when I replace the |)? Thank you in advance!

Edit: If I increase the costs for gap, the output will contain only sequences with a higher similarity, but I can't detect any fixed score-threshold.

Edit2: The problems with the fasta identifiers will go away when I replace all | with _ (sed -i 's/|/_/g' seq.fasta)

similarity alignment filtering • 2.9k views
ADD COMMENTlink modified 5.6 years ago by Ben2.0k • written 5.6 years ago by munch300
1

For 'fasta' format input EMBOSS will detect and parse a range of identifier formats, for example:

  • GCG with accession: >DB:ID ACC Des
  • GCG without accession: >DB:ID Des
  • NCBI fasta with 'gi': >gi|000|db|acc|id Des
  • NCBI-style fasta without 'gi': >db|acc|id Des

to extract the entry identifier. In these examples 'db' or 'DB' refers to the database name, 'acc' or 'ACC' is the entry accession, 'id' or 'ID' is the entry identifier or name, and 'Des' is the description.

If you do not want the fasta format header to be parsed to extract this information, but instead to use basic fasta format parsing where the first token on the header line is the identifier, in all cases then force the use of the 'pearson' format handling for the input by using '-sformat pearson' in your command line.

For more information about the handling of inputs in EMBOSS see the EMBOSS User Manual.

ADD REPLYlink written 5.6 years ago by Hamish3.1k
2
gravatar for Ben
5.6 years ago by
Ben2.0k
Edinburgh, UK
Ben2.0k wrote:

water is just reporting the mathematically optimal alignment of each query sequence; there's no native thresholding, nor should there be if you think about it.

I used this implementation myself a while ago and calculated my own E-values by fitting an extreme value distribution to simulated raw alignment scores (using HMM generated query sequences, but that was for a special application with known compositional bias). Calculating the significance of a local alignment is surprisingly well-documented on the BLAST wikipedia page, but I wouldn't attempt to use the suggested distribution parameters as they're dependent on your scoring matrix and Gap/Extend penalties.

ADD COMMENTlink written 5.6 years ago by Ben2.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1942 users visited in the last hour