Question: Extracting pairwise alignment info
gravatar for Kevin D
3.0 years ago by
Kevin D30
INRA France
Kevin D30 wrote:

Hello everyone,

I am trying to write a bash script to extract information from a pairwise alignment text file output (Emboss Needle). In the txt output, there is a header as follow:

# Aligned_sequences: 2
# 1: RC0505_OB1
# 2: RC0505_OB2
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
# Length: 866
# Identity:     804/866 (92.8%)
# Similarity:   804/866 (92.8%)
# Gaps:          53/866 ( 6.1%)
# Score: 3881.5

I want to create a table with only one line such as i get the sequence ID, length, identity and gaps info.

So in this example, I basically want to create a table like this:

RC0505 866 804 53

I manage to extract the full line containing the Identity info for instance, using sed:

sed -n '/Identity/p' aln.txt

But then I can't figure out how to extract just the "804" part and put it in a table. And do this for the other parameters. I think that grep sed and awk functions could help but I don't know how to use them efficiently.

Any idea ?

Many thanks

ADD COMMENTlink modified 3.0 years ago by 5heikki8.6k • written 3.0 years ago by Kevin D30
gravatar for Joe
3.0 years ago by
United Kingdom
Joe16k wrote:

EDIT After more info from the OP, this solution doesn't work in all cases where the fractions digit lengths vary.

This is neither pretty nor elegant, but it works. A more talented/persistent programmer will almost certainly be able to do it in one line without needing to recall cat each time.


# Usage: $ bash file

percID=$(cat $1 | grep "Identity" | cut -d ' ' -f 7)
name=$(cat $1 | grep "1:" | cut -d ' ' -f 3)
length=$(cat $1 | grep "Length" | cut -d ' ' -f 3)
gaps=$(cat $1 | grep "Gaps" | cut -d ' ' -f 12)

echo -e "$name\t$length\t${percID%/*}\t${gaps%/*}"

The script prints the content of the file with cat $1 ($1 is a built-in bash variable that corresponds to the first argument on the commandline - $0 is the script name itself). It then pipes ( | ) the file to grep to find the distinguishing feature of a line (the strings "Identity", "1:", "Length" and "Gaps".) This 'filtered' file is then passed in to cut which is told the delimiter in your file is standard whitespace -d ' ' and which field you want -f.

The exception is if you were trying to cut a string that contained whitespace itself e.g. "Hello World" which cut would see as 2 different columns ["Hello", "World"].

Each of these piped commands is wrapped in variable=$(...) which tells Bash to evaluate the contents of the bracket, and save it to a variable, rather than just saving the string. This is called command substitution.

echo -e then simply prints the variables tab-separated with \t (the -e flag of echo is needed so that it interprets special characters as such.

As per your requested output format which contains no denominator in the fraction, one can use parameter expansion. A 'normal' variable recall would simply look like: $variable(usually surrounded by " ", and optionally { }).

If however, we recall the variable and use parameter expansion, we can strip parts of the variable name out e.g.:

$gaps == 53/866

but, ${gaps%/*} == 53 because we tell bash to strip after the first / we encounter (%/) and strip everything *.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Joe16k

Thank you very much! It also works well! Now I see better how to use the grep function coupled with cut. Have a nice day!

ADD REPLYlink written 3.0 years ago by Kevin D30

cut, grep, sed, awk, sort, uniq, cat, xargs, join.. (and more besides).. all super useful for some quick and dirty chopping up of text :)

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Joe16k

Yes, I'll study them a lot, I guess I coud save so much time with them. Anyway, thanks for your coment, it's my first step toward bash script and it helped me understand the way it works!

ADD REPLYlink written 3.0 years ago by Kevin D30
gravatar for 5heikki
3.0 years ago by
5heikki8.6k wrote:

By no means perfect but could be of help..

awk 'BEGIN{FS=" "} NR==4 || NR==11 || NR==13 {printf "%s\t", $3}' input.txt
ADD COMMENTlink written 3.0 years ago by 5heikki8.6k

Thank you, your solution is also nice because it takes the whole fraction "804/866" or "53/866" regardless of the number of digits. I mean, my final goal is to run the script on several Emboss output files and sometimes match numbers or gaps numbers have 2 or 3 or 4 digits so I guess jrj.healey's solution will not work fine in those case since his script seems to count the position of the fraction in this particular case where the match numbers has 3 digits and the gaps number has 2 digits. Am I understanding right?

ADD REPLYlink written 3.0 years ago by Kevin D30

If you want the whole fraction (that was not what you asked for in your example output), you can modify my script to easily output the full fraction by removing the %/* parts in the last line. I've edited my answer to explain what it's doing as well.


echo -e "$name\t$length\t${percID%/*}\t${gaps%/*}"

Current output:

RC0505_OB1  866 804 53


echo -e "$name\t$length\t$percID\t$gaps"


RC0505_OB1  866 804/866 53/866
ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Joe16k

Ok! Actually the "issue" is not of having the whole fraction. Let's say I have a second Emboss output file and the parameters are now:

# Length: 1076
# Identity:    1062/1076 (98.7%)
# Similarity:  1061/1076 (98.6%)
# Gaps:           0/1076 ( 0.0%)
# Score: 5248.0

Then the script will return:

RC0058_OB1      1076    (98.7%)

I think this is because the script assumes the Identity score always starts at the 7th position. In fact this is not true for every emboss output file but for sure you couldn't know as I didn't mention this.

ADD REPLYlink written 3.0 years ago by Kevin D30

Ah yeah I see the problem. My approach probably isn't going to work overall then in that case. Sorry!

ADD REPLYlink written 3.0 years ago by Joe16k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 662 users visited in the last hour