Consensus sequence software?
3
4
Entering edit mode
8.0 years ago

I am looking for a command line tool to get a consensus sequence, with a proper example usage not from a paper where everything is theoretical or from a software suite. I have reviewed options but quality of answers and documentation is bad. For example there is UGENE but they don't provide a command line. Also the Wikipedia page does not list any software or source code. A previous post in Biostars where the question was about Consensus Sequence and then people started suggesting Sequence Logos? I don't want sequence logo :)

Some people links to a program called USEARCH, but there is no command line usage example and it seems is not open source or the code is not available (documentation looks confusing too). Same happens with a software called CD-HIT. The biopython considers two algorithms: One is "dumb" and another is based in "Position Specific Score Matrices". However I don't want to go through python, or perl, ruby, etc. Another one called Consensus Maker from the HIV sequence database is web-based and not open source.

Situation is worst after reading the UGENE manual. They describe several algorithms:

• JalView (Default) - it is based on the JalView algorithm. Returns '+' if there are 2 characters with high frequency. Returns symbol in lower case if the symbol content in a row is lower than the specified threshold.
• ClustalW - emulates the ClustalW program and file format behavior.
• Levitsky - this algorithm is proposed by Victor Levitsky to calculate consensus of DNA alignments. At first, it collects global alignment frequencies for every symbol using extended (15 symbols) DNA alphabet. Then, for every column it selects the rarest symbol in the whole alignment with percentage in the column greater or equals to the threshold value.
• Strict - the algorithm returns gap character ('-') if symbol frequency in a column is lower than the threshold specified.

So there are lot of links and explanations but no command line?

sequence consensus • 10k views
1
Entering edit mode

Wow! What an exhaustive research!!! Wouldn't it have been just faster to code it yourself?

I compute the consensus sequence from the Position-Weight Matrices (PWM). Biopython can do it (yes I know you don't want it -that's why it's a comment and not an answer).

2
Entering edit mode
8.0 years ago

To extract consensus in UGENE with the command line interface you need to use UGENE Workflow Designer. There is an element that extracts consensus from each input alignment. You can create a workflow that reads alignment, extracts consensus and output consensus sequences to files. Since any workflow can be run from command line that's exactly what you need. Actually, you can find the pre-built workflow in the Samples bar of the Workflow Designer.

So it will be something like: ugenecl --task=extract_consensus --in=in.aln --out=out.fa

P.S. These functions will be released only in UGENE 1.14, that it why you did not find it in the documentation. But to get them right now, you can use the development snapshot from here

1
Entering edit mode
8.0 years ago
Asaf 8.9k

The program phrophecy of EMBOSS package should do it.

An example of an EMBOSS server can be found here, you can download and install it locally using command line.

1
Entering edit mode

If the OP just want a consensus sequence from a multiple sequence alignment, then they could try the EMBOSS "Alignment Consensus" tools:

1
Entering edit mode
8.0 years ago
Daniel ★ 3.9k

I wrote this perl script to take in a fasta alignment and output a full IUPAC consensus and also list 50 probes of X size which are weighted for A/C/T/G. It might help:

Usage is:

./degenerate_code_generator alignment_file.fasta window_size outfile.txt

#!/usr/bin/perl
use strict;
use warnings;

use Bio::Seq;
use Bio::AlignIO;
use Bio::Tools::IUPAC;

my $usage = "USAGE:\t./degenerate_code_generator REFERENCE_FILE.fasta Probe_size OUTFILE.txt\n"; my$infile = $ARGV[0] or die$usage . "\n";
my $probesize =$ARGV[1] or die $usage . "\n"; my$outfile = $ARGV[2] or die$usage . "\n";

my $alnio = Bio::AlignIO->new(-file =>$infile, -format => 'fasta');

open OUTFILE, ">$outfile"; select OUTFILE; my$cs;

while (my $aln =$alnio->next_aln) {
$cs =$aln->consensus_iupac;
my $seq = Bio::Seq->new(-seq =>$cs, -alphabet => 'dna');
}

$cs =~ s/-//g; print "<FULL>$cs\n";

my $score; my$winsize = $probesize; my$i = 1;
my $cs_len = length($cs);
my %hash;

while ($i <$cs_len) {
my $window = substr($cs, $i,$winsize);
$i++; foreach (split(//,$window)){
if(m/A|C|T|G|U/){
$score += "5"; }elsif(m/R|Y|S|W|K|M/){$score += "1";
}
}
$hash{$window} = $score;$score = 0;
}

my @keys = sort { $hash{$b} <=> $hash{$a} } keys(%hash);
my @vals = @hash{@keys};

my $c = 0; #print "Score\tProbe\n"; while ($c < 51){
print "$vals[$c]\t$keys[$c]\n";
\$c++;
}