Question: Code For Generating 'Consensus' Sequence From Pairwise Alignment
1
gravatar for Woa
5.7 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Hi All,

Let's say I have the following toy PAIRWISE GLOBAL alignment of two sequences:

aaaa----fghijk
||||.......***
aaaabcde---iij

I wish to generate the following 'consensus' or combined sequence

aaaabcdefghiij

The methodology is:where ever along the length of alignment, there is identity choose any one of the identical characters, when there is a gap in any one of the sequences, choose the non gap character from the other sequence, and when there is a mismatch, choose the corresponding character from the lower sequence.

The actual alignments are derived using Emboss "Needle". Are there scripts/modules available in (Bio)Perl / Python or any other tool to do this?

Thanks in advance

pairwise consensus • 3.3k views
ADD COMMENTlink modified 5.7 years ago by SES8.2k • written 5.7 years ago by Woa2.7k

Have you tried this? https://metacpan.org/module/Bio::AlignIO::emboss

ADD REPLYlink written 5.7 years ago by jing10
2
gravatar for Niek De Klein
5.7 years ago by
Niek De Klein2.5k
Netherlands
Niek De Klein2.5k wrote:

I don't know an existing module, but in python you can do it with this function

def consensus(sequence_1, sequence_2):
    consensus_sequence = ''
    for index, amino_acid in enumerate(sequence_1):
        if amino_acid == sequence_2[index]:
            consensus_sequence += amino_acid
        elif amino_acid == '-' and sequence_2[index] != '-':
            consensus_sequence += sequence_2[index]
        elif sequence_2[index] == '-' and amino_acid != '-':
            consensus_sequence += amino_acid
        elif amino_acid != sequence_2[index]:
            consensus_sequence += sequence_2[index]
    return consensus_sequence

Then test it with

sequence_1 = 'aaaa----fghijk'
sequence_2 = 'aaaabcde---iij'
consensus_seq = consensus(sequence_1, sequence_2)
print consensus_seq, consensus_seq == 'aaaabcdefghiij'           # gives aaaabcdefghiij True
ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Niek De Klein2.5k

This is a very useful piece of code. I want to ask if the consensus is to be desired for a multiple sequences alignment data, how will it be implemented?

ADD REPLYlink written 7 months ago by mdsiddra30

Haven't tested for edge cases but can try something like this

def consensus(*sequences):
    if(len(sequences) <= 1):
        raise RuntimeError('Only '+str(len(sequences))+' sequences given, need at least 2')

    consensus_sequence = ''
    for index, amino_acid in enumerate(sequences[0]):
        amino_acids = set([amino_acid])
        for sequence in sequences[1:]:
            amino_acids.add(sequence[index])
        amino_acids.discard('-')

        if len(amino_acids) == 1:
            consensus_sequence += amino_acids.pop()
        else:
            consensus_sequence += '-'
    return consensus_sequence
sequence_1 = 'aaaa----fghijk----'
sequence_2 = 'aaaabcde---iij----'
sequence_3 = 'aaaabcde---iij-l-n'
sequence_4 = 'aaaabcde---iijq-o-'
consensus_seq = consensus(sequence_1, sequence_2, sequence_3, sequence_4)
print (consensus_seq)
ADD REPLYlink modified 7 months ago • written 7 months ago by Niek De Klein2.5k

I was using a sequence file in fasta format. First I used it to extract sequences in a list, as follows:

fasta = {}
with open('seq.fas') as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        sequence = line
        fasta[active_sequence_name].append(sequence)
result_values = []
values = list(fasta.values()) # List of values
for val in values:
    result_values.append(val)
print (result_values)

and then I used your code for generating consensus:

def consensus(*sequences):
    if(len(sequences) <= 1):
        raise RuntimeError('Only '+str(len(sequences))+' sequences given, need at least 2')

    consensus_sequence = ''
    for index, amino_acid in enumerate(sequences[0]):
        amino_acids = set([amino_acid])
        for sequence in sequences[1:]:
            amino_acids.add(sequence[index])
        amino_acids.discard('-')

        if len(amino_acids) == 1:
            consensus_sequence += amino_acids.pop()
        else:
            consensus_sequence += '-'
    return consensus_sequence
consensus_seq = consensus(result_values)
print (consensus_seq)

Your code works well for the above condition with 4 sequences but it does not take the list as arguement which I as trying. Can you suggest something for this?

ADD REPLYlink written 7 months ago by mdsiddra30
1
gravatar for SES
5.7 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

Here is a one line BioPerl solution:

$ perl -MBio::AlignIO -e 'my $aln = Bio::AlignIO->new(-file => shift); my $str = $aln->next_aln(); print $str->consensus_string(50)' aln.fas
aaaabcdefghiij

If you want to consider DNA/RNA alignments, you will need to incorporate ambiguity codes. See How To Merge Consensus Motif To Degenerate Motif ? [Motif Merge Or Motif Clustering] for an example of how to expand this code into something for DNA/RNA.

ADD COMMENTlink written 5.7 years ago by SES8.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1468 users visited in the last hour