Question

Extracting Matching Subsequence From Pairwise Alignment

0

Entering edit mode

11.0 years ago

Woa ★ 2.9k

Hello All,

I've a pairwise global alignemnet of two DNA sequences generated by the program NEEDLE of EMBOSS package. I wish to extract the sub-sequence that matches/aligns to a given region of the other sequence. In this alignment (Pastebin Link) the given region (actually the CDS) falls between base number 24:485 in the original sequence with ID 'XM_001005073.' I wish to extract the sub-sequence in the sequence ID 'Homolog' that aligns with that 24:485 region of the other sequence.

I'm using Bioperl to parse the alignment. I find out the the alignment column numbers corresponding to 24:485 region in the particular sequence, using 'column_from_residue_number'. Then I extract the sub-sequence from the 'aligned' other sequence(containing gaps) using the corresponding column numbers. Finally I remove the gap characters.

Am I doing this thing correctly and are there any pitfalls ? Is there any better way to do it by (Bio)Perl/Python? The code goes here:

use strict;
use warnings;
use Bio::AlignIO;

# read in an alignment generated by the EMBOSS program Needle
my $in = new Bio::AlignIO(-format => 'emboss',
                  -file   => 'test_needle.aln');

while( my $aln = $in->next_aln ) {
          #Seqnames: 'XM_001005073.'(CDS:24-485),'Homolog'
          my ($cds_start,$cds_end)=(24,485);#
          my $col_cdsstart = $aln->column_from_residue_number( 'XM_001005073.', $cds_start);
          my $col_cdsend= $aln->column_from_residue_number( 'XM_001005073.', $cds_end);

          foreach my $seq ($aln->each_seq) {
                    if($seq->id() eq 'Homolog'){
                        my $homolog_cds=$seq->subseq($col_cdsstart,$col_cdsend);
                         $homolog_cds=~s/\-//g;
                        print $homolog_cds,"\n";
                    }

        }
}

Thanks in advance

alignment bioperl • 3.3k views

ADD COMMENT • link 11.0 years ago by Woa ★ 2.9k

0

Entering edit mode

I do not get the question, could you provide an small input and output example ?

ADD REPLY • link 11.0 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

The input is the alignment itself,as given in the pastebin link. The region 24:485 in the Sequence XM_001005073 is highlighted in the Genbank record http://img708.imageshack.us/img708/2469/cds24485.png. I wish to extract the subsequence from the Sequence named "Homolog' that aligns to the 24:485 region.In the alignment that corresponds to base number 1:308 in the 'Homolog' sequence. The output would therefore be subsequence 1:308.

ADD REPLY • link 11.0 years ago by Woa ★ 2.9k

0

Entering edit mode

The typical output of gapped and gaps-removed sequence http://pastebin.com/pfPn23Ud I've to use the gaps-removed sequence only.

ADD REPLY • link 11.0 years ago by Woa ★ 2.9k