Question: Taking Out Specific Columns From An Alignment
gravatar for figo
7.8 years ago by
figo200 wrote:

Hi All,

I wanted to write a script to remove certain columns which are in array from an alignment fasta file. I wrote a perl script using bioperl which is like this

use Bio::Seq;
use Bio::SeqIO;
use Bio::AlignIO;
use Bio::Align::AlignI;
use Bio::SimpleAlign;
my $str = Bio::AlignIO->new('-file' => 'aligned.fasta');
my $aln = $str->next_aln();
$out = $aln->remove_columns([4,5]);
my $aio = Bio::AlignIO->new(-fh=>\*STDOUT,-format=>"fasta");

Input data (example file)


But I am having problem that the remove _columns function only works for 2 columns only when I try to put say (e.g [4,5,6]) it doesn't work. So any one can tell me to come up with a solution may be in perl not using bioperl or something else.

perl alignment • 3.4k views
ADD COMMENTlink modified 7.8 years ago by Sukhdeep Singh10k • written 7.8 years ago by figo200

Will you reformat your code so it can be read? Also can you provide a small piece of data to test?

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by Zev.Kronenberg11k

formatting fixed.

ADD REPLYlink written 7.8 years ago by brentp23k

looks like remove_columns function is used to remove certain types of alignments. From the API docs:

"Creates an aligment with columns removed corresponding to the specified criteria: 'match'|'weak'|'strong'|'mismatch'|'gaps'"

It doesn't remove column by position. There is a slice function to get subcolumns of the alignment. You can probably use that to come up with something.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by Damian Kao15k

The perl solutions below will work if you want to remove the same character by position from each sequence. But you should consider if you want to account for gaps that potentially could be inserted in the sequences after alignment.

ADD REPLYlink written 7.8 years ago by Damian Kao15k

this post was cited in

Multiple auto- and allopolyploidisations marked the Pleistocene history of the widespread Eurasian steppe plant Astragalus onobrychis (Fabaceae)

ADD REPLYlink written 12 months ago by Pierre Lindenbaum129k
gravatar for JC
7.8 years ago by
JC11k wrote:

Another solution in Perl considering multi-line fasta:

use strict;
use warnings;

$ARGV[1] or die "use COLUMNS FASTA\n";
my @cols = split (/,/, $ARGV[0]);
@cols = sort {$b <=> $a} @cols;
open FA, "$ARGV[1]" or die;
$/ = "\n>";
while (<FA>) {
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    foreach my $pos (@cols) {
        substr ($seq, $pos - 1, 1) = '';
    print ">$id\n";
    while ($seq) {
         print substr($seq, 0, 80);
         print "\n";
         substr($seq, 0, 80) = '';
close FA;

Testing with your data:

$ perl 4,5,6 test.fa 
ADD COMMENTlink written 7.8 years ago by JC11k

I like your use of substring.

ADD REPLYlink written 7.8 years ago by Zev.Kronenberg11k

How can I send these removed columns into another file. I tried to modify this script, but it is printing up to the column number I have given.

ADD REPLYlink written 5.4 years ago by venu6.7k
gravatar for Zev.Kronenberg
7.8 years ago by
United States
Zev.Kronenberg11k wrote:

If your sequences don't span multiple lines this will work. If they do you can change the input record separator '$/ '

The script:

use strict;
use warnings;
use Getopt::Long;

#----------------------------------- MAIN ------------------------------------                                                                      
my $usage = "                                                                                                                                       


REMOVE_COLUMNS_MULTI_FASTA -c 1,2,3,4 my.multi.fasta                                                                                                


-c is not zero based!                                                                                                                               


my ($help);
my $columns;
my $opt_success = GetOptions('help'      => \$help,
                             'columns=s' => \$columns );

die $usage if $help || ! $opt_success;

my $file = shift;
die $usage unless $file;

my @columns = map {$_ -= 1} split /,/, $columns;

open (my $IN, '<', $file) or die "Can't open $file for reading\n$!\n";

while (my $line = <$IN>) {
    chomp $line;
    if ($line =~ /^>/){
        print "$line\n"
        my @seq = split //, $line;
        foreach my $c (@columns){
            delete $seq[$c];
        my @p_seq = grep { defined } @seq;
        print join '', @p_seq;
        print "\n";


perl REMOVE_COLUMNS_MULTI_FASTA -c 1,5,18 example.fasta


ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Zev.Kronenberg11k
gravatar for Sukhdeep Singh
7.8 years ago by
Sukhdeep Singh10k wrote:

Just for fun, a solution in R. Ofcourse, Perl/Python is faster but another solution in R.

This is slow, because of for loop and can be improved a bit but will be stil slow.

# Usage : Rscript removeColFas.R file.fa 2,3,5,9

# reading file in

# reading the fas columns to be removed

# splitting the characters and removing the non-desired columns

# reordering  the data back
count=0;for(i in seq(2,nrow(fas),by=2)){count=count+1;fas[i,]<-d[count]}

# writing the file out


ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Sukhdeep Singh10k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 746 users visited in the last hour