Question: Taking Out Specific Columns From An Alignment
1
gravatar for figo
7.0 years ago by
figo200
figo200 wrote:

Hi All,

I wanted to write a script to remove certain columns which are in array from an alignment fasta file. I wrote a perl script using bioperl which is like this

#!/usr/bin/perl
use Bio::Seq;
use Bio::SeqIO;
use Bio::AlignIO;
use Bio::Align::AlignI;
use Bio::SimpleAlign;
my $str = Bio::AlignIO->new('-file' => 'aligned.fasta');
my $aln = $str->next_aln();
$out = $aln->remove_columns([4,5]);
my $aio = Bio::AlignIO->new(-fh=>\*STDOUT,-format=>"fasta");
$aio->write_aln($out);

Input data (example file)

>a_1
MEKFGMNFGGGPSKKDLL
>a_2
MEKFGMNFGGGPSKKDLL
>a_3
MEKFGMNFGGGPSKKDLL

But I am having problem that the remove _columns function only works for 2 columns only when I try to put say (e.g [4,5,6]) it doesn't work. So any one can tell me to come up with a solution may be in perl not using bioperl or something else.

perl alignment • 3.1k views
ADD COMMENTlink modified 7.0 years ago by Sukhdeep Singh9.9k • written 7.0 years ago by figo200
1

Will you reformat your code so it can be read? Also can you provide a small piece of data to test?

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Zev.Kronenberg11k
1

formatting fixed.

ADD REPLYlink written 7.0 years ago by brentp23k

looks like remove_columns function is used to remove certain types of alignments. From the API docs:

"Creates an aligment with columns removed corresponding to the specified criteria: 'match'|'weak'|'strong'|'mismatch'|'gaps'"

It doesn't remove column by position. There is a slice function to get subcolumns of the alignment. You can probably use that to come up with something.

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Damian Kao15k

The perl solutions below will work if you want to remove the same character by position from each sequence. But you should consider if you want to account for gaps that potentially could be inserted in the sequences after alignment.

ADD REPLYlink written 7.0 years ago by Damian Kao15k

this post was cited in

https://www.sciencedirect.com/science/article/pii/S1055790319301903

Multiple auto- and allopolyploidisations marked the Pleistocene history of the widespread Eurasian steppe plant Astragalus onobrychis (Fabaceae)

ADD REPLYlink written 11 weeks ago by Pierre Lindenbaum123k
2
gravatar for JC
7.0 years ago by
JC8.7k
Mexico
JC8.7k wrote:

Another solution in Perl considering multi-line fasta:

#!/usr/bin/perl
use strict;
use warnings;

$ARGV[1] or die "use deleteAlignColumn.pl COLUMNS FASTA\n";
my @cols = split (/,/, $ARGV[0]);
@cols = sort {$b <=> $a} @cols;
open FA, "$ARGV[1]" or die;
$/ = "\n>";
while (<FA>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    foreach my $pos (@cols) {
        substr ($seq, $pos - 1, 1) = '';
    }
    print ">$id\n";
    while ($seq) {
         print substr($seq, 0, 80);
         print "\n";
         substr($seq, 0, 80) = '';
    }
}
close FA;

Testing with your data:

$ perl deleteAlignColumn.pl 4,5,6 test.fa 
>a_1
MEKNFGGGPSKKDLL
>a_2
MEKNFGGGPSKKDLL
>a_3
MEKNFGGGPSKKDLL
ADD COMMENTlink written 7.0 years ago by JC8.7k

I like your use of substring.

ADD REPLYlink written 7.0 years ago by Zev.Kronenberg11k

How can I send these removed columns into another file. I tried to modify this script, but it is printing up to the column number I have given.

ADD REPLYlink written 4.5 years ago by venu6.3k
1
gravatar for Zev.Kronenberg
7.0 years ago by
United States
Zev.Kronenberg11k wrote:

If your sequences don't span multiple lines this will work. If they do you can change the input record separator '$/ '

The script:

#!/usr/bin/perl                                                                                                                              
use strict;
use warnings;
use Getopt::Long;



#-----------------------------------------------------------------------------                                                                      
#----------------------------------- MAIN ------------------------------------                                                                      
#-----------------------------------------------------------------------------                                                                      
my $usage = "                                                                                                                                       

Synopsis:                                                                                                                                           

REMOVE_COLUMNS_MULTI_FASTA -c 1,2,3,4 my.multi.fasta                                                                                                

Description:                                                                                                                                        

-c is not zero based!                                                                                                                               

";


my ($help);
my $columns;
my $opt_success = GetOptions('help'      => \$help,
                             'columns=s' => \$columns );

die $usage if $help || ! $opt_success;

my $file = shift;
die $usage unless $file;

my @columns = map {$_ -= 1} split /,/, $columns;

open (my $IN, '<', $file) or die "Can't open $file for reading\n$!\n";

while (my $line = <$IN>) {
    chomp $line;
    if ($line =~ /^>/){
        print "$line\n"
    }
    else{
        my @seq = split //, $line;
        foreach my $c (@columns){
            delete $seq[$c];
        }
        my @p_seq = grep { defined } @seq;
        print join '', @p_seq;
        print "\n";
    }
}

Usage:

perl REMOVE_COLUMNS_MULTI_FASTA -c 1,5,18 example.fasta

Output:

>a_1
EKFMNFGGGPSKKDL
>a_2
EKFMNFGGGPSKKDL
>a_3
EKFMNFGGGPSKKDL
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Zev.Kronenberg11k
1
gravatar for Sukhdeep Singh
7.0 years ago by
Sukhdeep Singh9.9k
Netherlands
Sukhdeep Singh9.9k wrote:

Just for fun, a solution in R. Ofcourse, Perl/Python is faster but another solution in R.

This is slow, because of for loop and can be improved a bit but will be stil slow.

# Usage : Rscript removeColFas.R file.fa 2,3,5,9

# reading file in
fas=read.csv(commandArgs(TRUE)[1],header=F,stringsAsFactors=FALSE)

# reading the fas columns to be removed
rem=as.numeric(c(strsplit((commandArgs(TRUE)[2]),',')[[1]]))

# splitting the characters and removing the non-desired columns
d=unlist(lapply(as.character(fas[seq(2,nrow(fas),by=2),]),function(x){paste(strsplit(x,'')[[1]][-rem],collapse='')}))

# reordering  the data back
count=0;for(i in seq(2,nrow(fas),by=2)){count=count+1;fas[i,]<-d[count]}

# writing the file out
write.table(fas,paste(commandArgs(TRUE)[1],"_edited.fa",sep=''),quote=FALSE,col.names=FALSE,row.names=FALSE)

Cheers

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Sukhdeep Singh9.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1137 users visited in the last hour