Question: Parsing Ace File To Fasta Formated Alignement
1
gravatar for Eric Normandeau
9.5 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Hi,

Following alignment of 454 data, I want to convert an ACE file containing contig data to a FASTA file containing the 'aligned' consensus sequences. By aligned, I mean that each sequence within a contig has the same length. Characters (-) are added as needed on each side of a sequence in order to make it the same length as the consensus sequence from it's contig.

What method would you suggest?

Thanks!

fasta conversion parsing • 6.0k views
ADD COMMENTlink modified 9.5 years ago by Federico Giorgi520 • written 9.5 years ago by Eric Normandeau10k

Can you rephrase your question? This does not really make sense to me.

ADD REPLYlink written 9.5 years ago by Michael Dondrup46k

Ah now I get it: you want to convert the output of a sequence assembly (ACE assembly file format) into a FASTA file containing the consensus sequences, right?

ADD REPLYlink written 9.5 years ago by Michael Dondrup46k

Right :) I rephrase to make a lesser mouthful!

ADD REPLYlink written 9.5 years ago by Eric Normandeau10k
3
gravatar for Jarretinha
9.5 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

Another quick and dirty bioperl solution. Well, not so quick. This script is quite sequential.

use Bio::Assembly;
use Bio::SeqIO;
use Bio::Seq;

$infile = shift or die;
$outfile = shift or die;

$seqio_obj = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta' );

$assembly_obj = Bio::Assembly::IO->new(-file=>"<$infile", -format=>'ace');

$assembly = $assembly_obj->next_assembly;

foreach $contig ($assembly->all_contigs) {

    $seq = $contig->get_consensus_sequence()

    $seq_obj = Bio::Seq->new(-seq => $seq->seq(),                        
                          -display_id => "ID_is_always_good",                        
                          -desc => "Say_smth_about_it",                        
                          -alphabet => "dna" );

    $seqio_obj->write_seq($seq_obj);

}

I'm not sure if it works cause I'm without any ACE files. Of course, you can write your own parser as ACE files are very simple in structure, as one can see here. Check out Bio::Assembly methods. There a lot of ready-to-use utilities for size, quality, features, etc. I'll check for a biopython solution.

ADD COMMENTlink modified 13 months ago by RamRS24k • written 9.5 years ago by Jarretinha3.3k
2
gravatar for Eric Normandeau
9.5 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

@Jarretinha,

Thanks for your BioPerl solution! It gave me the urge to look at Biopython could do for me, since I only speak snake...

Here is what I found:

from Bio.Sequencing import Ace
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

with open(output_file, "w") as output_file:
    while 1:
        try:
            contig = ace_gen.next()
        except:
            print "***All contigs treated***"
            break
        align = Alignment(Gapped(IUPAC.ambiguous_dna, "-"))
        align.add_sequencecontig.name, contig.sequence)
        for readn in xrange(len(contig.reads)):
            clipst = contig.reads[readn].qa.qual_clipping_start
            clipe = contig.reads[readn].qa.qual_clipping_end
            start = contig.af[readn].padded_start
            seq = cut_ends(contig.reads[readn].rd.sequence, clipst, clipe)
            seq = pad_read(seq, start, len(contig.sequence))
        sequences = read_fasta(align.format("fasta"))
        contig_name = re.findall("(Contig_[0-9]+)", sequences[0][0])[0]
        # Put your code here to work with the contig's sequences

I removed a lot of comments from the code and added a few features. The original example can be found HERE among the Biopython pages.

Thanks again!

ADD COMMENTlink modified 13 months ago by RamRS24k • written 9.5 years ago by Eric Normandeau10k
2
gravatar for Ketil
8.6 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

Some time ago I wrote a few tools to extract stuff from ACE files, including the contigs + quality, the assembly as Fasta with '-' for gaps (what you're asking for), and also the clusters (as list of input sequences, a la TGICL output).

Drop me a mail if you're interested.

ADD COMMENTlink modified 8 weeks ago by RamRS24k • written 8.6 years ago by Ketil4.0k
1
gravatar for Federico Giorgi
8.0 years ago by
Columbia University
Federico Giorgi520 wrote:

Although elegant, the Bioperl/Biopython solutions are slow and tend to keep too many contig objects in memory. A simple ACE to Fasta perl extractor (assuming you want the contig sequences) would be this:

#!/usr/bin/perl
use strict;
use warnings;


# CO contig00001 67140 1618 1666 U

my $infile = $ARGV[0];
my $outfile = $ARGV[1];

open INPUT, $infile or die $!;
open OUTPUT, ">$outfile" or die $!;

my $waitForHeader = 1;

while (my $line = <INPUT>) {
    if ($waitForHeader) {
        if ($line=~"^CO") {
            my @splitter = split (" ",$line);
            print OUTPUT ">"."$splitter[1]\n";
            $waitForHeader = 0;
        }
        else {
            next;
        }
    }
    else {
        if ($line=~"^BQ") {
            $waitForHeader=1;
        }
        else {
            unless ($line eq "\n") {
                $line=~s/\*/-/g;
                print OUTPUT $line;
            }    
        }
    }
}

close INPUT;
close OUTPUT;
ADD COMMENTlink modified 13 months ago by RamRS24k • written 8.0 years ago by Federico Giorgi520
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2036 users visited in the last hour