How to get ambiguous sequence from ABIF file with perl ?
1
0
Entering edit mode
7.2 years ago
jawhar.saks ▴ 10

Hello, i'm working on a little perl code that is supposed to get me the ambiguous sequence from .ab1 files. I want to get from the .ab1 the sequence with IUPAC codes.

Didn't found any solution to do it easily.

Read about Staden, but don't think it's what i need. Read about BioPerl, Bio::SeqIO, Bio;;Trace;;ABIF and couldn't find it if it exists...

I started to think about doing it myself from data. If we can do it when visualysing the graph, then we can do it on computer.

I first tryed to get the raw data. I tryed :

#!/bin/perl

use strict;
use warnings; 
use Bio::SeqIO;
use Bio::Trace::ABIF;
my $ab1 = Bio::Trace::ABIF->new();
$ab1->open_abif("$ARGV[0]");
print $ab1->sample_name(),"\n";
my @qualite = $ab1->quality_values();
my $sequenceab1 = $ab1->sequence();
my $LSeq = $ab1->sequence_length();

my $sequenceab2 = $ab1->edited_sequence(); #Tryed to search here, but not that...

print $ab1->raw_data_for_channel(1); #Gives me numbers but with no space... 
print $ab1->raw_data_for_channel(2);
print $ab1->raw_data_for_channel(3);
print $ab1->raw_data_for_channel(4);

open (FICHIER, ">$ARGV[2]/sortie.txt") || die ("Vous ne pouvez pas créer le fichier \"$ARGV[2]/sortie.txt\"");
print FICHIER $ab1->raw_trace('A');
print FICHIER $ab1->raw_trace('C');
print FICHIER $ab1->raw_trace('G');
print FICHIER $ab1->raw_trace('T');
close (FICHIER); #Same problem, no space...

I wandered a lot on different forum and sites, and even found some good things to try over here. That's why i decided to ask help here.

I tryed this i saw here : Exporting Raw Trace Data I tryed to implement something like that in my code, it results in :

    #!/bin/perl
    use strict;
    use warnings; 
    use Bio::SeqIO;
    use Bio::Trace::ABIF;
    use Array::Transpose; 
my $ab1 = Bio::Trace::ABIF->new();
$ab1->open_abif("$ARGV[0]");
open (FICHIER, ">$ARGV[2]/sortie.txt") || die ("Vous ne pouvez pas créer le fichier \"$ARGV[2]/sortie.txt\"");
print FICHIER foreach Transpose ([map {[$_,$ab1->raw_trace($_)]} qw(A T G C)]);
close (FICHIER);

But not working, can u explain to me a bit where i'm mistaken ?

I tryed this :

use Array::Transpose; 
my $abi;
BEGIN{$abi=Bio::Trace::ABIF->new($_);
$,="\t",$\="\n"};
$abi->open_abif(@ARGV) or die "can not open";
print FICHIER foreach transpose ([map {[$_,$abi->raw_trace($_)]} qw(A T G C)]);

But look like it makes perl freeze...

And the result is odd i got things like this :

ARRAY(0x3d97958)

But something nice, For the lines : print FICHIER $ab1->raw_trace('T'); The result now give spaces between values...

I'm not sure i get it all ...

Is there any documents that could help me ? I saw the documents of abif files, but i didn't understood it XD...

It's possible to get peak values, and after i was thinking of turning this results in percentages and if a base got more than 20% then it is a possible base and it will result in calling to the IUPAC code according of the basis found. Is it a good approach ?

Or should i modify the datas with baseline reductions or things like that ?

I hope i'm not at the wrong place. If i made a mistake or didn't respect the rules, please warn me to be able to correct myself. And i thank you for reading all this.

ambiguous abif perl sequence bioperl • 2.9k views
ADD COMMENT
2
Entering edit mode

As Shenwei stated, it's best to keep threads organized by adding replies as comments rather than answers. This is obviously best for the community, but it's also in your interest, as a question with 5 answers will attract fewer responses than a question with zero answers. I've moved your answers to comments. In the future, please do not post a response as an answer unless you believe it resolves the original question, to the point that in your opinion, further answers are unnecessary.

ADD REPLY
0
Entering edit mode

Oh, ok, thank you, i didn't get it proprely, i thoug i had to answer under people comments, but didn't know i couldn't bring new stuff in the main thread. I'll proceed as you say.

I hope i'll be able to post soon a response as an answer and put an end to this thread =D. Thank you.

ADD REPLY
0
Entering edit mode

I'm a bit confused, why the posts aren't sorted by date ?

I mean, on my screen got first some recent answers, then the oldest ones and the rest is sorted...

And my answer to shenwei isn't under his message, why is that ?

ADD REPLY
1
Entering edit mode

You didn't "answer" my answer, it became a independent one. To comment a answer, please click ADD COMMENT below it. To reply a comment, click the ADD REPLY.

Comments of same level are sorted by number of up-vote and then time.

Very glad to see you explore the solution by your own. But you continued bring new answers that I can't catch up with. :P

ADD REPLY
0
Entering edit mode

Oh yes, my bad, wasn't looking at the good comment.

I used what you told me about arrays, it made me understood that data i was seeking where in a chart. I used Transpose to get them called.

Pasted my code under your answer.

ps : Why can't u catch up with my answers ?

ADD REPLY
1
Entering edit mode

Well i start geting a bit tired for today, i think i'll end up on this.

I found that Phred may be the answer : https://en.wikipedia.org/wiki/Phred_quality_score#/media/File:Phred_Figure_1.jpg

And maybe a solution for perl use : http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-5-60

What do you think of it ? Can someone please tell me if i'm on a good path or not ?

ADD REPLY
0
Entering edit mode

Wow so fast to answer, thank you.

The first link is Bio.SeqIO for python, i think it's the same module than Bio::SeqIO of perl.

Well i know nothing in python, that's why i prefer doing it on perl. If not possible, i'll turn to python to try that, but i'll have to learn all python from the basics XD...

I saw the first link, and put it in memory for later.

The second link leads me to a question. What's the point converting in .scf file ? Isn't it the same ? What are the big advantages for me to use .scf ?

For the ARRAY(0x3d97958) , i don't get your answer. Where and how to dereference ?

ADD REPLY
1
Entering edit mode

Please click ADD COMMENT below the answer to reply it.

I'm not familiar with the .ab1 file. The sanger sequencing providers always give us the sequence file along with them. But according to your code, it seems could read the four channels. So you can continue to debug your Perl code.

For the ARRAY(0x3d97958), I mean the variable that contains ARRAY(0x3d97958) is a reference of an array. You may get the actual values by de-reference it using code like this:

use strict;

my @array = (1, 2, 3);
print "original array: @array\n";

my $ref = \@array;
print "reference of the array: $ref\n";

my @de_ref = @{$ref}; 
print "array the \$ref points to: @de_ref\n";

Result:

original array: 1 2 3
reference of the array: ARRAY(0x25f3948)
array the $ref points to: 1 2 3
ADD REPLY
0
Entering edit mode

Oh thank you, i will try that tonight, and i'll try to debug my code. I'll come back later.

And for my other questions ? Is there any one able to answer me ?

Is there a clear method to get ambiguous sequence ? A simple programme would be nice, or a module, but the theorical computings can be nice too. To know what to consider as a trace, and what is a real double peak...

ADD REPLY
0
Entering edit mode

Up, found out some things.

#!/bin/perl

use strict;
use warnings; 
use Bio::SeqIO;
use Bio::Trace::ABIF;
use Array::Transpose;

my $ab1 = Bio::Trace::ABIF->new();

BEGIN{$ab1 = Bio::Trace::ABIF->new($_);
$,="\t",$\="\n"};

$ab1->open_abif("$ARGV[0]") or die "Pb ouverture .ab1";


open (FICHIER, ">$ARGV[2]/sortie.txt");
print FICHIER @$_ foreach transpose ([map {[$_,$ab1->raw_trace($_)]} qw(A T G C)]);
close (FICHIER);    
$ab1->close_abif();

With this i can get the values in a nice format.

Now how can i exploit it properly ? Do you have any documentation to help me ?

ADD REPLY
0
Entering edit mode

Hi again, still progressing and got new stuff.

I just found this : Is there a way to access the data stored in a .ab1 file ?

It looks awsome, but cant find any equivalent for perl...

I found also here : http://bioconductor.org/packages/release/bioc/vignettes/sangerseqR/inst/doc/sangerseq_walkthrough.pdf

This :

The function essentially divides the sequence into a series of basecall windows and identifies the tallest peak for each fluorescence channel within the window. These peaks are converted to signal ratio to the tallest peak. A cutoff ratio is then applied to determine if a peak is signal or noise. Peaks below this ratio are ignored. Remaining peaks in each window are used to make primary and secondary basecalls.

Can it be that easy ? Get a ratio on the strongest signal, use the noise ratio to delete it. Then choose a percentage of validation ?

Oh and find nice things into .ab1 file :

$ab1->sample_score(); # 58 seams to be the average score of sample
$ab1->clear_range(); # give me 17 and 314, not sure about what it is yet, looks like 17 can be minimum for admissibility... Not far from the average score...
$ab1->noise(); #Gives me a ratio
$ab1->peak1_location(); #Gives me what appears to be the location of first pertinent peak

Am i correct about thoose elements ?

ADD REPLY
0
Entering edit mode

Hi everyone, i worked all day on that today too, and found out some good stuff, should i create another question for that ?

I'm willing to redirect the reflexion.

I found that POSA can be the solution, using phred score to get the ambiguous sequence, but i can't figure how to do that.

And in the exemples, there isn't one for my case or similar...

I managed to putt all datas in a Hash table, to be able to call easily the values for any postition of anybase easily.

But it looks like it's the graph data and not the calculated peaks data...

It means that is POSA isn't the solution, i'll have to calculate it and make an algorythm for it ... It really means that i'll have to open up my old math lessons about graph and statistics...

If you got any lead, i'll take it... Thank you for your time

ADD REPLY
0
Entering edit mode

Up, stil stuck on that problem...

How to get the iupac sequence from the graph datas ?...

Any one able to help me ?

ADD REPLY
0
Entering edit mode

I, tryed to do without that by using some software for the moment.

But i'm still interested in achieving it.

Is it possible to do it threw emboss or threw some python automatic module ?

ADD REPLY
1
Entering edit mode
7.2 years ago

Biopython support AB1 files, but not sure does it fulfill your requirement.

Similar posts:

For your code, ARRAY(0x3d97958) means it's a reference of an array, you need to dereference it by @{$ref}.

ADD COMMENT

Login before adding your answer.

Traffic: 4031 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6