Convert Sequence Alignment Into A Prettybase Polymorphic Sites Table And Calculate Pi, Theta, And Tajima'S D Using Bioperl?
1
0
Entering edit mode
11.0 years ago
anin.gregory ▴ 110

Hi I am new to this forum! I am trying to pull all the Polymorphic Sites (SNPs) out of a Fasta or Clustal alignment and create a table in Prettybase format (http://genapps.uchicago.edu:8081/maxdip/sample.pretty), so I can calculate pi, theta, and tajima's D.

Fasta Alignment:
>A1
ATGCA
>A2
ATGCG
>A3
ATGCA
>A4
ATCGG

Convert to (position \t individual \t allele) - PRETTYBASE format:
01    A1    A
01    A2    A
01    A3    A
01    A4    A
02    A1    T
02    A2    T
02    A3    T
02    A4    T
03    A1    G
03    A2    G
03    A3    G
03    A4    C
04    A1    C
04    A2    C
04    A3    C
04    A4    G
05    A1    A
05    A2    G
05    A3    A
05    A4    G

I am am fairly proficient in BioPerl, but have come to a wall. I have looked at all the CPAN and BioPerl websites about the different modules and I can't figure out how to do it? Any help would be great! Or do you another way to calculate pi, theta, and tajima's D from alignment data?

Here is my script so far:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::SearchIO;
use Bio::SeqIO;

my $seqio  = Bio::SeqIO->new(-format => 'fasta', -file   => "$ARGV[0]");

while (my $seqobj = $seqio->next_seq) {
        my $seqid = $seqobj->display_id;
        my $nuc = $seqobj->seq();
        push (my @ids, $seqid);
        push (my @nucs, $nuc);
        my @spl = split(//, $nucs[0]);
    my $count=1;
    my $count2=1;
    my $count3=1;
    my $count4=1;
        foreach my $i (@spl) {
         my $position = $count++;
                my $individual = $ids[0];
            print $position."\t".$individual."\t".$i."\n";
        }
        my @spl2 = split(//, $nucs[1]);
        foreach my $k (@spl2) {
                my $position2 = $count2++;
                my $individual2 = $ids[1];
            print $position2."\t".$individual2."\t".$k."\n";
        }
        my @spl3 = split(//, $nucs[2]);
        foreach my $j (@spl3) {
                my $position3 = $count3++;
                my $individual3 = $ids[2];
            print $position3."\t".$individual3."\t".$j."\n";
        }
        my @spl4 = split(//, $nucs[3]);
        foreach my $l (@spl4) {
                my $position4 = $count4++;
                my $individual4 = $ids[3];
            print $position4."\t".$individual4."\t".$l."\n";
        }
}
bioperl perl snps alignment • 4.6k views
ADD COMMENT
2
Entering edit mode
11.0 years ago
anin.gregory ▴ 110

Figured it out! The script above I edited and is correct for 4 sequences...you will need to modify it if you are comparing more than 4 sequences.

Once you have your prettybase format, you can pipe the prettybase format into this script to calculate pi, theta and tajima's D:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::PopGen::IO;
use Bio::PopGen::Statistics;

my $pop =Bio::PopGen::IO->new(-format => "prettybase", -file => $ARGV[0])->next_population;
my $pi = Bio::PopGen::Statistics->pi($pop);
my $theta = Bio::PopGen::Statistics->theta($pop);
my $D = Bio::PopGen::Statistics->tajima_D($pop);
print "Pi = $pi"."\n"."Theta = $theta"."\n"."Tajima's D = $D"."\n";
ADD COMMENT
0
Entering edit mode

+1 Great! Thanks for posting your solution.

ADD REPLY
0
Entering edit mode

Excellent script, would be really useful to determine nr of sequences automatically. I did have a problem however, the Tajima D value never printed, not sure what the error was:

@BioPower3-IBM ~/nosema/annotation/diversity/no_stop_codons $ perl trial/Fasta2Prettybase.pl M715_870002501/aligned_nt.fasta > tmp1; perl trial/Prettybase2ThetaPiTajima.pl tmp1; rm -rf tmp1
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 3.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 3.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 3.
Use of uninitialized value $D in concatenation (.) or string at trial/Prettybase2ThetaPiTajima.pl line 11.
Pi = 383.333333333335
Theta = 364.666666666667
Tajima's D =

This all may be because I did not change the script but only had 3 sequences in my alignment.

Would be also very useful to merge the 2 scripts into 1. Too bad I am not pro efficient in perl.

ADD REPLY

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6