Question: Convert Sequence Alignment Into A Prettybase Polymorphic Sites Table And Calculate Pi, Theta, And Tajima'S D Using Bioperl?
0
gravatar for anin.gregory
6.0 years ago by
anin.gregory80
United States
anin.gregory80 wrote:

Hi I am new to this forum! I am trying to pull all the Polymorphic Sites (SNPs) out of a Fasta or Clustal alignment and create a table in Prettybase format (http://genapps.uchicago.edu:8081/maxdip/sample.pretty), so I can calculate pi, theta, and tajima's D.

Fasta Alignment:
>A1
ATGCA
>A2
ATGCG
>A3
ATGCA
>A4
ATCGG

Convert to (position \t individual \t allele) - PRETTYBASE format:
01    A1    A
01    A2    A
01    A3    A
01    A4    A
02    A1    T
02    A2    T
02    A3    T
02    A4    T
03    A1    G
03    A2    G
03    A3    G
03    A4    C
04    A1    C
04    A2    C
04    A3    C
04    A4    G
05    A1    A
05    A2    G
05    A3    A
05    A4    G

I am am fairly proficient in BioPerl, but have come to a wall. I have looked at all the CPAN and BioPerl websites about the different modules and I can't figure out how to do it? Any help would be great! Or do you another way to calculate pi, theta, and tajima's D from alignment data?

Here is my script so far:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::SearchIO;
use Bio::SeqIO;

my $seqio  = Bio::SeqIO->new(-format => 'fasta', -file   => "$ARGV[0]");

while (my $seqobj = $seqio->next_seq) {
        my $seqid = $seqobj->display_id;
        my $nuc = $seqobj->seq();
        push (my @ids, $seqid);
        push (my @nucs, $nuc);
        my @spl = split(//, $nucs[0]);
    my $count=1;
    my $count2=1;
    my $count3=1;
    my $count4=1;
        foreach my $i (@spl) {
         my $position = $count++;
                my $individual = $ids[0];
            print $position."\t".$individual."\t".$i."\n";
        }
        my @spl2 = split(//, $nucs[1]);
        foreach my $k (@spl2) {
                my $position2 = $count2++;
                my $individual2 = $ids[1];
            print $position2."\t".$individual2."\t".$k."\n";
        }
        my @spl3 = split(//, $nucs[2]);
        foreach my $j (@spl3) {
                my $position3 = $count3++;
                my $individual3 = $ids[2];
            print $position3."\t".$individual3."\t".$j."\n";
        }
        my @spl4 = split(//, $nucs[3]);
        foreach my $l (@spl4) {
                my $position4 = $count4++;
                my $individual4 = $ids[3];
            print $position4."\t".$individual4."\t".$l."\n";
        }
}
perl alignment snps bioperl • 3.1k views
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by anin.gregory80
2
gravatar for anin.gregory
6.0 years ago by
anin.gregory80
United States
anin.gregory80 wrote:

Figured it out! The script above I edited and is correct for 4 sequences...you will need to modify it if you are comparing more than 4 sequences.

Once you have your prettybase format, you can pipe the prettybase format into this script to calculate pi, theta and tajima's D:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::PopGen::IO;
use Bio::PopGen::Statistics;

my $pop =Bio::PopGen::IO->new(-format => "prettybase", -file => $ARGV[0])->next_population;
my $pi = Bio::PopGen::Statistics->pi($pop);
my $theta = Bio::PopGen::Statistics->theta($pop);
my $D = Bio::PopGen::Statistics->tajima_D($pop);
print "Pi = $pi"."\n"."Theta = $theta"."\n"."Tajima's D = $D"."\n";
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by anin.gregory80

+1 Great! Thanks for posting your solution.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by SES8.2k

Excellent script, would be really useful to determine nr of sequences automatically. I did have a problem however, the Tajima D value never printed, not sure what the error was:

@BioPower3-IBM ~/nosema/annotation/diversity/no_stop_codons $ perl trial/Fasta2Prettybase.pl M715_870002501/aligned_nt.fasta > tmp1; perl trial/Prettybase2ThetaPiTajima.pl tmp1; rm -rf tmp1
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 3.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 3.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 3.
Use of uninitialized value $D in concatenation (.) or string at trial/Prettybase2ThetaPiTajima.pl line 11.
Pi = 383.333333333335
Theta = 364.666666666667
Tajima's D =

This all may be because I did not change the script but only had 3 sequences in my alignment.

Would be also very useful to merge the 2 scripts into 1. Too bad I am not pro efficient in perl.

ADD REPLYlink written 4.3 years ago by Adrian Pelin2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 832 users visited in the last hour