Question: Fasta alignment to Tajima's D and population statistics
1
gravatar for matthew.m.hernandez
3.3 years ago by
matthew.m.hernandez30 wrote:

Hi there,

I'm a newbie who's trying to make use of the Bio::PopGen packages to analyze measurements of diversity in different populations of viral quasispecies. I can turn to DNAsp on Windows to analyze different FASTA alignments. This is straightforward and easy but takes a long time for 100 different alignments.

I've tried turning to BioPerl to do this and have written a script to take a fasta alignment and run Tajima's analyses on it to return statistics. It runs fine, but I am getting very different values for certain metrics than what I would get on DNAsp.

Here is my code (my input file is a fasta formatted alignment file):

#!/usr/bin/perl                                                                                                                                                                                    
use warnings;
use strict;
use Bio::AlignIO;
use Bio::PopGen::IO;
use Bio::PopGen::Statistics;
use Bio::PopGen::Utilities;

my $infile = shift;

my $in = Bio::AlignIO->new(-format => 'fasta', -file => $infile);
my $aln;
if($aln = $in->next_aln){
}

my $pop = Bio::PopGen::Utilities->aln_to_population(-alignment=>$aln, -include_monomorphic =>1);

my $pi = Bio::PopGen::Statistics->pi($pop);
my $theta = Bio::PopGen::Statistics->theta($pop);
my $D = Bio::PopGen::Statistics->tajima_D($pop);
my $segsites = Bio::PopGen::Statistics->segregating_sites_count($pop);

print "Pi = $pi"."\n"."Theta = $theta"."\n"."Tajima's D = $D"."\n"."SegSites = $segsites"."\n";

If I compare the results from DNAsp and Bioperl I get the following:

DNAsp

Pi: 0.00020 Theta (per site): 0.00958 Tajima's D: -2.21621 SegSites: 18

BioPerl

Pi: 0.04944 Theta: 2.16841 Tajima's D: -2.21623 SegSites: 18

Does anyone know why Pi and Theta are so different? Moreover, is there any way that I can get it to spit out the number of sequences in the alignment file and measure the statistical significance of the Tajima's D-test?

Thanks,

Matt

ADD COMMENTlink written 3.3 years ago by matthew.m.hernandez30
3

My guess but DNASP gives per site value (divided by the length of ungapped alignment) whereas BioPerl seems go give added diversity for the whole sequence. Can you divide by the ungapped alignment length and see if the values start matching ?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by microfuge1.5k

So this makes sense for the Pi value! However, this doesn't work for the theta statistic (2.16841/249 = 0.00871 which is pretty off from 0.00958). Of note, my sequence length is 249bp.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by matthew.m.hernandez30
1

You can use $aln->no_sequences to get the number of sequences and I agree with the other comment about the statistics. Dividing Theta and Pi by the length works out. I think the different implementation of the programs and how they deal with floating point numbers accounts for the slight variation. I'd be pretty confident to have that level of agreement.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by SES8.2k

Thanks for addressing the number of sequences flag! However, I'm still concerned with the theta value. Do you still have confidence in a difference between 0.00871 and 0.00958?

ADD REPLYlink written 3.3 years ago by matthew.m.hernandez30

No, I don't and would not want to speculate until I looked into what both are doing exactly. It could be an artifact of the floating point issue I mentioned or it could be differences in what is calculated, and it's important to know which. You can look at the documentation to dive into what is being calculated for BioPerl (or the code, though it's obviously very dense and mathematical in nature). Hope that helps!

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by SES8.2k

Ahh, gotcha. Thanks for the insight!

ADD REPLYlink written 3.3 years ago by matthew.m.hernandez30

Matthew: Did you figure out the answers to your questions and resolve it to your satisfaction so that you may now have a version of your Perl script that can be shared for pop gen analyses? Could you let us know please. Thanks!

ADD REPLYlink written 2.5 years ago by Anand Rao250
1

Sorry for the late response Anand. I did not figure out what was going on with the BioPerl analysis. I ended up just sticking to DNAsp :-/

ADD REPLYlink written 19 months ago by matthew.m.hernandez30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 758 users visited in the last hour