Question

Rarefaction/Saturation Curve Based On Ngs Data

6

14.6 years ago

Biogenomics ▴ 60

Hi all,

This is most likely a simple question, but I'm looking for a tool (software or a Python/Perl/R script) that would produce a rarefaction curve from an assembly file (ACE format would be easiest) to assess the number of reads needed to recover all observed contigs (cf. species diversity indices). This would most likely be done by sampling reads from the ACE file and aligning them to the assembled contigs. I am interested in comparing such rarefaction curves for data produced from normalized and non-normalized libraries.

Alternatively, what approach would you use to automate (or semi-automate) such a task?

Thanks,

Greg

updated 10.5 years ago by mikhail.shugay 3.5k • written 14.6 years ago by Biogenomics ▴ 60

0

Hello Greg, were you able to find a tool or script for this analysis? If so, could you let us know which one?

11.7 years ago by Prakki Rama ★ 2.7k

score 2 · Answer 1 · 2011-03-01

2

14.4 years ago

Casey Bergman 18k

As you allude to, your problem is related to species richness calculations, so you could try posing it in those terms (contigs as "species", reads as "individuals") and use the rarefaction functions in a metagenomics suite like mothur. Another option would be to borrow functions from the mothur source code and adapt them to your problem.

0

Mothur really works! I like it.

14.4 years ago by Jarretinha 3.5k

0

Hi Jarretinha, I am new to this kind of analysis. If possible, could you let us know how mothur can be used to plot a saturation curve of the number of genes against the number of reads?

11.7 years ago by Prakki Rama ★ 2.7k

Ram · Answer 2 · 2015-01-07

As this topic was raised again, I would recommend reading Colwell et al. on the subject. I would also like to ask: what is the input format? If you have a simple frequency table, say

150 genes have 1 read

100 genes have 2 reads

...

1 gene has 6534 reads

...

1 gene has 20000 reads

I could share some code to build those rarefaction curves (and I think there are also plenty of ecology-related software packages). Or you can adapt the code from here: https://github.com/mikessh/vdjtools/blob/master/src/main/groovy/com/antigenomics/vdjtools/diversity/ChaoEstimator.groovy
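
For anyone landing here later, below is a minimal Python sketch of how a rarefaction curve could be computed from a frequency table like the one above. It is not the vdjtools code linked above; it uses the classical analytic rarefaction formula (expected number of genes seen when m of the N reads are drawn without replacement). The function names, the dict layout `{reads_per_gene: number_of_genes}` and the example numbers are assumptions made for illustration only.

```python
from math import lgamma, exp


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def rarefaction_curve(freq_table, depths):
    """Expected number of genes observed at each subsampling depth.

    freq_table: dict mapping reads-per-gene -> number of genes with that count
                (hypothetical layout, e.g. {1: 150, 2: 100}).
    depths:     iterable of subsample sizes m, each with 0 <= m <= total reads.
    Returns a list of (m, expected_genes) tuples.
    """
    total_reads = sum(k * f for k, f in freq_table.items())
    observed = sum(freq_table.values())
    curve = []
    for m in depths:
        if not 0 <= m <= total_reads:
            raise ValueError(f"depth {m} is outside 0..{total_reads}")
        log_total = log_binom(total_reads, m)
        missed = 0.0
        for k, f in freq_table.items():
            # A gene with k reads is missed with probability
            # C(N - k, m) / C(N, m); that is zero when N - k < m.
            if total_reads - k >= m:
                missed += f * exp(log_binom(total_reads - k, m) - log_total)
        curve.append((m, observed - missed))
    return curve


if __name__ == "__main__":
    # Made-up toy table, not real data.
    table = {1: 150, 2: 100, 10: 5}
    n_reads = sum(k * f for k, f in table.items())
    step = max(1, n_reads // 10)
    for m, genes in rarefaction_curve(table, range(0, n_reads + 1, step)):
        print(m, round(genes, 2))
```

Plotting the resulting (m, expected genes) pairs gives the saturation curve; computing it separately for the normalized and non-normalized libraries would allow the comparison asked about in the question. The analytic formula avoids repeated random subsampling, though random subsampling of reads would work as well.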