Question: HMP Abundance Table (HMSCP) Interpretation for Calculate Taxa Abundance
1
3.8 years ago by
felipepratalima10 wrote:

Hi,

I just looked at the shotgun community profiling files from HMP (http://hmpdacc.org/HMSCP/) and verified the following data structure (http://downloads.hmpdacc.org/data/HMSCP/SRS011061/SRS011061_abundance_table.tsv.bz2):

Reference Name Group Depth Breadth Total reference bases Bacteroides cellulosilyticus DSM 14838 Bacteria 172.513735774429 78.2996901104743 6726268

Is there any formula to calculate the number of reads which were assigned to Bacteroides cellulosilyticus DSM 14838?

According to What Is The Sequencing 'Depth' ?, to depth coverage is described as: "3. the empirical average "depth-of-coverage" of an assembly: number of reads * read length / assembly size"

Asssuming D the depth-of-coverage, N the number of reads, L the read length and S the assebmbly size, we can write this as:

D = N * L / S

and conclude that the number of reads can be calculated by:

N = D * S / L

But, if we take the showed data about Bacteroides cellulosilyticus DSM 14838 into this formula we get:

N = 172.513735774429 * 6726268 / 100 # 100 is the read length

N = 11603736

If yes, then I have another problem. Looking at the metric file (http://downloads.hmpdacc.org/data/HMSCP/SRS011061/SRS011061_metric.txt.bz2) I have that my dataset has 51,038,355 reads:

Sample Name Number of reads passing low-complexity filter Number of pairs passing low-complexity filter Number of reads that align to reference SRS011061 89924337 26645245 51038355

And when I calculate the number of reads for each taxa in the abundance table and sum it I get 45,229,694 reads. So, how can be explained this difference? Can 5,808,661 reads (the difference) be discarded by any reason?

"Additionally we built strain abundance tables on a per-­‐sample basis, showing the depth and breadth of coverage of each strain in our database. We reported all strains meeting a coverage cutoff of 0.01x depth across 1% breadth of each strain's reference genome."

Could this reads have been eliminated by this criterion?

Thank you, Felipe.

metagenomics • 1.3k views
modified 3.2 years ago by lotus2860 • written 3.8 years ago by felipepratalima10
0
3.2 years ago by
lotus2860
lotus2860 wrote:

I believe the reason of the difference is that the read length mentioned at HMP site is not right.
If it is not 101 as claimed but 90-92, then there is no difference. Have you checked the BAM file?