Hi,
I just looked at the shotgun community profiling files from HMP (http://hmpdacc.org/HMSCP/) and verified the following data structure (http://downloads.hmpdacc.org/data/HMSCP/SRS011061/SRS011061_abundance_table.tsv.bz2):
Reference Name Group Depth Breadth Total reference bases Bacteroides cellulosilyticus DSM 14838 Bacteria 172.513735774429 78.2996901104743 6726268
Is there any formula to calculate the number of reads which were assigned to Bacteroides cellulosilyticus DSM 14838?
According to What Is The Sequencing 'Depth' ?, to depth coverage is described as: "3. the empirical average "depth-of-coverage" of an assembly: number of reads * read length / assembly size"
Asssuming D the depth-of-coverage, N the number of reads, L the read length and S the assebmbly size, we can write this as:
D = N * L / S
and conclude that the number of reads can be calculated by:
N = D * S / L
But, if we take the showed data about Bacteroides cellulosilyticus DSM 14838 into this formula we get:
N = 172.513735774429 * 6726268 / 100 # 100 is the read length
N = 11603736
Am I right about this? The number of reads were 11,603,736?
If yes, then I have another problem. Looking at the metric file (http://downloads.hmpdacc.org/data/HMSCP/SRS011061/SRS011061_metric.txt.bz2) I have that my dataset has 51,038,355 reads:
Sample Name Number of reads passing low-complexity filter Number of pairs passing low-complexity filter Number of reads that align to reference SRS011061 89924337 26645245 51038355
And when I calculate the number of reads for each taxa in the abundance table and sum it I get 45,229,694 reads. So, how can be explained this difference? Can 5,808,661 reads (the difference) be discarded by any reason?
"Additionally we built strain abundance tables on a per-‐sample basis, showing the depth and breadth of coverage of each strain in our database. We reported all strains meeting a coverage cutoff of 0.01x depth across 1% breadth of each strain's reference genome."
Could this reads have been eliminated by this criterion?
Thank you, Felipe.