Question

BBMap Statistics Evaluation

3.8 years ago

Ada ▴ 10

Hello, I would like help understanding the results below:

What is "%unambiguousReads" a fraction of? In other words, how is the fraction computed, and what determines the denominator? Also, if the value is 0.87, does that mean 87% or 0.87%?
What is "%ambiguousReads" a fraction of? How is it computed, and what determines the denominator?
What does assignedReads mean?
What does assignedBases mean?

What does the MB mean in unambiguousMB/ambiguousMB?

name    %unambiguousReads   unambiguousMB   %ambiguousReads ambiguousMB unambiguousReads    ambiguousReads  assignedReads   assignedBases
NC_009801.1 Escherichia coli O139:H28 str. E24377A, complete sequence   0.87367 7.1124  4.07912 33.2073 47416   221382  111291  16693650
NZ_GG773290.1 Escherichia coli MS 78-1 Scfld327, whole genome shotgun sequence  0.34082 2.77455 0.05644 0.45945 18497   3063    18747   2812050

alignment Assembly gene sequencing • 1.4k views

ADD COMMENT • link updated 3.8 years ago by Istvan Albert 100k • written 3.8 years ago by Ada ▴ 10

Since you are looking at two E. coli genomes, it is not surprising that the % unambiguous reads is very small. No aligner is going to be able to distinguish between very similar genomes of the same species, especially when short reads are being used. I am curious where the remaining ~95% of reads are, since these two lines do not account for them.

ADD REPLY • link 3.8 years ago by GenoMax 141k

score 0 · Answer 1 · 2020-07-07

I don't know what BBMap does specifically, but typically the denominator is the total number of reads, or the total number of mapped reads, depending on the circumstance.

In this case, it seems that the total number of reads was not reported in the statistics, hence we can't check that assumption.

I would expect that assigned means that the read could be mapped (assigned to a location).

I would expect that unambiguous means that a read maps to a single location.

I would expect that ambiguous means that a read maps equally well to more than one location.

MB means megabases (millions of bases).
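Even without the total read count, the row posted in the question allows a rough consistency check. The sketch below assumes that both % columns are percentages of the same total and that MB means 10^6 bases; neither assumption comes from the BBMap documentation. The variable names are made up for illustration, and the values are copied from the NC_009801.1 row.

```python
# Values copied from the NC_009801.1 row in the question.
pct_unambiguous = 0.87367     # %unambiguousReads column
unambiguous_reads = 47416     # unambiguousReads column
pct_ambiguous = 4.07912       # %ambiguousReads column
ambiguous_reads = 221382      # ambiguousReads column
unambiguous_mb = 7.1124       # unambiguousMB column

# Assumption: each % column is 100 * reads / (one shared total read count).
# If that holds, both columns should imply roughly the same total.
total_from_unambiguous = unambiguous_reads / (pct_unambiguous / 100)
total_from_ambiguous = ambiguous_reads / (pct_ambiguous / 100)

# Assumption: MB is millions of bases, so MB * 1e6 / reads = bases per read.
bases_per_read = unambiguous_mb * 1e6 / unambiguous_reads

print(f"total implied by %unambiguousReads: {total_from_unambiguous:,.0f}")
print(f"total implied by %ambiguousReads:   {total_from_ambiguous:,.0f}")
print(f"bases per unambiguous read:         {bases_per_read:.1f}")
```

Both columns imply a total of about 5.43 million reads, and the MB column works out to 150 bases per read. That is consistent with 0.87367 meaning 0.87367% (not 87%) of a single total read count, although the output alone cannot confirm what that total is.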

score 0 · Answer 2 · 2020-07-07

The total number of reads is reported in the BBMap/BBSplit stats; the OP has not included that information. A typical result looks like this:

Genome:                 1
Key Length:             13
Max Indel:              20
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             53236   (7985400 bases)

Mapping:                46.016 seconds.
Reads/sec:              1156.89
kBases/sec:             173.53


Pairing data:           pct pairs       num pairs       pct bases          num bases

mated pairs:             86.2236%           22951        86.2236%            6885300
bad pairs:                2.1527%             573         2.1527%             171900
insert size avg:          435.68


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  91.7499%           24422        91.7499%            3663300
unambiguous:             87.2943%           23236        87.2943%            3485400
ambiguous:                4.4556%            1186         4.4556%             177900
low-Q discards:           0.0000%               0         0.0000%                  0

In fact, the output posted by the original poster comes from the bbsplit.sh refstats option. Those results need to be read together with the main output of the bbsplit.sh run, which looks like the bbmap.sh output I posted above (bbsplit.sh uses bbmap.sh under the covers to do the read binning). An example of the refstats output looks like this:

#name   %unambiguousReads    unambiguousMB   %ambiguousReads ambiguousMB     unambiguousReads ambiguousReads   assignedReads   assignedBases
human   88.33496              7.053900           0.30806     0.024600         47026             164             47026          7053900
mouse   6.07108               0.484800           0.30806      0.024600         3232            164              3396            509400
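The numbers in the two outputs above are consistent with the percentage columns being computed against "Reads Used". The sketch below checks that arithmetic; the helper name `percent_of_total` is made up for illustration, and the reading of the denominator is inferred from the numbers rather than from the BBSplit documentation.

```python
def percent_of_total(reads, total_reads):
    """Return reads as a percentage of total_reads."""
    return 100 * reads / total_reads

reads_used = 53236  # "Reads Used" from the mapping summary above

# (name, %unambiguousReads, unambiguousReads, ambiguousReads,
#  assignedReads, assignedBases), copied from the refstats example.
rows = [
    ("human", 88.33496, 47026, 164, 47026, 7053900),
    ("mouse", 6.07108, 3232, 164, 3396, 509400),
]

for name, pct_reported, unamb, amb, assigned, assigned_bases in rows:
    pct_computed = percent_of_total(unamb, reads_used)
    bases_per_read = assigned_bases / assigned
    print(f"{name}: reported {pct_reported:.5f}%, "
          f"computed {pct_computed:.5f}%, "
          f"{bases_per_read:.0f} bases per assigned read")
```

In this example the computed and reported percentages agree, so 88.33496 reads as 88.33496% of the 53236 reads used, and assignedBases is assignedReads times the 150-base read length. For mouse, assignedReads (3396) equals unambiguousReads plus ambiguousReads (3232 + 164), whereas for human it equals unambiguousReads alone.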