Interpreting bbduk histogram headers and contents
2
2
Entering edit mode
2.5 years ago

We are a small group of undergrads, mostly sophomore, from a small HBCU, and learning bioinformatics and genomics that is not at all part of our regular syllabus, by trying to teach one another :)

Right now, we are focused on how to use BBmap and other such tools to process Illumina raw reads - but we are new to all of this, and experiencing a steep learning curve.

So could someone please educate us on how to interpret the **BBDUK.sh** stats options that create the various histogram files that are generated by invoking the following flags: qhist, qchist, aqhist, bqhist, and lhist ?

From several different posts here on BioStars and a few on SeqAnswers, we decided on this syntax for BBDUK based adapter and quality trimming, to keep it gentle (not too stringent) but also impose min length to prevent very crazy multi-mapping during read alignment to the reference genome (we're not there yet):

bbduk.sh -Xmx34g in=TrialSRR.fq.gz \
ftr=99  qtrim=r trimq=15 \
qhist=TrialSRR_minlen20.qhist \
qchist=TrialSRR_minlen20.qchist \
aqhist=TrialSRR_minlen20.aqhist \
bqhist=TrialSRR_minlen20.bqhist \
lhist=TrialSRR_minlen20.lhist;


When we generated these files, we see their contents as shown below. From these contents, think we understand a few of these numbers and column headings, but not sure that we do. Some things we gathered from this forum and the internet in general, as well as our educated guesses, are listed below. We request forum experts to add, comment, correct these inferences, and answer our pending questions. Thank you!

1. In our data, the maximum Illumina read length is 100 + 1 where the final position is poorly calibrated and ideally trimmed out. Base# here varies from 1-101 (human coding) or 0-100 (machine coding starting with 0 index)?
2. In our data, 40 is the highest PHRED score we see in our input Illumina FASTQ file, which is also the empirical maximum for the HiSeq 2500 platform used to create these data, or is the maximum 50 ?
3. In QHIST file, column total under count1 is # of nt for each PHRED score between 0 and 41 - totaling, 1614284414 and total fraction1 is 1.00003.
4. 15983014 reads * (100+1) nt / read = 1614284414 total nt across all reads in this library, so that tallies. It is not clear where this 1 extra nt comes from during sequencing, even from reading from other Biostars posts. Please help. We know this 1 nt must be removed, since it almost always poor quality, right?
5. In the BQHIST file, before any trimming of adapter or by quality, in the input file, where reads are uniformly 100nt long, these gives the stats for each of those 100 positions, right? As expected, PHRED score improves from 5'end downstream and slightly degrades at the 3' end, right?
7. In AQHIST file, we think the count1 is total number of input raw reads = 15983014 here, and we are guessing the number of reads with average PHRED score across it's entire length, and their corresponding fraction of the total is the third column, but not sure at all.

We guess the BQHIST headers mean the following: - but what are LW_1 and RW_1?

BaseNum=nt position in read
min_1=minimum PHRED score at this position
max_1=maximum PHRED score at this position
mean_1=average PHRED score at this position
Q1_1=1st quartile of PHRED score at this position
med_1=Median or 2nd quartile of PHRED score at this position
Q3_1=3rd quartile of PHRED score at this position
**LW_1 = ??? not sure!**  Help!
**RW_1 = ??? not sure!**  Help!


PS. Apologies for any weirdness in our post's formatting, it's our first time here, and trying to use this markdown system....

HIST file listing

-rw-rw-r-- 1 lotte lotte   27 Nov 16 21:00 TrialSRR_minlen20.lhist
-rw-rw-r-- 1 lotte lotte  820 Nov 16 21:00 TrialSRR_minlen20.qchist
-rw-rw-r-- 1 lotte lotte 3.8K Nov 16 21:00 TrialSRR_minlen20.bqhist
-rw-rw-r-- 1 lotte lotte 1.7K Nov 16 21:00 TrialSRR_minlen20.qhist
-rw-rw-r-- 1 lotte lotte  740 Nov 16 21:00 TrialSRR_minlen20.aqhis


LHIST file contents

lotte@Uni-HPC:~Learning$wc -l TrialSRR_minlen20.lhist 2 TrialSRR_minlen20.lhist lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.lhist
#Length Count
101 15983014


QCHIST file contents

lotte@Uni-HPC:~Learning$wc -l TrialSRR_minlen20.qchist 43 TrialSRR_minlen20.qchist lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.qchist
#Quality    count1  fraction1
0   277949  0.00017
1   0   0.00000
2   57657078    0.03572
3   0   0.00000
4   0   0.00000
5   369144  0.00023
6   1014149 0.00063
7   5704849 0.00353
8   5889111 0.00365
9   3397890 0.00210
10  4329169 0.00268
11  1607492 0.00100
12  1061212 0.00066
13  2845130 0.00176
14  1151754 0.00071
15  2854771 0.00177
16  4366874 0.00271
17  3243345 0.00201
18  8082441 0.00501
19  5925749 0.00367
20  6627712 0.00411
21  3527369 0.00219
22  6282643 0.00389
23  8649813 0.00536
24  11327269    0.00702
25  17252728    0.01069
26  23676889    0.01467
27  19632016    0.01216
28  17542010    0.01087
29  31795702    0.01970
30  44642598    0.02765
31  67690229    0.04193
32  53267289    0.03300
33  66360275    0.04111
34  129894448   0.08047
35  164453500   0.10187
36  90828097    0.05627
37  129516469   0.08023
38  114018646   0.07063
39  140521421   0.08705
40  173852845   0.10770
41  183146339   0.11345


BQHIST file contents

lotte@Uni-HPC:~Learning$wc -l TrialSRR_minlen20.bqhist 102 TrialSRR_minlen20.bqhist lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.bqhist
#BaseNum    count_1 min_1   max_1   mean_1  Q1_1    med_1   Q3_1    LW_1    RW_1
0   15983014    0   34  31.06   31  31  33  23  34
1   15983014    2   34  30.70   31  31  34  16  34
2   15983014    2   34  28.71   27  31  34  10  34
3   15983014    2   37  33.42   33  35  37  10  37
4   15983014    2   37  34.49   35  35  37  19  37
5   15983014    2   37  34.86   35  35  37  23  37
6   15983014    2   37  35.08   35  36  37  25  37
7   15983014    2   37  35.12   35  37  37  25  37
8   15983014    2   39  36.66   35  39  39  25  39
9   15983014    2   39  36.65   35  39  39  25  39
10  15983014    2   39  36.66   35  39  39  25  39
11  15983014    2   39  36.39   35  38  39  19  39
12  15983014    2   39  36.47   35  38  39  23  39
13  15983014    2   41  37.58   36  39  41  24  41
14  15983014    2   41  37.66   36  39  41  25  41
15  15983014    2   41  37.57   36  39  41  24  41
16  15983014    2   41  37.28   36  39  41  17  41
17  15983014    2   41  37.45   36  39  41  23  41
18  15983014    2   41  37.60   36  39  41  24  41
19  15983014    0   41  37.58   36  39  41  24  41
20  15983014    0   41  37.58   36  39  41  24  41
21  15983014    0   41  37.52   36  39  41  24  41
22  15983014    0   41  37.46   36  39  41  23  41
23  15983014    0   41  37.39   36  39  41  22  41
24  15983014    0   41  37.34   36  39  41  21  41
25  15983014    0   41  37.09   36  39  40  18  41
26  15983014    0   41  37.05   36  39  40  18  41
27  15983014    0   41  36.45   36  39  40  10  41
28  15983014    0   41  36.70   36  39  40  16  41
29  15983014    0   41  36.72   36  39  40  16  41
30  15983014    0   41  36.55   36  39  40  15  41
31  15983014    0   41  36.42   35  38  40  15  41
32  15983014    0   41  36.46   35  38  40  15  41
33  15983014    0   41  36.42   35  38  40  15  41
34  15983014    0   41  36.20   35  38  40  9   41
35  15983014    0   41  35.95   35  38  40  9   41
36  15983014    0   41  36.29   35  38  40  15  41
37  15983014    0   41  36.24   35  38  40  9   41
38  15983014    0   41  36.14   35  38  40  9   41
39  15983014    0   41  35.97   35  38  40  9   41
40  15983014    0   41  36.12   35  38  40  9   41
41  15983014    0   41  36.09   35  38  40  9   41
42  15983014    0   41  36.03   35  38  40  9   41
43  15983014    0   41  35.82   35  38  40  9   41
44  15983014    2   41  35.93   35  38  40  9   41
45  15983014    2   41  36.10   35  38  40  9   41
46  15983014    0   41  35.82   35  38  40  8   41
47  15983014    0   41  35.87   35  38  40  9   41
48  15983014    0   41  35.97   35  38  40  9   41
49  15983014    0   41  35.91   35  38  40  8   41
50  15983014    0   41  35.47   34  38  40  7   41
51  15983014    2   41  35.55   34  38  40  7   41
52  15983014    2   41  35.56   34  38  40  7   41
53  15983014    2   41  35.51   34  38  40  7   41
54  15983014    2   41  35.18   34  38  40  2   41
55  15983014    2   41  35.16   34  38  40  2   41
56  15983014    2   41  34.94   33  38  40  2   41
57  15983014    2   41  34.87   33  38  40  2   41
58  15983014    2   41  34.83   33  38  40  2   41
59  15983014    2   41  34.71   33  38  40  2   41
60  15983014    2   41  34.53   33  37  40  2   41
61  15983014    2   41  34.36   33  37  40  2   41
62  15983014    2   41  34.15   32  37  40  2   41
63  15983014    2   41  34.03   32  37  40  2   41
64  15983014    2   41  33.92   32  37  39  2   41
65  15983014    2   41  33.65   32  36  39  2   41
66  15983014    2   41  33.18   31  36  39  2   41
67  15983014    2   41  33.09   31  36  39  2   41
68  15983014    2   41  32.85   31  36  39  2   41
69  15983014    2   41  32.72   31  35  39  2   41
70  15983014    2   41  32.42   31  35  38  2   41
71  15983014    0   41  32.24   31  35  38  2   41
72  15983014    0   41  32.11   31  35  38  2   41
73  15983014    0   41  31.94   31  35  37  2   41
74  15983014    0   41  31.71   30  35  37  2   41
75  15983014    0   41  30.53   29  34  36  2   40
76  15983014    0   41  30.96   30  34  36  2   40
77  15983014    0   41  31.00   30  34  36  2   40
78  15983014    0   41  31.11   30  34  36  2   40
79  15983014    0   41  30.98   30  34  36  2   40
80  15983014    0   41  30.80   30  34  36  2   39
81  15983014    0   41  30.67   30  34  36  2   39
82  15983014    0   41  30.53   30  34  35  2   39
83  15983014    0   41  30.31   30  34  35  2   39
84  15983014    0   41  29.73   29  34  35  2   39
85  15983014    0   41  29.65   29  34  35  2   38
86  15983014    0   41  29.50   29  34  35  2   37
87  15983014    0   41  29.18   29  34  35  2   37
88  15983014    2   41  29.06   29  34  35  2   37
89  15983014    0   41  28.79   29  34  35  2   37
90  15983014    2   41  28.79   29  34  35  2   37
91  15983014    0   41  28.68   29  33  35  2   36
92  15983014    0   41  28.49   29  33  35  2   36
93  15983014    0   41  28.34   29  33  35  2   36
94  15983014    2   41  28.11   29  33  35  2   36
95  15983014    0   41  27.54   27  33  35  2   36
96  15983014    2   40  27.41   27  33  35  2   36
97  15983014    2   40  27.13   26  33  35  2   36
98  15983014    2   40  26.88   26  33  35  2   36
99  15983014    2   39  26.50   25  33  35  2   36
100 15983014    0   39  24.30   19  30  34  2   35


QHIST file contents

lotte@Uni-HPC:~Learning$wc -l TrialSRR_minlen20.qhist 102 TrialSRR_minlen20.qhist lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.qhist
1   31.063  26.885
2   30.698  24.777
3   28.714  20.556
4   33.421  24.922
5   34.491  29.127
6   34.864  29.350
7   35.080  29.805
8   35.118  30.052
9   36.658  30.166
10  36.652  30.040
11  36.662  30.395
12  36.389  28.285
13  36.465  29.323
14  37.579  29.032
15  37.663  30.035
16  37.574  29.065
17  37.282  27.307
18  37.446  29.248
19  37.598  29.421
20  37.576  29.258
21  37.578  29.198
22  37.519  28.767
23  37.462  28.431
24  37.387  27.753
25  37.338  27.351
26  37.086  25.540
27  37.050  25.438
28  36.447  23.315
29  36.702  24.383
30  36.717  24.062
31  36.555  23.301
32  36.422  22.903
33  36.459  22.835
34  36.423  22.594
35  36.197  21.844
36  35.950  21.385
37  36.290  21.849
38  36.243  21.372
39  36.145  20.975
40  35.969  20.686
41  36.124  20.852
42  36.087  20.575
43  36.031  20.427
44  35.823  19.860
45  35.934  20.178
46  36.099  20.175
47  35.817  18.978
48  35.872  19.512
49  35.970  19.451
50  35.906  19.198
51  35.474  18.051
52  35.548  18.662
53  35.564  18.477
54  35.506  18.265
55  35.178  17.768
56  35.164  17.730
57  34.940  17.352
58  34.868  17.281
59  34.826  17.139
60  34.712  16.925
61  34.534  16.697
62  34.362  16.509
63  34.147  16.268
64  34.035  16.146
65  33.921  15.996
66  33.645  15.748
67  33.178  15.393
68  33.091  15.371
69  32.846  15.159
70  32.724  15.043
71  32.425  14.815
72  32.241  14.697
73  32.106  14.588
74  31.937  14.460
75  31.713  14.296
76  30.530  14.150
77  30.962  14.091
78  30.998  13.886
79  31.111  13.882
80  30.976  13.727
81  30.800  13.572
82  30.667  13.455
83  30.529  13.311
84  30.307  13.133
85  29.732  12.769
86  29.653  12.709
87  29.497  12.535
88  29.178  12.281
89  29.063  12.139
90  28.794  11.890
91  28.790  11.777
92  28.681  11.583
93  28.490  11.350
94  28.342  11.122
95  28.105  10.832
96  27.542  10.419
97  27.407  10.170
98  27.129  9.831
99  26.876  9.495
100 26.500  9.128
101 24.302  8.607


AQHIST file contents

lotte@Uni-HPC:~Learning$wc -l TrialSRR_minlen20.aqhist 42 TrialSRR_minlen20.aqhist lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.aqhist
#Quality    count1  fraction1
0   0   0.00000
1   0   0.00000
2   1003    0.00006
3   40192   0.00251
4   128895  0.00806
5   199688  0.01249
6   226919  0.01420
7   213066  0.01333
8   174075  0.01089
9   206481  0.01292
10  216097  0.01352
11  226341  0.01416
12  228745  0.01431
13  249807  0.01563
14  258218  0.01616
15  227851  0.01426
16  229582  0.01436
17  253982  0.01589
18  238962  0.01495
19  307747  0.01925
20  300699  0.01881
21  387786  0.02426
22  478053  0.02991
23  416609  0.02607
24  464175  0.02904
25  537010  0.03360
26  619228  0.03874
27  531987  0.03328
28  577129  0.03611
29  637311  0.03987
30  330176  0.02066
31  416774  0.02608
32  542824  0.03396
33  699675  0.04378
34  771215  0.04825
35  881653  0.05516
36  1175349 0.07354
37  1504781 0.09415
38  974260  0.06096
39  108547  0.00679
40  122 0.00001

bbduk bbmap histograms • 907 views
4
Entering edit mode
2.5 years ago
h.mon 34k

1. It seems bbduk is inconsistent in its indexing scheme, BQHIST uses zero-based index, and QHIST uses one-based index. I wouldn't call them "machine coding" vs "human coding", although it is true one-based systems are easier for humans to interpret, and zero-based systems are handier for computers to calculate.

2. Current version of Illumina FASTQ assigns scores between 0-40 to raw sequencing reads - see FASTQ format. Base qualities can be go above 40 in certain situations, e.g., when overlapping forward and reverse reads.

3. Probably rounding error.

4. I will let someone more knowledgeable answer this one.

5. Yes, but the drop in positions 2-3 is not very common.

6. In the reformat tool of BBmap what is the meaning of Read1_linear and Read1_log

7. Your interpretation seems correct, but here again there is likely rounding error, as the third column tallies to 0.999980.

0
Entering edit mode

Thank you so very much for easy to understand answers not loaded with excessive jargon or technically complexities.

I think the only question yet to be answered is: what are LW_1 and RW_1? Hopefully someone answers that as well.

We do have one follow-up question - I think you meant to explain that PHRED score drop at position 1 is expected, but not so much at position 2 and 3, right? If that is the case, then is this symptomatic of some sort of sequencing or processing problem that can be fixed at the processing step, before mapping these RNA-seq reads to a reference genome? Please advice.

BTW, forum members here so awesome, friendly and helpful :) Thanks again to forum members and maintainers...

0
Entering edit mode

LW_1 and RW_1 are the boxplot whisker values, set at the 2nd and 98th percentiles.

I wouldn't bother too much at the initial drop in quality, as it is short (positions 2 and 3) and not very strong (average quality is still 20). You could have asked for the base composition histogram (bhist) to check if there is an acumulation of Ns at these positions.

0
Entering edit mode

I suppose LW and RW might be acronyms for left and right whiskers respectively.

We will inquire for N % in the histogram results. Thanks for that suggestion.

Sorry for the late response, we all had papers, exams etc :)

4
Entering edit mode
2.5 years ago

2) Minor correction: the top PHRED score is 41 (ASCII character 'J' in Sanger format).

4) The instrument is programmed to sequence +1 cycle relative to the desired read length. This additional nucleotide is used for phasing/prephasing base-call correction.

5) Quality scores are typically lower for the first few cycles of sequencing.

A detailed discussion of point's 4 & 5 can be found here.

0
Entering edit mode

Thank you so very much for easy to understand answers not loaded with excessive jargon or technically complexities.

I think the only question yet to be answered is: what are LW_1 and RW_1? Hopefully answers that as well.

Forum members here so awesome, friendly and helpful :) Thanks again to forum members and maintainers...