Getting The Error Probability Statistics From A (Large) Fastq File
3
1
11.1 years ago

Hi! I wonder if any of you knows an efficient way to get per-base error probability statistics from a FASTQ file, something similar to the FASTX Statistics output, but with the mean error probability instead of the mean quality score.

I know that:

P = 10**(-Q/10)   #where P = error probability and Q = quality score


But for n reads:

10**(-mean_Q/10) != (10**(-Q1/10) + 10**(-Q2/10) + ... + 10**(-Q(n-1)/10) + 10**(-Qn/10))/n


So it isn't right to calculate the mean error probability from the mean quality score!
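A quick example with two made-up scores, Q=10 and Q=40, shows how different the two calculations are:

```python
# Two base qualities: Q=10 (P=0.1) and Q=40 (P=0.0001).
qs = [10, 40]

# Converting the mean quality to a probability:
mean_q = sum(qs) / float(len(qs))               # 25.0
p_from_mean_q = 10 ** (-mean_q / 10)            # ~0.00316

# Averaging the per-base error probabilities:
mean_p = sum(10 ** (-q / 10.0) for q in qs) / float(len(qs))   # 0.05005

assert p_from_mean_q != mean_p   # off by more than a factor of 10
```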

I have been writing a python script to do this:

import sys
from Bio import SeqIO
from Bio.Seq import Seq

def ErrorCal(fastqfile):
    f = open(fastqfile)
    Ptotal = []
    for record in SeqIO.parse(f, "fastq"):
        Pseq = []
        for Q in record.letter_annotations["phred_quality"]:    # A useful Biopython feature to get the scores from a FASTQ
            P = 10**(float(-Q)/10)
            Pseq = Pseq + [P]
        Ptotal = Ptotal + [Pseq]

    for l in range(100):     # My reads are no longer than 100 nt
        Pn = []
        for P in Ptotal:
            try:
                Pn = Pn + [P[l]]
            except IndexError:          # Some reads are shorter than 100 nt
                pass
        print l, len(Pn), float(sum(Pn))/float(len(Pn))

if __name__ == '__main__':
    ErrorCal(sys.argv[1])


And I get a output file like this:

# base position, count and mean error probability for this column
0 10000 0.00478373679631
1 10000 0.00643301993251
2 10000 0.00603326958587
3 10000 0.00661667482542
4 10000 0.00698002326627
5 10000 0.00748116353249
6 10000 0.00750653103859
7 10000 0.00854730048805
8 10000 0.00782628516627
9 10000 0.00799688282585
10 10000 0.0100680081219
11 10000 0.010554252361
12 10000 0.0115042653879
13 10000 0.0110789410521
14 10000 0.0117936283471
15 10000 0.0135361334883
16 10000 0.0133674453765
17 10000 0.0135114391479
18 10000 0.0141544563
19 10000 0.0149347758654
20 10000 0.0168404463873
21 10000 0.0169130212498
22 10000 0.0170759788356
23 10000 0.0173747545091
24 10000 0.0184760229549
25 10000 0.024253460544
26 10000 0.0245203287266
27 10000 0.0244003522836
28 10000 0.0249417140232
29 10000 0.0261188559013
30 10000 0.0291030174931
31 10000 0.0282340994484
32 10000 0.0291518845385
33 10000 0.0291175488217
34 10000 0.0306407375192
35 10000 0.0339538051169
36 10000 0.0334696957594
37 10000 0.0344348970707
38 10000 0.0348958793417
39 10000 0.036515198508
40 10000 0.0418699442861
41 10000 0.0420951993255
42 10000 0.0429189013631
43 10000 0.0431318680288
44 10000 0.0456098420762
45 10000 0.049845396868
46 10000 0.0497987016425
47 10000 0.0515204547988
48 10000 0.0528480344261
49 10000 0.0567458374262
50 9863 0.0623504979446
51 9743 0.0620382298456
52 9623 0.0619351347882
53 9501 0.0629058622748
54 9383 0.0700486609625
55 9244 0.0757105048112
56 9114 0.0756137495837
57 8986 0.0762488957032
58 8856 0.0759760367935
59 8716 0.0774635223647
60 8575 0.0922178834682
61 8442 0.0929096892755
62 8317 0.0919743932008
63 8184 0.0921488166949
64 8024 0.0957611020664
65 7872 0.106813164177
66 7745 0.107209713379
67 7593 0.107154062214
68 7434 0.109305982214
69 7304 0.1141252995
70 7125 0.128521819456
71 6967 0.128121744318
72 6823 0.127207017742
73 6664 0.127629161143
74 6544 0.137642486172
75 6389 0.162564673408
76 6252 0.163517319618
77 6112 0.158330043999
78 5965 0.157530635622
79 5812 0.161409164248
80 5679 0.173270040725
81 5548 0.173095663172
82 5405 0.17457130032
83 5299 0.176796373312
84 5158 0.185559331665
85 5030 0.209944257038
86 4880 0.208303182059
87 4711 0.208075029122
88 4576 0.213621562692
89 4455 0.237732658313
90 4328 0.262654493301
91 4233 0.267632810012
92 4131 0.269002973405
93 4050 0.27254880657
94 3971 0.287590375753
95 3919 0.332464474206
96 3884 0.336350390258
97 3873 0.340828674427
98 3873 0.347796213406
99 3873 0.364653851719


The big problem with my code is that it takes 37.343 s to process 10,000 reads, and I need to process about 900,000,000 reads, so it would take more than a month of CPU time to complete.

Can you give me advice on my Python code, or point me to a tool that does this faster?

fastq python next-gen sequencing statistics • 5.2k views
5
11.1 years ago
Ryan Dale 5.0k

Try this version . . . Using a test file of 10,000 36-bp reads, this version is 22x faster:

• original version: 12.8s
• this version: 0.59s

The output is exactly the same. The trick is to minimize how much looping you're doing. Here, I'm not creating a new list each time, just repeatedly updating the same list of cumulative sums and incrementing another list if there's a base in that position. Then, at the end, these two lists are used to print the report.

import sys
from Bio import SeqIO
from Bio.Seq import Seq

MAX_READ_LENGTH = 100   # longest read expected in the file

def ErrorCal(fastqfile):
    f = open(fastqfile)
    prob_sums = [0 for i in range(MAX_READ_LENGTH)]
    prob_pos = prob_sums[:]
    for record in SeqIO.parse(f, "fastq"):
        for bp, Q in enumerate(record.letter_annotations["phred_quality"]):
            P = 10**(float(-Q)/10)
            prob_sums[bp] += P
            prob_pos[bp] += 1.0

    for bp, (prob_sum, pos_count) in enumerate(zip(prob_sums, prob_pos)):
        if pos_count == 0:
            break
        print bp, pos_count, prob_sum / pos_count

if __name__ == '__main__':
    ErrorCal(sys.argv[1])

0

WOW!!! With the same input file it takes just 1.795 s!! Thanks for the corrections... I haven't been learning Python for long... THANKS!

5
11.1 years ago

Precomputing 10**(-Q/10) in an array will give an extra speedup.

MAX_Q = 41  # or the highest quality score your platform emits
logQs = [10 ** (float(-Q) / 10) for Q in range(MAX_Q + 1)]  # Precompute
for record in SeqIO.parse(f, "fastq"):
    for bp, Q in enumerate(record.letter_annotations["phred_quality"]):
        prob_sums[bp] += logQs[Q]  # Fetch the precomputed value
        prob_pos[bp] += 1.0


EDIT: delete the comment about OFFSET.
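If you're unsure what ceiling to use, a quick sanity check confirms the table reproduces the direct formula (MAX_Q = 41 here is an assumed quality ceiling; Illumina 1.8+ tops out around there):

```python
# Sanity check for the lookup table: every entry must equal the direct
# formula P = 10**(-Q/10). MAX_Q = 41 is an assumed quality ceiling.
MAX_Q = 41
logQs = [10 ** (float(-Q) / 10) for Q in range(MAX_Q + 1)]

for Q in range(MAX_Q + 1):
    assert abs(logQs[Q] - 10 ** (float(-Q) / 10)) < 1e-15
assert logQs[0] == 1.0                   # Q=0 means an error probability of 1
assert abs(logQs[20] - 0.01) < 1e-12     # Q=20 is the familiar 1-in-100
```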

0

Ah nice, good call.

0

why not add the exponents and calculate (10 ** sum) / length at the end?
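For what it's worth, that would compute a different quantity: summing the exponents corresponds to multiplying the per-base probabilities, not adding them. A quick check with two hypothetical scores (Q=10 and Q=40):

```python
# Two example base qualities: Q=10 (P=0.1) and Q=40 (P=0.0001).
qs = [10, 40]

# What the report needs: the arithmetic mean of the probabilities.
mean_p = (10 ** -1.0 + 10 ** -4.0) / 2          # 0.05005

# Summing the exponents first multiplies the probabilities instead:
proposed = 10 ** (-sum(qs) / 10.0) / len(qs)    # 1e-5 / 2 = 5e-06

assert proposed != mean_p
```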

0

Thanks Haibao, your improvement works nicely.

5
11.1 years ago
lh3 33k

To give you a sense of the difference between C and Python: download kseq.h and compile the program at the end (save it as avgq.c) with:

gcc -g -Wall -O2 avgq.c -o avgq -lm -lz


On a 400,000-line FASTQ, daler's version with Haibao's improvement takes 8.7 CPU seconds, while the C version takes 0.31 seconds, more than 20x faster. The bottleneck of the Python script is looping over all the qualities, and C is by far faster for that loop. The C program is not much longer than the Python script.

Note that for a huge fastq, the running err[] sums grow so large that the tiny probabilities added to them start to be lost to floating-point round-off. You would need two err[] arrays, one for a partial sum and the other for the complete sum.

#include <zlib.h>
#include <math.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

#define MAX_LEN 1024
int main(int argc, char *argv[])
{
    gzFile fp;
    kseq_t *seq;
    long i, cnt[MAX_LEN];
    double err[MAX_LEN], table[128];

    fp = gzopen(argv[1], "r");
    seq = kseq_init(fp);
    for (i = 0; i < 128; ++i) table[i] = pow(10, -i/10.);
    for (i = 0; i < MAX_LEN; ++i) cnt[i] = 0, err[i] = 0.;
    while (kseq_read(seq) >= 0)
        for (i = 0; i < (long)seq->qual.l; ++i)
            ++cnt[i], err[i] += table[seq->qual.s[i] - 33]; // Phred+33; watch the precision!
    for (i = 0; i < MAX_LEN && cnt[i]; ++i)
        printf("%ld\t%ld\t%f\n", i, cnt[i], err[i] / cnt[i]);
    kseq_destroy(seq);
    gzclose(fp);
    return 0;
}

0

well, I have always heard (mainly from my dad) that C is much faster than scripting languages... Now I see empirical evidence of that. Thanks!