Question

Estimation sequencing sustitution error rate

0

Entering edit mode

9.2 years ago

cristianrohr768 ▴ 30

Hi, I have data from a mtDNA sequencing on a 316 Ion PGM Chip. Sequence reads where mapped to rCRS reference assembly.

With mitoSeek I get a table that looks like this

Chr                                               Loc    Ref    A    T    C    G    a    t    c    g
gi|251831106|ref|NC_012920.1|    1    G    0    0    0    19    0    0    0    91
gi|251831106|ref|NC_012920.1|    2    A    27    0    0    0    92    0    0    0

Up to Loc 16569 (reference size), so now I wish to get the sequencing sustitution error rate, for this I write a perl script that

a = Count the number or reads that not match the reference base at the position

b = sum all the reads covering the position and

c = compute the ratio of not matching reference reads vs all reads covering the position

Do this for all the 16569 positions

d=d+c

then to get the sequencing substition error rate I do d/16569, I get a value of 0.0013711611325426

Is this procedure ok? or I'm missing something obvious

best

sequencing-error-rate • 1.9k views

ADD COMMENT • link updated 24 months ago by Ram 43k • written 9.2 years ago by cristianrohr768 ▴ 30

Ram · Answer 1 · 2015-02-25

What does this approach do with reads containing indels? And also, reads unmapped due to a high error rate will not be counted, making the rate seem artificially lower.

To quantify substitution (and other error type) rates, I recommend mapping with BBMap; it will print them to the screen like this:

Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  62.3760%           31188        62.3760%            3118800
unambiguous:             61.0900%           30545        61.0900%            3054500
ambiguous:                1.2860%             643         1.2860%              64300
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        4.7260%            2363         4.7260%             236300
semiperfect site:         4.7260%            2363         4.7260%             236300
rescued:                  0.0000%               0

Match Rate:                   NA               NA        92.2582%            2877852
Error Rate:              92.4234%           28825         7.7418%             241495
Sub Rate:                92.4234%           28825         7.7219%             240872
Del Rate:                 0.1379%              43         0.0175%                547
Ins Rate:                 0.1539%              48         0.0024%                 76
N Rate:                   0.0000%               0         0.0000%                  0

It can also print out the correlation between base quality value and error rate, and the error rate by read position.

Ram · Answer 2 · 2015-02-25

I would recommend you to first collect those "c" values, build their distribution and have a look at them. You are trying to compute the mean value, while it would not be a very good estimator in case the distribution of "c" values is skewed (e.g. lognormal).

Another good idea would be to split those values in a matrix (A->G, A->C, ...), as substitution rates usually have a strong transition:transversion bias.