Estimation sequencing sustitution error rate
2
0
Entering edit mode
9.2 years ago

Hi, I have data from a mtDNA sequencing on a 316 Ion PGM Chip. Sequence reads where mapped to rCRS reference assembly.

With mitoSeek I get a table that looks like this

Chr                                               Loc    Ref    A    T    C    G    a    t    c    g
gi|251831106|ref|NC_012920.1|    1    G    0    0    0    19    0    0    0    91
gi|251831106|ref|NC_012920.1|    2    A    27    0    0    0    92    0    0    0

Up to Loc 16569 (reference size), so now I wish to get the sequencing sustitution error rate, for this I write a perl script that

a = Count the number or reads that not match the reference base at the position

b = sum all the reads covering the position and

c = compute the ratio of not matching reference reads vs all reads covering the position

Do this for all the 16569 positions

d=d+c

then to get the sequencing substition error rate I do d/16569, I get a value of 0.0013711611325426

Is this procedure ok? or I'm missing something obvious

best

sequencing-error-rate • 1.9k views
ADD COMMENT
0
Entering edit mode
9.2 years ago

What does this approach do with reads containing indels? And also, reads unmapped due to a high error rate will not be counted, making the rate seem artificially lower.

To quantify substitution (and other error type) rates, I recommend mapping with BBMap; it will print them to the screen like this:

Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  62.3760%           31188        62.3760%            3118800
unambiguous:             61.0900%           30545        61.0900%            3054500
ambiguous:                1.2860%             643         1.2860%              64300
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        4.7260%            2363         4.7260%             236300
semiperfect site:         4.7260%            2363         4.7260%             236300
rescued:                  0.0000%               0

Match Rate:                   NA               NA        92.2582%            2877852
Error Rate:              92.4234%           28825         7.7418%             241495
Sub Rate:                92.4234%           28825         7.7219%             240872
Del Rate:                 0.1379%              43         0.0175%                547
Ins Rate:                 0.1539%              48         0.0024%                 76
N Rate:                   0.0000%               0         0.0000%                  0

It can also print out the correlation between base quality value and error rate, and the error rate by read position.

ADD COMMENT
0
Entering edit mode
9.2 years ago

I would recommend you to first collect those "c" values, build their distribution and have a look at them. You are trying to compute the mean value, while it would not be a very good estimator in case the distribution of "c" values is skewed (e.g. lognormal).

Another good idea would be to split those values in a matrix (A->G, A->C, ...), as substitution rates usually have a strong transition:transversion bias.

ADD COMMENT

Login before adding your answer.

Traffic: 1891 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6