I want to determine the substitution error rate and the indel error rate for a given BAM file.
I've been reading over the following article.
First off I just want to confirm my definitions of a substitution error and indel error are correct.
A substitution error would be when the sequencer reports a different base from the actual base in the sequence being sequenced. So if the actual sequence was ..ATGG.. but the sequencer read ..ACGG.., then the reading of C instead of T would be a substitution error, since the read has a C where a T should be.
An indel error would be when the sequencer drops a base that is in the actual sequence, or inserts a base that is not. So if the actual sequence was ..GATG.. and the sequencer read ..GTG.., then the missing A would be an indel error, since the read lacks an A that should be there.
Assuming those are right, I am also unsure how to determine these rates from a BAM file. If a tool exists that does this, that would be great.
For substitution errors I know it would involve the quality score for a given base and the number of mismatches. By mismatches I mean that if a given position has 100X coverage and 98 of the reads show T but the other two don't, then there are 2 mismatches at that position. I'm just not sure how exactly I would combine these to find a rate.
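To make the substitution idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions: the function names `mismatch_rate` and `phred_to_error_prob` are made up, the input is just a list of strings holding the bases seen at each position (not a real BAM), and it assumes the most common base at each position is the true one, which would wrongly count real variants as errors. The only part I'm sure of is the Phred conversion, where a quality Q corresponds to an error probability of 10^(-Q/10).

```python
from collections import Counter


def mismatch_rate(pileup_columns):
    """Observed mismatch rate over a set of positions.

    pileup_columns: iterable of strings, each holding the bases
    observed at one reference position (hypothetical input format).
    """
    mismatches = 0
    total = 0
    for bases in pileup_columns:
        counts = Counter(bases)
        if not counts:
            continue  # no coverage at this position
        # Assumption: the most common base is the true base.
        majority = counts.most_common(1)[0][1]
        depth = sum(counts.values())
        mismatches += depth - majority
        total += depth
    return mismatches / total if total else 0.0


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


# The 100X example from above: 98 reads show T, 2 show something else.
column = "T" * 98 + "CG"
print(mismatch_rate([column]))
print(phred_to_error_prob(30))
```

So the observed rate would be mismatches divided by total bases, and the quality scores would give an expected rate to compare it against, but I don't know if that is the accepted way to do it.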
For indel errors I know it would involve homopolymers, as mentioned in the paper, but I have no idea how to find the rate.
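My best guess for indels is to count insertion and deletion events in the CIGAR strings of the aligned reads and divide by the number of aligned bases. The sketch below does that on plain CIGAR strings rather than a real BAM. The name `indel_rate` and the choice of denominator (bases covered by M, = and X operations) are my own assumptions, and it ignores homopolymer context entirely and can't tell real indel variants from sequencing errors.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_rate(cigars):
    """Indel events per aligned base over a set of CIGAR strings."""
    indel_events = 0
    aligned_bases = 0
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            if op in "ID":
                # Count each insertion or deletion as one event,
                # regardless of its length (an assumption).
                indel_events += 1
            elif op in "M=X":
                aligned_bases += int(length)
    return indel_events / aligned_bases if aligned_bases else 0.0


# Three hypothetical reads: one deletion, no indels, one insertion.
print(indel_rate(["50M1D49M", "100M", "30M2I68M"]))
```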