After reading the documentation at https://github.com/genome/bam-readcount/blob/master/README.textile there is one column in the output from bam-readcount that I still don't understand.
Columns 1-4 are chromosome, position, reference base, readcount. Columns 6-9 are colon separated values reflecting counts and quality metrics for reads with A, C, G, T, and ambiguous calls at the current position.
But what is column 5? In my output, it begins with an "=", then every colon separated value is "0" in each row.
Here is a sample row:
1 897037 C 27 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:25:51.36:30.44:35.72:12:13:0.44:0.01:8.20:13:0.41:74.48:0.51 G:2:60.00:32.50:37.00:1:1:0.58:0.01:32.50:1:0.24:76.00:0.28 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
Thank you very much.
Interestingly from going through the source code, it looks like that the "=" call should only have counts when bam1_seqi returns a 0, which shouldn't happen.
I'm trying to understand bam1_seqi from the samtools distribution and the documentation at http://samtools.sourceforge.net/samtools/bam/PDefines/PDefines.html. Is it correct that bam1_seqi typically returns a base (or more accurately, a number that represents a base). Could it be that it returns an "=' when there is an error condition?
My comment was probably not very informative, sorry about that! From perusing the source code for bam-readcount, it seems that legitimate basecalls are "=", "A", "C", "G", "T", and "N". I had sort of assumed that "=" might mean "same as the reference", which would be useful, however there's nothing I could find to indicate that. Instead, what I found is that, for each read covering some base in the output in a pileup (from samtools, bam-readcount uses its C API), it finds the position in the read covering it and increments the various statistics according to the nucleotide at that position. It gets the nucleotide from a call to bam1_seqi on each read. bam1_seqi should only return 1, 2, 4, 8, and 15 (for A, C, G, T and N, respectively). The metrics for "=" will be updated when a 0 is returned, but that shouldn't occur. I wonder if this is just a place-holder for intended functionality then (unless I just missed something in the source code). If you don't get a reply from the author of bam-readcount in the next day or so, perhaps send him/her an email, since this seems undocumented.
BTW, your suspicion of only updating "=" on an error would be another good explanation.
Thank you for a very detailed reply!