Question: How To Interpret Output From Bam-Readcount
1
gravatar for merbergd
7.2 years ago by
merbergd10
merbergd10 wrote:

After reading the documentation at https://github.com/genome/bam-readcount/blob/master/README.textile there is one column in the output from bam-readcount that I still don't understand.

Columns 1-4 are chromosome, position, reference base, readcount. Columns 6-9 are colon separated values reflecting counts and quality metrics for reads with A, C, G, T, and ambiguous calls at the current position.

But what is column 5? In my output, it begins with an "=", then every colon separated value is "0" in each row.

Here is a sample row:

1 897037 C 27 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:25:51.36:30.44:35.72:12:13:0.44:0.01:8.20:13:0.41:74.48:0.51 G:2:60.00:32.50:37.00:1:1:0.58:0.01:32.50:1:0.24:76.00:0.28 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

Thank you very much.

samtools • 6.2k views
ADD COMMENTlink modified 7.2 years ago by ernfrid390 • written 7.2 years ago by merbergd10

Interestingly from going through the source code, it looks like that the "=" call should only have counts when bam1_seqi returns a 0, which shouldn't happen.

ADD REPLYlink written 7.2 years ago by Devon Ryan97k

I'm trying to understand bam1_seqi from the samtools distribution and the documentation at http://samtools.sourceforge.net/samtools/bam/PDefines/PDefines.html. Is it correct that bam1_seqi typically returns a base (or more accurately, a number that represents a base). Could it be that it returns an "=' when there is an error condition?

ADD REPLYlink modified 14 months ago by _r_am31k • written 7.2 years ago by merbergd10

My comment was probably not very informative, sorry about that! From perusing the source code for bam-readcount, it seems that legitimate basecalls are "=", "A", "C", "G", "T", and "N". I had sort of assumed that "=" might mean "same as the reference", which would be useful, however there's nothing I could find to indicate that. Instead, what I found is that, for each read covering some base in the output in a pileup (from samtools, bam-readcount uses its C API), it finds the position in the read covering it and increments the various statistics according to the nucleotide at that position. It gets the nucleotide from a call to bam1_seqi on each read. bam1_seqi should only return 1, 2, 4, 8, and 15 (for A, C, G, T and N, respectively). The metrics for "=" will be updated when a 0 is returned, but that shouldn't occur. I wonder if this is just a place-holder for intended functionality then (unless I just missed something in the source code). If you don't get a reply from the author of bam-readcount in the next day or so, perhaps send him/her an email, since this seems undocumented.

BTW, your suspicion of only updating "=" on an error would be another good explanation.

ADD REPLYlink modified 14 months ago by _r_am31k • written 7.2 years ago by Devon Ryan97k

Thank you for a very detailed reply!

ADD REPLYlink written 7.2 years ago by merbergd10
2
gravatar for ernfrid
7.2 years ago by
ernfrid390
Saint Louis
ernfrid390 wrote:

The intent was to support BAM files where basecalls matching the reference are reported as an =. Based on the samtools code, bam_seqi should return a 0 if this character is used. Rather than doing the correct thing and interpreting those as the actual reference base, this was implemented as a separate bin to specifically track those bases reported as '='. I have yet to encounter a BAM encoded in this fashion and so the '=' section continues to sit by itself as speculative code. I'm loathe to remove it at this point as I don't want to break anyone's parsers unnecessarily.

ADD COMMENTlink modified 14 months ago by _r_am31k • written 7.2 years ago by ernfrid390

Out of curiosity, which version of samtools was this output of bam1_seqi based on? The recent versions of samtools that I've seen don't utilize a reference fasta here, but it'd be good to know if this was the case in some older version.

ADD REPLYlink modified 14 months ago by _r_am31k • written 7.2 years ago by Devon Ryan97k
1

Well, I'm basing this on the current and past pileup code: https://github.com/samtools/samtools/blob/develop/bam_plcmd.c#L74-L77 and not the implementation of bam1_seqi. It may be that the bam1_seqi has moved on from this functionality and what's in pileup_seq is cruft.

ADD REPLYlink written 7.2 years ago by ernfrid390

Interesting, I guess I'd never looked at the pileup code. Thanks!

ADD REPLYlink written 7.2 years ago by Devon Ryan97k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1863 users visited in the last hour