I have encountered similar questions elsewhere, but none exactly replicate my problem.
I ran bam-readcount for about 1000 sites (across all chromosomes) for two samples from the same individual. I know both bam files have reads at the sites in question, as these sites were called as variant by multiple somatic callers (mutect, varscan, etc). Yet, for one sample I get an empty output from bam-readcount, while for the other I get the expected counts. To illustrate the problem with 1 site, see below.
First, I make the input site list:
echo "1 884091 884091" | tr ' ' '\t' > test.bed
Second, I run bam-readcount:
bam-readcount --reference-fasta ref.fa --site-list test.bed sample1.bam > test1
bam-readcount --reference-fasta ref.fa --site-list test.bed sample2.bam > test2
When I do this, we see test1 is an empty file, whereas test2 has counts for the site.
Third, as a sanity chec, I use samtools to subset these bams using the same site list. This is to verify this site is indeed covered in both bam files:
samtools view -L test.bed sample1.bam > test3
samtools view -L test.bed sample2.bam > test4
When we do this, we see there is output for both files:
head -1 test3 test4
==> test3 <== HWI-D00310:198:CALR0ANXX:5:1116:18839:21292 99 1 883969 60 125M = 883971 127 GACCTGCTGGAACATCTGCCCCAAGGGCCGTGTCAGGCTCTCTCGGCCCCATGCCTGGTCACCCTGGCTTCACCCTGGCTGCACCCTGGTCCCCCTGGTCCCTTTGGCCCTGCACCTGGCTGCAC >??ADEEEGECCDDCEEGEDDDDCFEEED@BGBEDFEEEEEEEE@EEDDDDCGECEGEBEDEDDEGEEECEDEDDEFEEEGEDECDEGEBEDDDDEGCBDDDECCGEEDDEGEDEDEGEDCB?>@ MC:Z:125M BD:Z:KKMLKNNNMMLMMLMNKLLKIHJJKLHJJIJJILKMLJKJJJJJKJJJHHJLLKJHKKJLKJKHHKKJKIKKJKHHKKJKKKLJKHHKKJLKHHHILLKMLIIJEKLKKIILLMKLJMMLNLLMK MD:Z:125 RG:Z:BCALR0ANXX.5.TCCTGAGC-ACTGCATA BI:Z:PPSRUWUVVSTRROSSRTSQMMSPQPMRPPROOSRTPSTQRQRQTQSQMMSSTSQRTRSTSPRMRTRSTQRSPRMRTRSTTSSPRMRTRSTQNNNSUSTURNSRKTSTROTVUURTTVTUVTSSP NM:i:0 BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ MQ:i:60 AS:i:125 XS:i:19
==> test4 <== HWI-D00310:198:CALR0ANXX:4:2308:5529:20058 99 1 883970 60 125M = 883992 147 ACCTGCTGGAACATCTGCCCCAAGGGCCGTGTCAGGCTCTCTCGGCCCCATGCATGGTCACCCTGGCTTCACCCTGGCTGCACCCTGGTCCCCCTGGTCCCTTTGGCCCTGCACCTGGCTGCACC AA@12DE1@/ADCDFFFA-.AC=E/DED>AFAECGACFA0CDD?CD7?CDDGBABFEBFCBCEFGCFEB/CB.?DGEFFCF.DAEDGE-B/AEEDD<9EABFCBGDE//=0?0DDD0DCDD?>@@ MC:Z:125M BD:Z:KKLJPPNOMLMMJMNKLLKHHJJJLGIJHJIHKKMLJKJJJJJKKJKHHJLLLLLLKJLKJKHILKJKIKKJKHILKJKLLLJKHILKJLKHHHILKJLKIJJEKLKLIJMMMKLJNMLMOMMKL MD:Z:53C71 RG:Z:BCALR0ANXX.4.TCCTGAGC-TATCCTCT BI:Z:PPQRUUVTSSQQORSPRRPLLQNQPMRPPROORQRPRSOPOPORPRPLLQQRRRQRQRRQNPLQRQSTPRROQLRSRSTSSSOQMRSRSSQMMMRSRSSQMRPJSSTRNSTTTPSTUTUVUSSOQ NM:i:1 BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ MQ:i:60 AS:i:120 XS:i:20
If anyone has seen similar behavior and has any suggestions, would be greatly appreciated.
https://github.com/genome/bam-readcount
Could that be a problem that your list of regions is in bed-format?
The file is formatted as you describe. Furthermore, I don't believe it is a problem with the region list file because this file is working for sample2.bam
That's in bed-format!
you never know the internal working mechanisms of such programs, so it is better to follow exactly what is being said in the manual
Again, the test.bed file in my example is formatted as the manual instructs: three tab-separated columns (chr, start, end). The start and end having the same position does not matter for bam-readcount. To be sure, I have run this using different iterations of the region list file and the same issue occurs. Its very strange
Please use
ADD COMMENT
orADD REPLY
to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.