When using bedtools genomecov max function I get values reported for some chromosomes/reference sequences which are above the max depth set. This is not consistent between files and always occurs on the last base or two of the reference genome. The number of chromosomes affected varies between bam files.
For example using code:
bedtools genomecov -ibam In.bam -bga -max 1 > out.txt
for this particular file I got all 734562 lines reported as either 0 or 1 (covered vs uncovered) with the exception of 7 lines:
NC_011203.1 9997 9998 1
NC_011203.1 9998 9999 1
NC_011203.1 9999 10000 1
MH558113.1 35932 35933 7
AC_000008.1 35937 35938 8
NC_001405.1 35936 35937 8
AC_000017.1 36000 36001 9
HQ003817.1 35817 35818 9
HQ413315.1 35758 35759 9
MH121097.1 35750 35752 92
I have looked at the bam on IGV and there are 92 aligned reads at bases 35750 to 35752 for MH121097.1and the reference genome is 35752bp in length.
Does anyone know how I can fix this or where I can get some help? Should I report as an issue on GitHub (I was wary of doing so in case It was something I was doing wrong)?
For context: my bam file contains reads aligned against 104 different reference sequences using bowtie2 all. I am trying to find a way to write a script which could easily tell me which of these reference genomes are completely covered by my reads to look for mixed infections without having to do it manually. I think the way I am trying to do it is quite convoluted though and I may be missing a simpler solution. My steps are:
bedtools genomecov -ibam In.bam -bga -max 1 > out.txt
awk -F"\t" '!seen[$1, $4]++' out.txt > awk.txt
awk to keep lines which are unique in both the chromosome name and the depth - so keep chromosome name with coverage =0 and chromosome name with coverage =1 meaning I am left with one instance of covered and one instance of an area uncoveredawk '{print $1}' awk.txt > awk2.txt
remove all columns except the chromosome nameuniq -u awk2.txt > uniq.txt
use uniq to identify unique chromosome names (those which only have an entry with a 1 representing entirely covered genomes)
This works very well when I don't get these odd max depths which are over the defined max value of 1 but if anyone can think of a simpler way I can achieve this then I would also be very grateful for your help
Thanks
Thank you - I hadn't noticed that bit. It does work though - completely on some of my bam files with the same parameters set and across the majority of the rest of the files with the exception of the last base or two though? Surely if it didn't work for those options then my files would be reporting their actual coverage which is significantly higher than 1. Am I missing something?