I have a WGS dataset, and when I try to use it with the sambamba depth command, I get the error "sambamba-depth: All files must be coordinate-sorted". What is the reason for this, and why is coordinate sorting required?
In an unsorted BAM file, reads can appear in any order. In a coordinate-sorted BAM file, reads are ordered by the position at which they map to the reference genome. With that ordering, to find the depth at a given position, the program only needs to navigate to that position and count the reads that overlap it. As soon as it encounters a read that starts beyond that position, it can stop looking, because every later read starts even further along.
In an unsorted BAM, the algorithm needs to look at every single read in the entire file before it can be sure that all reads aligned to the position of interest are accounted for.
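To make the difference concrete, here is a minimal Python sketch. It is not how sambamba is implemented; it assumes reads are simplified to hypothetical (start, end) tuples with a half-open interval, and the function names are made up for illustration. A real tool would also use the BAM index to jump near the position rather than scanning from the first read.

```python
def depth_sorted(reads, pos):
    """Depth at pos for reads sorted by start; returns (depth, reads_examined)."""
    depth = 0
    examined = 0
    for start, end in reads:
        examined += 1
        if start > pos:
            # Sorted input: every later read starts even further right.
            break
        if start <= pos < end:
            depth += 1
    return depth, examined


def depth_unsorted(reads, pos):
    """Depth at pos for reads in arbitrary order; must examine every read."""
    depth = 0
    examined = 0
    for start, end in reads:
        examined += 1
        if start <= pos < end:
            depth += 1
    return depth, examined


# Hypothetical toy data: (start, end) with end exclusive.
sorted_reads = [(0, 5), (2, 7), (4, 9), (10, 15), (20, 25)]
shuffled_reads = [(20, 25), (4, 9), (0, 5), (10, 15), (2, 7)]

print(depth_sorted(sorted_reads, 4))      # stops at the first read starting past 4
print(depth_unsorted(shuffled_reads, 4))  # has to look at all five reads
```

Both functions report the same depth, but only the sorted version can stop early.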
If your input has 500 positions, the sorted approach means going through the file once, jumping to each of the 500 positions. A naive unsorted approach means going through the file 500 times, which is extremely inefficient.
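The one-pass versus many-passes contrast can be sketched as follows. This is only an illustration under stated assumptions, not sambamba's actual algorithm: reads are hypothetical (start, end) tuples with end exclusive, the query positions are sorted, and the function names are invented for the example.

```python
import heapq


def depths_single_pass(sorted_reads, sorted_positions):
    """One sweep over coordinate-sorted reads for all query positions."""
    depths = {}
    active_ends = []  # min-heap of end coordinates of reads already started
    i = 0
    for pos in sorted_positions:
        # Pull in every read that starts at or before this position.
        while i < len(sorted_reads) and sorted_reads[i][0] <= pos:
            heapq.heappush(active_ends, sorted_reads[i][1])
            i += 1
        # Drop reads that ended at or before this position (end is exclusive).
        while active_ends and active_ends[0] <= pos:
            heapq.heappop(active_ends)
        depths[pos] = len(active_ends)
    return depths


def depths_repeated_scan(reads, positions):
    """Naive approach for unsorted reads: one full scan per position."""
    return {
        pos: sum(1 for start, end in reads if start <= pos < end)
        for pos in positions
    }


# Hypothetical toy data.
sorted_reads = [(0, 5), (2, 7), (4, 9), (10, 15), (20, 25)]
shuffled_reads = [(20, 25), (4, 9), (0, 5), (10, 15), (2, 7)]
positions = [4, 6, 12, 30]

print(depths_single_pass(sorted_reads, positions))
print(depths_repeated_scan(shuffled_reads, positions))
```

Each read is visited once in the single-pass version, whereas the repeated scan touches every read once per position, so its cost grows with the number of positions multiplied by the number of reads.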