Given a list of simulated read pairs and the sorted (by read id) BAM output of STAR and HISAT, I would like to evaluate the mapping quality by assigning each pair to one of the following categories: wrong chromosome, partially correct, or perfectly mapped, looking only at the primary alignments reported. I would like to do this with htsjdk in Java.
I need help with the following questions, as I do not completely understand the BAM file documentation.
- In a sorted BAM file, will the second mate always come directly after the corresponding first mate of a read pair?
- What if one of the mates is unmapped? Will it still be stored, or could it happen that there is no record for it at all?
- What if both mates are unmapped? Will they still be reported or simply left out?
So the general idea would be to simply walk through the list of read pairs without checking the ids and compare each pair with the corresponding records in the sorted BAM. That would only be possible if, in every case, a read pair gets at least one record for each mate, even if one or both mates are unmapped. So does this work?
Thanks for all suggestions.
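Whether unmapped mates are written at all depends on how the aligner was run, so it may be safer to look records up by read name than to walk both lists in lockstep. Below is a minimal sketch of that idea, not a definitive implementation. It uses only the standard library: `Rec`, `PairLookup`, `index` and `status` are hypothetical names made up for illustration, and the records are reduced to a name plus the SAM FLAG value. With htsjdk, the same bits should be available on `SAMRecord` through accessors such as `getReadUnmappedFlag()`, `getFirstOfPairFlag()` and `isSecondaryOrSupplementary()`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PairLookup {
    // SAM FLAG bits (from the SAM specification).
    static final int READ_UNMAPPED = 0x4;
    static final int FIRST_OF_PAIR = 0x40;
    static final int NOT_PRIMARY = 0x100 | 0x800; // secondary or supplementary

    /** Hypothetical stand-in for an alignment record: read name and FLAG only. */
    static final class Rec {
        final String name;
        final int flags;
        Rec(String name, int flags) { this.name = name; this.flags = flags; }
    }

    enum Status { BOTH_MAPPED, ONE_UNMAPPED, BOTH_UNMAPPED, RECORD_MISSING }

    /** Index primary records by read name; slot 0 holds mate 1, slot 1 holds mate 2. */
    static Map<String, Rec[]> index(List<Rec> records) {
        Map<String, Rec[]> byName = new HashMap<>();
        for (Rec r : records) {
            if ((r.flags & NOT_PRIMARY) != 0) continue; // skip non-primary lines
            // Assumption: paired-end data, so anything not flagged as first is mate 2.
            int slot = (r.flags & FIRST_OF_PAIR) != 0 ? 0 : 1;
            byName.computeIfAbsent(r.name, k -> new Rec[2])[slot] = r;
        }
        return byName;
    }

    /** Report what the file actually contains for one simulated pair. */
    static Status status(Rec[] mates) {
        if (mates == null || mates[0] == null || mates[1] == null) {
            return Status.RECORD_MISSING; // at least one mate has no record at all
        }
        int unmapped = 0;
        for (Rec m : mates) {
            if ((m.flags & READ_UNMAPPED) != 0) unmapped++;
        }
        if (unmapped == 0) return Status.BOTH_MAPPED;
        return unmapped == 1 ? Status.ONE_UNMAPPED : Status.BOTH_UNMAPPED;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            new Rec("pair1", 0x1 | 0x40),
            new Rec("pair1", 0x1 | 0x80),
            new Rec("pair2", 0x1 | 0x80 | 0x4), // mate 2 unmapped, listed first
            new Rec("pair2", 0x1 | 0x40));
        Map<String, Rec[]> byName = index(records);
        for (String id : Arrays.asList("pair1", "pair2", "pair3")) {
            System.out.println(id + " " + status(byName.get(id)));
        }
    }
}
```

Holding every record in a map is only reasonable for small simulated sets; for a large name-sorted file the same grouping could be done on consecutive records that share a name.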
Thanks for this information. One little thing is still bothering me: I have now removed all alignments that are not primary and managed to determine the correct read pairs. I store the first record I encounter as read 1 and the next one as read 2. According to your explanation, it could happen that the record stored first is actually read 2 and vice versa. Am I right? So, in order to do this properly, I still need to check the first-of-pair flag?
That's correct. They are guaranteed to be name-sorted, but that's all. Since read 1 and read 2 will have identical names, there's no way pure name-sorting can ensure that read 1 comes first.
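As a rough illustration of that check, here is a small sketch that assigns mates by flag instead of by position. It is standard-library Java only; `MateOrder`, `Rec` and `orderMates` are hypothetical names, and a record is modelled as just a read name and a FLAG value. In htsjdk the equivalent tests would presumably be `SAMRecord.getFirstOfPairFlag()` and `getSecondOfPairFlag()`.

```java
import java.util.Arrays;
import java.util.List;

class MateOrder {
    // SAM FLAG bits (from the SAM specification).
    static final int FIRST_OF_PAIR = 0x40;
    static final int SECOND_OF_PAIR = 0x80;
    static final int SECONDARY = 0x100;
    static final int SUPPLEMENTARY = 0x800;

    /** Hypothetical stand-in for an alignment record: read name and FLAG only. */
    static final class Rec {
        final String name;
        final int flags;
        Rec(String name, int flags) { this.name = name; this.flags = flags; }
    }

    static boolean isPrimary(Rec r) {
        return (r.flags & (SECONDARY | SUPPLEMENTARY)) == 0;
    }

    /**
     * Takes all records that share one read name, in file order, and returns
     * {mate1, mate2} decided by the FLAG bits. A slot stays null if that mate
     * has no primary record in the input.
     */
    static Rec[] orderMates(List<Rec> sameName) {
        Rec[] mates = new Rec[2];
        for (Rec r : sameName) {
            if (!isPrimary(r)) continue;
            if ((r.flags & FIRST_OF_PAIR) != 0) {
                mates[0] = r;
            } else if ((r.flags & SECOND_OF_PAIR) != 0) {
                mates[1] = r;
            }
        }
        return mates;
    }

    public static void main(String[] args) {
        // Read 2 happens to be listed before read 1, which name-sorting allows.
        List<Rec> group = Arrays.asList(
            new Rec("pair1", 0x1 | 0x80),
            new Rec("pair1", 0x1 | 0x40 | 0x100), // secondary line, ignored
            new Rec("pair1", 0x1 | 0x40));
        Rec[] mates = orderMates(group);
        System.out.println("mate 1 flags: " + mates[0].flags);
        System.out.println("mate 2 flags: " + mates[1].flags);
    }
}
```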
OK, thanks a lot. It's working fine now.