Evaluation of simulated paired-read data: How are records stored in BAM files?
7.3 years ago
m.picciani • 0

Given a list of simulated read pairs and the BAM output of STAR and HISAT, sorted by read ID, I would like to evaluate mapping quality by assigning each pair to one of the following categories: wrong chromosome, partially correct, or perfectly mapped. Only the reported primary alignments should be considered. I would like to do this in Java with htsjdk.

I need help with the following questions, as I do not completely understand the BAM file documentation.

- In a sorted BAM file, will the second mate always come directly after the corresponding first mate of a read pair?

- What if one of those mates is unmapped? Will it still be stored, or could there be no record for it at all?

- What if both mates are unmapped? Will they still be reported, or are they simply dropped?

The general idea is to walk through the list of read pairs without checking the IDs and compare each pair against the corresponding records in the sorted BAM. That would be possible only if every read pair always gets at least one record for each mate, even when one or both mates are unmapped. Does this work?

Thanks for all suggestions.

alignment rna-seq • 1.2k views
7.3 years ago

In a sorted bam file, will the second mate always be directly after the corresponding first mate of a read-pair?

Not in general. In a name-sorted (id-sorted) file, however, the two mates of a pair should be adjacent as long as you keep only primary alignments. You can tell whether a record is read 1 or read 2 of its pair by examining the flag field (bits 0x40 and 0x80).
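To make the flag check concrete, here is a minimal sketch in plain Java that decodes the relevant bits from the integer flag value. The bit values are the ones defined in the SAM specification; the class and method names (`SamFlagBits`, `mateNumber`, and so on) are made up for illustration. With htsjdk you would normally not do the bit arithmetic yourself, since `SAMRecord` has accessors such as `getFirstOfPairFlag()`, `getReadUnmappedFlag()` and `isSecondaryOrSupplementary()`.

```java
/** Sketch: decode the SAM flag bits that matter for pairing (bit values from the SAM spec). */
final class SamFlagBits {
    static final int UNMAPPED       = 0x4;   // this read is unmapped
    static final int MATE_UNMAPPED  = 0x8;   // the mate is unmapped
    static final int FIRST_OF_PAIR  = 0x40;  // read 1
    static final int SECOND_OF_PAIR = 0x80;  // read 2
    static final int SECONDARY      = 0x100; // secondary alignment
    static final int SUPPLEMENTARY  = 0x800; // supplementary alignment

    static boolean isUnmapped(int flag) {
        return (flag & UNMAPPED) != 0;
    }

    static boolean isMateUnmapped(int flag) {
        return (flag & MATE_UNMAPPED) != 0;
    }

    /** A record is primary if it is neither secondary nor supplementary. */
    static boolean isPrimary(int flag) {
        return (flag & (SECONDARY | SUPPLEMENTARY)) == 0;
    }

    /** Returns 1 for read 1, 2 for read 2, and 0 if neither or both bits are set. */
    static int mateNumber(int flag) {
        boolean first = (flag & FIRST_OF_PAIR) != 0;
        boolean second = (flag & SECOND_OF_PAIR) != 0;
        if (first == second) {
            return 0;
        }
        return first ? 1 : 2;
    }

    public static void main(String[] args) {
        int[] examples = {99, 147, 355, 77}; // illustrative flag values only
        for (int flag : examples) {
            System.out.printf("flag=%d mate=%d primary=%b unmapped=%b%n",
                    flag, mateNumber(flag), isPrimary(flag), isUnmapped(flag));
        }
    }
}
```

For example, flag 99 decodes as a primary read 1, 147 as a primary read 2, and 355 (99 plus 0x100) as a secondary alignment of read 1.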

What if both mates are unmapped? Will they still be reported or simply ignored?

Unmapped reads are generally reported; some programs allow you to remove them if desired.
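Since whether unmapped mates are present depends on the program and its settings, it may be safer not to assume exactly two records per pair. The sketch below groups consecutive primary records by read name and lets you see how many records each name actually has. It is only an illustration under stated assumptions: `Rec` is a hypothetical stand-in for a BAM record holding just a name and a flag (with htsjdk you would use `SAMRecord` instead), and the input is assumed to be name-sorted.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: group name-sorted records by read name, keeping only primary alignments. */
final class PairGrouper {
    /** Hypothetical stand-in for a BAM record: read name plus SAM flag. */
    static final class Rec {
        final String name;
        final int flag;

        Rec(String name, int flag) {
            this.name = name;
            this.flag = flag;
        }
    }

    /** Assumes the input is name-sorted, so records sharing a name are consecutive. */
    static List<List<Rec>> groupPrimaryByName(List<Rec> sorted) {
        List<List<Rec>> groups = new ArrayList<>();
        List<Rec> current = new ArrayList<>();
        for (Rec r : sorted) {
            // Skip secondary (0x100) and supplementary (0x800) alignments.
            if ((r.flag & (0x100 | 0x800)) != 0) {
                continue;
            }
            // A new read name closes the current group.
            if (!current.isEmpty() && !current.get(0).name.equals(r.name)) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(r);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

A group of size 2 is a complete pair; a group of size 1 means one mate has no primary record in the file, and a simulated pair that never shows up as a group was not reported at all. Matching groups to the simulated list by name, rather than by position, keeps the comparison correct in all three cases.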

If you're doing a comparison of RNA-seq aligner accuracy, I encourage you to add BBMap as well!


Thanks for this information. One small thing is still bothering me: I have now removed all non-primary alignments and managed to identify the correct read pairs. I store the first record I encounter first and the second one after it. According to your explanation, read 1 could actually be the second record stored, and vice versa. Am I right? So, to do this properly, I still need to check the first-of-pair flag?


That's correct. They are guaranteed to be name-sorted, but that's all. Since read 1 and read 2 will have identical names, there's no way pure name-sorting can ensure that read 1 comes first.
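A minimal sketch of that check, assuming you already hold the two primary records of one pair: put them in read 1 / read 2 order based on the first-of-pair bit (0x40) rather than on the order in which they were encountered. `Rec` and `orderPair` are hypothetical names used only for this example; with htsjdk the same test would be `record.getFirstOfPairFlag()`.

```java
/** Sketch: order the two records of a pair as {read 1, read 2} using the flag, not file order. */
final class MateOrdering {
    /** Hypothetical stand-in for a BAM record: read name plus SAM flag. */
    static final class Rec {
        final String name;
        final int flag;

        Rec(String name, int flag) {
            this.name = name;
            this.flag = flag;
        }
    }

    static final int FIRST_OF_PAIR = 0x40; // bit value from the SAM spec

    static Rec[] orderPair(Rec a, Rec b) {
        if (!a.name.equals(b.name)) {
            throw new IllegalArgumentException("records are not mates: " + a.name + " vs " + b.name);
        }
        boolean aFirst = (a.flag & FIRST_OF_PAIR) != 0;
        boolean bFirst = (b.flag & FIRST_OF_PAIR) != 0;
        if (aFirst == bFirst) {
            // Both or neither claim to be read 1; something upstream went wrong.
            throw new IllegalStateException("ambiguous mate flags for " + a.name);
        }
        return aFirst ? new Rec[] {a, b} : new Rec[] {b, a};
    }
}
```

Calling `orderPair` on a pair stored as (flag 147, flag 99) returns the flag-99 record first, since 99 has the first-of-pair bit set and 147 does not.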


Ok thanks a lot, it's working fine now.
