Question

Evaluation of simulated paired-read data: How are records stored in BAM files?

0

7.3 years ago

m.picciani • 0

Given a list of simulated read pairs and the BAM output of STAR and HISAT, sorted by read ID, I would like to evaluate the mapping quality by assigning each pair to one of the following categories: wrong chromosome, partially correct, or perfectly mapped. Only the reported primary alignments should be considered. I would like to do this with htsjdk in Java.

I need help with the following questions, as I do not completely understand the BAM file documentation.

- In a sorted BAM file, will the second mate always be directly after the corresponding first mate of a read pair?

- What if one of those mates is unmapped? Will it still be stored, or could it happen that there is no record at all?

- What if both mates are unmapped? Will they still be reported or simply ignored?

So the general idea would be to read through the list of read pairs in order, without checking the IDs, and compare each pair with the corresponding records in the sorted BAM. That would only be possible if every read pair always gets at least one record for each mate, even if one or both mates are unmapped. So does this work?

Thanks for all suggestions.

alignment rna-seq • 1.2k views

score 1 · Answer 1 · 2017-01-14

1

7.3 years ago

Brian Bushnell 20k

In a sorted bam file, will the second mate always be directly after the corresponding first mate of a read-pair?

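Not in general. In a name-sorted file, though, the two mates of a pair should be adjacent if you only include primary alignments, that is, if you drop secondary (flag 0x100) and supplementary (flag 0x800) records. You can tell whether a record is read 1 or read 2 of its pair by examining the flag field (bits 0x40 and 0x80).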
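To make the flag checks concrete, here is a minimal sketch in plain Java that decodes the relevant bits of the SAM flag field. The bit values come from the SAM specification; the class name `SamFlags` and its method names are made up for illustration. With htsjdk you would normally not decode the bits yourself: as far as I know, `SAMRecord` offers accessors such as `getFirstOfPairFlag()`, `getReadUnmappedFlag()` and `isSecondaryOrSupplementary()`, but check the Javadoc of your htsjdk version.

```java
// Hypothetical helper, not part of htsjdk: decodes SAM flag bits by hand.
final class SamFlags {
    static final int PAIRED         = 0x1;   // template has multiple segments
    static final int UNMAPPED       = 0x4;   // this read is unmapped
    static final int MATE_UNMAPPED  = 0x8;   // the mate is unmapped
    static final int FIRST_OF_PAIR  = 0x40;  // read 1
    static final int SECOND_OF_PAIR = 0x80;  // read 2
    static final int SECONDARY      = 0x100; // secondary alignment
    static final int SUPPLEMENTARY  = 0x800; // supplementary alignment

    private SamFlags() {}

    // Primary records carry neither the secondary nor the supplementary bit.
    static boolean isPrimary(int flag) {
        return (flag & (SECONDARY | SUPPLEMENTARY)) == 0;
    }

    // The read 1 / read 2 bits are only meaningful when the paired bit is set.
    static boolean isFirstOfPair(int flag) {
        return (flag & PAIRED) != 0 && (flag & FIRST_OF_PAIR) != 0;
    }

    static boolean isSecondOfPair(int flag) {
        return (flag & PAIRED) != 0 && (flag & SECOND_OF_PAIR) != 0;
    }

    static boolean isUnmapped(int flag) {
        return (flag & UNMAPPED) != 0;
    }
}
```

For example, flag 99 (0x63) is a mapped, primary read 1, flag 147 (0x93) is a mapped, primary read 2, and flag 77 (0x4D) is an unmapped read 1 whose mate is also unmapped.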

What if both mates are unmapped? Will they still be reported or simply ignored?

Unmapped reads are generally reported; some programs allow you to remove them if desired.

If you're doing a comparison of RNA-seq aligner accuracy, I encourage you to add BBMap as well!

0

Thanks for this information. One little thing is still bothering me: I have now removed all non-primary alignments and managed to determine the correct read pairs. I store the first record I encounter first and the second one after it. According to your explanation, read 1 could actually be the second record stored, and vice versa. Am I right? So, to do this correctly, I still need to check the first-of-pair flag?

ADD REPLY • link 7.3 years ago by m.picciani • 0

0

That's correct. They are guaranteed to be name-sorted, but that's all. Since read 1 and read 2 will have identical names, there's no way pure name-sorting can ensure that read 1 comes first.
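As a rough sketch of what that means in code: walk the name-sorted records two at a time after dropping non-primary ones, and use the first-of-pair bit rather than the storage order to decide which record is read 1. Everything here (`MatePairing`, `Rec`, `pairUp`) is a hypothetical stand-in written in plain Java; in a real program the records would come from htsjdk and `Rec` would be replaced by its record type. The sketch also assumes that every pair has exactly one primary record per mate, which is worth verifying for your aligner output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not htsjdk code.
final class MatePairing {
    static final int FIRST_OF_PAIR = 0x40;          // read 1
    static final int NOT_PRIMARY   = 0x100 | 0x800; // secondary or supplementary

    // Minimal stand-in for an alignment record: read name plus SAM flag.
    static final class Rec {
        final String name;
        final int flag;
        Rec(String name, int flag) { this.name = name; this.flag = flag; }
    }

    // Returns one {read1, read2} array per pair, oriented by flag, not by file order.
    static List<Rec[]> pairUp(List<Rec> nameSorted) {
        List<Rec> primary = new ArrayList<>();
        for (Rec r : nameSorted) {
            if ((r.flag & NOT_PRIMARY) == 0) primary.add(r);
        }
        if (primary.size() % 2 != 0) {
            throw new IllegalStateException("odd number of primary records");
        }
        List<Rec[]> pairs = new ArrayList<>();
        for (int i = 0; i < primary.size(); i += 2) {
            Rec a = primary.get(i);
            Rec b = primary.get(i + 1);
            if (!a.name.equals(b.name)) {
                throw new IllegalStateException("mates not adjacent: " + a.name + " / " + b.name);
            }
            boolean aFirst = (a.flag & FIRST_OF_PAIR) != 0;
            boolean bFirst = (b.flag & FIRST_OF_PAIR) != 0;
            if (aFirst == bFirst) {
                throw new IllegalStateException("expected one read 1 and one read 2: " + a.name);
            }
            pairs.add(aFirst ? new Rec[] {a, b} : new Rec[] {b, a});
        }
        return pairs;
    }
}
```

The exceptions are deliberate: if the one-record-per-mate assumption does not hold for some pair, it is better to find out immediately than to silently compare the wrong records against the simulation.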