Does anybody know how a read can be flagged as both "first in pair" and "second in pair" when the flag also means "read mapped in proper pair"? (For example, see the Picard tool's explanation for flags 195, 211, 227 or 243.)
There is no 'first in pair' or 'second in pair', only 'first segment in template' and 'last segment in template'. Although no aligner I've ever used sets these flags for single-end reads, I see no reason why one couldn't, particularly if some sort of merging tool was used that combines overlapping pairs into a single read. EDIT: There's also this, straight from the SAM/BAM spec:
• If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing
Also, you can't trust 'proper pair' (which also doesn't exist; it's "properly aligned"), secondary alignment, or supplementary alignment if unmapped is also set, so you should check for that too.
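To see what the flags in the question actually contain, here is a minimal sketch that decodes a FLAG value into its set bits. The bit values and wording follow the FLAG table in the SAM spec; the function names `decode_flag` and `is_first_and_last` are just illustrative helpers, not part of any library.

```python
# Bit values and descriptions as listed in the SAM specification's FLAG table.
FLAG_BITS = [
    (0x1, "template has multiple segments"),
    (0x2, "each segment properly aligned"),
    (0x4, "segment unmapped"),
    (0x8, "next segment unmapped"),
    (0x10, "SEQ reverse complemented"),
    (0x20, "SEQ of next segment reverse complemented"),
    (0x40, "first segment in template"),
    (0x80, "last segment in template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing filters"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of every bit set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def is_first_and_last(flag):
    """True when both 0x40 and 0x80 are set (the case asked about)."""
    return bool(flag & 0x40) and bool(flag & 0x80)


if __name__ == "__main__":
    for flag in (195, 211, 227, 243):
        print(flag, hex(flag), is_first_and_last(flag), decode_flag(flag))
```

All four flags from the question have 0x40 and 0x80 set together, and they differ only in the two strand bits (0x10 and 0x20).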
The BAM format was written to be future-proof, at a time when the future was unclear. It was perfectly plausible that all sorts of sequencing technologies could be invented in which large fragments get sequenced in many spots, so the spec tries to stay away from 'paired' terminology as much as possible. However, it seems that multiple-reads-per-fragment sequencing is not likely to ever happen, and we are more likely to go down the path of a few really, really long reads. For this reason, there is a split between what the strict definitions set out in the spec say (which is also what the aligners/tools are most likely to follow) and the practical application of the specification that bioinformaticians practice. It upsets me that a read can be "on chromosome 1" but also "unmapped", because that makes no logical sense. However, it's part of the SAM spec, and if you're not aware of it, it will come back to bite you. I think, if you're going to work with SAM/BAM files, you really need to be aware of this difference between your intuition/expectations and what the spec actually says; otherwise you'll have errors that you won't catch, because no tool will tell you that you're doing something wrong on a spec-compliant BAM file. Well... no tool except Picard. Picard hates everything. :P
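The "on chromosome 1 but unmapped" case can be checked directly on SAM text. This is a rough sketch that assumes tab-separated SAM records; `placed_but_unmapped` is a hypothetical helper name and the example records are made up for illustration.

```python
def placed_but_unmapped(sam_line):
    """True if a SAM record has the unmapped bit (0x4) set but still
    carries a reference name and position in RNAME/POS."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
    return bool(flag & 0x4) and rname != "*" and pos > 0


if __name__ == "__main__":
    # Made-up records: an unmapped read placed at its mate's coordinates,
    # its mapped mate, and a read with no coordinates at all.
    examples = [
        "readA\t69\tchr1\t1000\t0\t*\t=\t1000\t0\tACGT\tIIII",
        "readA\t137\tchr1\t1000\t60\t4M\t=\t1000\t0\tACGT\tIIII",
        "readB\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for line in examples:
        print(line.split("\t")[0], line.split("\t")[1], placed_but_unmapped(line))
```

For records where this returns True, only the 0x4 bit says whether the read is mapped; RNAME and POS are there for sorting next to the mate, not as evidence of an alignment.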
READ1 and READ2 may mean different things depending on the library preparation and methodology.
In Illumina sequencing they refer to the order in which the fragment is sequenced, and that means a separation in both space and time. The instrument first produces reads that get placed in the first file; these will be marked READ1 after alignment. Then, some time later, once the READ1 data is complete, the fragments attached to the flowcell get complemented, "bend over" to neighboring spots, and are sequenced again as if it were a new run. This data will go into a second file and will be labeled READ2 in the SAM file.
Hence having a read marked as both READ1 and READ2 at the same time is incorrect under the "normal" definition. There is probably a story behind why these flags are set this way: someone had to run some tool that would only work if ... fill in the blanks ... and the solution was to set the flags to incorrect values.
Can you share more information about the data? A few lines of the BAM as examples would help too. You can share the flag, chromosome, position, and CIGAR value to give an idea of the alignment.
- What kind of library prep is it? (WGS, WXS, RNA)
- What sequencing platform was used?
- What reference are the reads aligned to?
- How were the reads aligned? (check header)
- Were the read alignments refined after the original mapping? (check header)
- Are you sure there are exactly 2 reads in these pairs with the odd flag values? Make sure there aren't only 1, or as many as 3, records with these read IDs.
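The last check in the list can be sketched as follows, assuming you have SAM text lines to hand (for example the output of `samtools view`). `records_for_odd_names` is a hypothetical helper and the sample records are shortened, made-up stand-ins for real SAM lines; it keeps everything in memory, so it is meant for a small extract rather than a whole file.

```python
from collections import defaultdict


def records_for_odd_names(sam_lines):
    """Group FLAG values by read name and keep only the names that have at
    least one record with both 0x40 and 0x80 set."""
    flags_by_name = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flags_by_name[fields[0]].append(int(fields[1]))
    return {
        name: flags
        for name, flags in flags_by_name.items()
        if any(f & 0x40 and f & 0x80 for f in flags)
    }


if __name__ == "__main__":
    # Made-up, truncated records: only QNAME and FLAG matter here.
    sample = [
        "@HD\tVN:1.6",
        "r1\t195\tchr1\t100",
        "r1\t195\tchr1\t300",
        "r2\t99\tchr1\t500",
        "r2\t147\tchr1\t700",
        "r3\t243\tchr2\t100",
    ]
    for name, flags in records_for_odd_names(sample).items():
        print(name, len(flags), flags)
```

A name that shows up with a count other than 2 is worth a closer look, keeping in mind that secondary (0x100) and supplementary (0x800) records also add to the count.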