Question: weird SAM flag explanation
lilach.kornblitno40 wrote:

Does anybody know how it is possible for a read to be flagged as both "first in pair" and "second in pair" when the flag also means "read mapped in proper pair"? (For example, see the Picard tool explanation for flags 195, 211, 227, or 243.)

sam • 1.6k views
modified 3.1 years ago by karsten.sieber10 • written 4.8 years ago by lilach.kornblitno40

Could you provide a few example lines from the SAM file? This sounds very odd.

written 4.8 years ago by Ian5.4k

I think it should not be possible for a read to have such flags. Where did you get the SAM file from?

written 4.8 years ago by Martombo2.4k

Thanks for the reply. It looks odd, but I got the original BAM file from the TCGA database.

I'm trying to understand whether there's a logical explanation for this before I assume it's a data integrity issue.

written 4.8 years ago by lilach.kornblitno40

Is there any update on this post, please? I have seen reads with the 195 flag in a TCGA dataset too, and I am a bit confused.

modified 3.2 years ago • written 3.2 years ago by -_-810
John12k wrote:

There is no 'first in pair' or 'second in pair', only 'first segment in template' and 'last segment in template'. Although no aligner I've ever used sets these flags for single-end reads, I see no reason why one couldn't, particularly if some sort of merging tool was used that combines overlapping pairs into a single read. EDIT: There's also this, straight from the SAM/BAM spec:

• If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing
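
To see what the flags from the question actually decode to, here is a minimal Python sketch. The bit descriptions paraphrase the FLAG table in the SAM spec; the names `decode_flag` and `segment_position` are made up for this illustration and don't come from any library.

```python
# Bit meanings paraphrased from the SAM specification's FLAG table.
SAM_FLAG_BITS = {
    0x1: "template has multiple segments",
    0x2: "each segment properly aligned",
    0x4: "segment unmapped",
    0x8: "next segment unmapped",
    0x10: "SEQ reverse complemented",
    0x20: "SEQ of next segment reverse complemented",
    0x40: "first segment in template",
    0x80: "last segment in template",
    0x100: "secondary alignment",
    0x200: "not passing filters",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def segment_position(flag):
    """Interpret 0x40/0x80 the way the spec passage quoted above describes."""
    first = bool(flag & 0x40)
    last = bool(flag & 0x80)
    if first and last:
        return "middle"   # part of a linear template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"      # index in the template is unknown


# The four flags mentioned in the question.
for flag in (195, 211, 227, 243):
    print(flag, segment_position(flag), decode_flag(flag))
```

All four flags from the question share bits 0x1, 0x2, 0x40 and 0x80; they differ only in the strand bits 0x10 and 0x20. Under the spec wording, that reads as "neither the first nor the last read of a linear template", not as a contradiction.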

Also, you can't trust 'proper pair' (which also doesn't exist; the spec calls it "each segment properly aligned according to the aligner"), secondary alignment, or supplementary alignment if the unmapped flag (0x4) is also set, so you should check that too.
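
The unmapped caveat can be made explicit in code. This is only a rough sketch of one way to handle it, by masking out the bits that can't be relied on once 0x4 is set; `interpretable_flag` is a hypothetical helper name, not something from samtools or Picard.

```python
UNMAPPED = 0x4
# Bits that carry no reliable meaning when the unmapped bit is set:
# properly aligned (0x2), secondary (0x100), supplementary (0x800).
UNRELIABLE_IF_UNMAPPED = 0x2 | 0x100 | 0x800


def interpretable_flag(flag):
    """Clear the bits that cannot be trusted when 0x4 (unmapped) is set."""
    if flag & UNMAPPED:
        return flag & ~UNRELIABLE_IF_UNMAPPED
    return flag


# 71 = 0x1 + 0x2 + 0x4 + 0x40: "properly aligned" and "unmapped" together.
print(interpretable_flag(71))   # the 0x2 bit is dropped
# 195 has no unmapped bit, so it is returned unchanged.
print(interpretable_flag(195))
```

None of the flags in the question (195, 211, 227, 243) has 0x4 set, so this caveat alone doesn't explain them, but it is worth ruling out.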

The BAM format was written to be future-proof, at a time when the future was unclear. It was perfectly plausible that all sorts of sequencing technologies could be invented where large fragments get sequenced in many spots, so the spec tries to stay away from 'paired' terminology as much as possible. However, it seems that multiple-reads-per-fragment sequencing is not likely to ever happen, and we are more likely to go down the path of a few really, really long reads. For this reason, there is a split between the strict definitions set out in the spec, which are also what aligners and tools are most likely to follow, and the practical application of the specification by bioinformaticians.

It upsets me that a read can be "on chromosome 1" but also "unmapped", because that makes no logical sense. However, it's part of the SAM spec, and if you're not aware of it, it will come back to bite you. If you're going to work with SAM/BAM files, you really need to be aware of this difference between your intuition and what the spec actually says; otherwise you'll have errors that you won't catch, because no tool will tell you that you're doing something wrong on a spec-compliant BAM file. Well... no tool except Picard. Picard hates everything. :P

modified 3.1 years ago • written 3.2 years ago by John12k
Istvan Albert ♦♦ 80k wrote:

The READ1 and READ2 flags may mean different things depending on the library preparation and methodology.

In Illumina sequencing they refer to the order in which the fragment is sequenced, which means a separation in both space and time. The instrument first produces reads that get placed in the first file; these will be marked READ1 after alignment. Then, some time later, once the READ1 data is complete, the fragments attached to the flowcell get complemented, "bend over" to neighboring spots, and are sequenced again as if it were a new run. This data goes into a second file and will be labeled READ2 in the SAM file.

Hence having a read marked as both READ1 and READ2 at the same time is incorrect under the "normal" definition. There is probably a story behind why these flags are set this way: someone had to run some tool that would only work if ... fill in the blanks ... and the solution was to set the flags to incorrect values.

modified 3.2 years ago • written 3.2 years ago by Istvan Albert ♦♦ 80k
karsten.sieber10 wrote:

Can you share more information about the data? A few lines of the BAM as examples would help too. You can share the flag, chromosome, position, and CIGAR value to help give an idea of the alignment.

  • What kind of library prep is it? (WGS, WXS, RNA)
  • What sequencing platform was used?
  • What reference are the reads aligned to?
  • How were the reads aligned? (check header)
  • Were the read alignments refined after the original mapping? (check header)
  • Are you sure there are exactly 2 reads in these pairs with the odd flag values? Make sure there aren't only 1 or 3 of these read IDs.
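
For the last check, a small Python sketch along these lines could count how many records per read ID have both 0x40 and 0x80 set. It assumes plain SAM text as input (for example, the text output of `samtools view`); the function name `count_both_set` and the three sample records are invented for illustration and are not taken from any real file.

```python
from collections import Counter


def count_both_set(sam_lines):
    """Count, per read name, the records with both 0x40 and 0x80 set."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        qname, flag = fields[0], int(fields[1])
        if (flag & 0x40) and (flag & 0x80):
            counts[qname] += 1
    return counts


# Invented sample records, only to show the shape of the input.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA\t195\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "readA\t211\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "readB\t195\tchr1\t500\t60\t50M\t=\t700\t250\t*\t*",
    "readC\t99\tchr1\t900\t60\t50M\t=\t1100\t250\t*\t*",
]
print(count_both_set(example))
```

Read IDs whose count is not exactly 2 would be the first ones to look at more closely.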
written 3.1 years ago by karsten.sieber10