TLEN field in SAM format
3.6 years ago
kimsumin94 ▴ 20

I am watching https://www.coursera.org/learn/genomic-tools/home/week/3, and I came across the following example sam file:

141217_CIDR4_0073_BHCFG7ADXX:2:1111:3128:29074    99    chr    10021    0    50M    =    10151    180 ...


I have a question about the 9th column, TLEN. The start position of the read above is 10021 and the start position of the mate is 10151. Then the length between the two is 10151-10021+1=131.

QUESTION1: Am I correct? Is this position 0-based?

However, TLEN, which seems to be the insert size, is 180. Why is it like this? Also, in samtools spec, I've found 7.RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template.

QUESTION2: What does template mean in this case? Does the template mean the set of two reads that are paired (i.e. a paired-end read)? Can there be more than 2 reads in the template? If so, why?

QUESTION3: Does the next read in the template mean the mate of the read?

And also, I've found 9.TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.

QUESTION4: What is the difference between SIGNED and UNSIGNED observed template length? Could you give me the two length for the above example?

QUESTION5: What does "segments in the middle" mean? Is the sign of segments related to the SIGNED template length?

QUESTION6: It says that the leftmost segment has a plus sign and the rightmost has a minus sign. However, in the example above, I have an optional field XS:A:-, which means the given strand is -. Isn't it the leftmost segment though?

There are 6 questions in total. They may be basic questions since I am new to this field. Thank you very much.

Also, for the flag field, why is each bit represented as follows? (0X800)(0X400)(0X200)(0X100) (0X80)(0X40)(0X20)(0X10) (0X8)(0X4)(0X2)(0X1) It seems to be related to hex, but I don't completely understand. Thank you.

alignment • 5.0k views

I've added some structure and highlighting to your question. Please do that in the future as well to improve the readability and overall impression of your question. You'll see that this increases your chance of good responses.

3.6 years ago

Hello kimsumin94 ,

The start position of the read above is 10021 and the start position of the mate is 10151. Then the length between the two is 10151-10021+1=131.

QUESTION1: Am I correct? Is this position 0-based?

Positions in the SAM format are 1-based. If you access a BAM file directly, they are 0-based.

From the specs:

**1-based coordinate system**
A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3,7]. The **SAM**, VCF, GFF and Wiggle formats are using the 1-based coordinate system.

**0-based coordinate system**
A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2,7). The **BAM**, BCFv2, BED, and PSL formats are using the 0-based coordinate system.
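
To make the two conventions concrete, here is a small sketch of converting between them. The function names are made up for illustration; they are not part of samtools or any library.

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert a 1-based closed interval [start, end] (SAM-style)
    to a 0-based half-open interval [start, end) (BAM/BED-style)."""
    # Only the start shifts by one; the end value stays numerically the same
    # because "inclusive, 1-based" and "exclusive, 0-based" cancel out.
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start, end):
    """Inverse conversion: 0-based half-open back to 1-based closed."""
    return start + 1, end


# The example from the spec: bases 3 to 7 inclusive.
sam_style = (3, 7)
bed_style = one_based_closed_to_zero_based_half_open(*sam_style)
print(bed_style)  # (2, 7)
```

Note that the number of bases covered is `end - start + 1` in the 1-based closed form and `end - start` in the 0-based half-open form; both give 5 here.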


However, TLEN, which seems to be the insert size, is 180. Why is it like this?

The two reads of a pair have opposite orientations: one was sequenced from the + strand and one from the - strand. In the SAM file, however, all information is given relative to the + strand, running from the 5' end to the 3' end. The read whose information had to be flipped gets a flag (0x10) saying so.
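
If you want to see which bits are set in the FLAG of your example (99), a minimal sketch is below. The bit values are the ones from the SAM spec; the short names and the `decode_flag` helper are my own shorthand, not an official API.

```python
# FLAG bits as defined in the SAM spec; the labels are informal shorthand.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "last_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
```

So 99 = 0x1 + 0x2 + 0x20 + 0x40: this read is on the forward strand and its mate is the one that was reverse-complemented. Each bit is written in hex simply because every flag is a power of two, and the FLAG value is their sum.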

This is how the two reads look "in vivo":

                           (R2)3'<----------5'
5'----------->3'(R1)
5'------------------------------------------- 3' (RefSeq)


And this how the information in the sam file look like:

                               5'---------->3'(R2)
5'----------->3'(R1)
5'------------------------------------------- 3' (RefSeq)


The start position given in SAM is the leftmost mapped base of the read, which is the 5' end in the representation above. To get the length of the fragment these reads come from, you need the distance from the leftmost 5' position to the rightmost 3' position. The first you get from the forward read by just taking its start position from the SAM file. The second you get from the reverse read by adding the read length to its start position. So in your example (assuming the mate is also 50 bases long) you end up with (10151+50)−10021 = 180.
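
The same arithmetic as a short sketch, which also shows the signed versus unsigned value. This is only an illustration, not how any aligner actually computes TLEN: `observed_template_length` is a made-up helper, it assumes both reads align without indels or clipping (so each read covers exactly its length on the reference), and it assumes the mate is 50M like the read shown, which the truncated line does not confirm.

```python
def observed_template_length(pos, read_len, mate_pos, mate_len):
    """Signed TLEN for the read at `pos`, using 1-based SAM positions.

    Assumes plain matches only (e.g. 50M), so the last reference base
    covered by a read is pos + read_len - 1.
    """
    leftmost = min(pos, mate_pos)
    rightmost = max(pos + read_len - 1, mate_pos + mate_len - 1)
    unsigned = rightmost - leftmost + 1
    # Leftmost segment gets a plus sign, rightmost a minus sign.
    # (When both start at the same position the sign is a tie-break
    # convention; this sketch just returns the positive value.)
    return unsigned if pos <= mate_pos else -unsigned


# Read from the question: POS=10021, 50M, mate at 10151 (mate length assumed 50).
print(observed_template_length(10021, 50, 10151, 50))   # 180
# The same pair seen from the mate's record:
print(observed_template_length(10151, 50, 10021, 50))   # -180
```

The unsigned length is 180 for both records; only the sign differs depending on which read of the pair is leftmost.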

QUESTION2: What does template mean in this case? Does the template mean the set of two reads that are paired (i.e. a paired end read). Can there be more than 2 read in the template? If so, why?

The template is just the DNA fragment that was used during sequencing for the read pair. IMO "insert" is another term often used for it. In paired-end sequencing there are exactly two reads for one template.

Again from the specs:

Template
A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from
raw sequences.


QUESTION3: Does the next read in the template mean the mate of the read?

Yes.

fin swimmer