Question: Fragment Size: TLEN vs. isize
4
gravatar for Rob
4.4 years ago by
Rob2.9k
United States
Rob2.9k wrote:

Hi,

  I'm in the process of adding (optional) support for using alignments to our tool for RNA-seq quantification, and I'm currently trying to model the fragment size distribution of aligned fragments in a set of reads.  I'm a bit confused about exactly where this information is stored in a bam/sam file.  According to the spec, the SAM file has a TLEN field that gives the "template length" of an alignment, which seems to me like it would be equivalent to the fragment length (except in the case of e.g. a chimeric alignment).  However, the API doesn't expose this field directly, and instead in SAM and BAM files parsed via samtools, you have access to an isize field, which, from various online sources seems to be the "insert size" of the aligned fragment.

  My question is how do these two fields relate?  Are they different?  Is the insert size the fragment size, the mate inner distance (frag. size - read lengths), or something else entirely?  Unfortunately, the documentation on this isn't too clear, so I'm reaching out to someone whose dealt with this in practice before.

Thanks!

Rob

sequencing alignment next-gen • 8.8k views
ADD COMMENTlink modified 4.4 years ago by Devon Ryan86k • written 4.4 years ago by Rob2.9k

I am really confused by this statement: "instead in SAM and BAM files parsed via samtools, you have access to an isize field,"

Where do you find isize in http://samtools.github.io/hts-specs/SAMv1.pdf ? I see it for block compression but that's it.

ADD REPLYlink written 4.4 years ago by Gabriel R.2.5k

Sure.  In the spec, they describe TLEN.  However, if you look at the Samtools API, the relevant information is in the bam1_core_t  structure.  Basically, I can't seem to find any mapping in the API for accessing the TLEN field, but a few people online seem to suggest that this is encoded in the isize field of the bam1_core_t structure.

ADD REPLYlink written 4.4 years ago by Rob2.9k
3
gravatar for Devon Ryan
4.4 years ago by
Devon Ryan86k
Freiburg, Germany
Devon Ryan86k wrote:

The TLEN value itself is actually isize. Yes, this can be a bit confusing.

Edit: I removed an incorrect part about TLEN.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Devon Ryan86k
1

The word "insert size" is one of the terms that is almost never defined yet people keep using it quite liberally. It is very frustrating indeed. For example note how even the bwa manual uses the word insert size, whereas the samtools spec uses the work template length to describe the 9th column of the SAM file.

The samtools definition on TLEN says:

If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. 

I have come to believe that isize is the same as template length and it corresponds to the length of the DNA fragment that was enclosed (inserted) between the sequencing adaptors.

No format or API should ever directly report any other measure that depends on the read lengths. The whole insert size concept that involves adding read lengths seems ill defined, perhaps a remnant of an older time when we did not know better. If reads were to overlap then the insert size would need to be negative ... 

that being said who knows what that API reports ...

 

 

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Istvan Albert ♦♦ 78k

Well, the API reports whatever was in the SAM or BAM file to begin with. I actually just edited by answer since I now recall that the whole "insert size plus read lengths" thing was specific to one aligner (can't remember which at the moment). Since the API uses isize, the spec should just be changed to match that (I'll submit a pull request if I have time).

ADD REPLYlink written 4.4 years ago by Devon Ryan86k

Thanks for the quick and clear answer, Devon!

ADD REPLYlink written 4.4 years ago by Rob2.9k

I am trying to calculate the length fragments from a given BAM file and I still find this confusing. The current SAM spec says:

TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign.

To me, this clearly reads fragment length. But if I understand your answer, TLEN is not fragment length. Maybe I should open a new question?

ADD REPLYlink written 3.1 years ago by Henrik Mannerström0

It'll depend on whether the alignments are spliced or not. If you're mapping DNA sequences, then the fragment length and "insert size"/"template length" are the same. For RNAseq data this is often not the case. What TLEN reports is the distance from the 5'-most to 3'-most position.

ADD REPLYlink written 3.1 years ago by Devon Ryan86k

Thank you for your quick reply! How should I interpret this BAM, which should be mapped DNA sequences:

CHM040_CHM053.5455421:1 65      chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0
CHM040_CHM053.5455421:1 129     chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0

What I see here is a fragment of 20bp. However, when I run this through samtools fixmate it comes out as

CHM040_CHM053.5455421:1 65      chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0  MQ:i:255
CHM040_CHM053.5455421:1 129     chr10   38804303        255     20M     =       38804303        0       GAATGGAATCAACTCGATTG    *       NM:i:0  MQ:i:255

Which I take as samtools (Version: 1.2, using htslib 1.2.1) reporting zero length fragment.

So TLEN is not fragment length?

 

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Henrik Mannerström0

Samtools reports whatever the aligner told it, so the question is, "Why did my aligner think that the fragment length is 0". The answer is that the alignment took place in a biologically impossible way, so there's no meaningful way to represent the impossibility.

BTW, samtools fixmate will set TLEN to 0 for these biologically impossible situations.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Devon Ryan86k

First, I am really grateful for you taking time to explain this!  My next question is: What should a 20bp PET alignment look like then?  To clarify, I am in the belief that the reported positions in the SAM file are leftmost positions, so when the fragment length is equal to the read length, a PET might actually get the same position for both mates. BTW, the BAM public data (GSE33664). PS, this margin is getting smaller and smaller, we could switch to a new thread if you like.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Henrik Mannerström0

The width won't shrink much more.

A 20 base fragment would be:

CHM040_CHM053.5455421:1 99      chr10   38804303        255     20M     =       38804303        20       GAATGGAATCAACTCGATTG    *       NM:i:0
CHM040_CHM053.5455421:1 147     chr10   38804303        255     20M     =       38804303        20       GAATGGAATCAACTCGATTG    *       NM:i:0

Note the flags.

ADD REPLYlink written 3.1 years ago by Devon Ryan86k

OK, thanks! I see that SAM expects one of the mates to be reverse-complemented, and since this is ChIA-PET data, and for whatever other reasons, this is not so.

ADD REPLYlink written 3.1 years ago by Henrik Mannerström0

You aligner expects that, the SAM format itself doesn't really care (though samtools fixmate will only fix TLEN if this is the case). You'll need to explicitly tell your aligner what constitutes a biologically valid alignment is the mates aren't pointed toward each other.

ADD REPLYlink written 3.1 years ago by Devon Ryan86k

The reason it is reported as zero is that, from the alignment alone we don't know how long the template length actually is.

|-------->
   |--------->

Your situation is reported like the image above. We know where the template started, we know that it must be  equal or longer than the end of the read that extends further but we don't know how long the template actually is. In the most general case only the 5' ends represent fragment starts, the 3' ends are instrument dependent and terminate sooner than the fragment.

So the aligner resorts to reporting it as zero (or something else) but  when both reads map to the same strand no matter what it reports it will likely not be correct in the most general case. In your case you have to write a custom tool to extract the proper template lengths based on information specific to the experiment.

 

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Istvan Albert ♦♦ 78k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1429 users visited in the last hour