Question

Many TLENs in a SAM file seem much too big

3.4 years ago

langziv ▴ 70

Hello.

I have a SAM file from an alignment of a human sample against an NCBI reference genome. I took the TLENs (the 9th field in each SAM file entry) from the SAM file. Some TLENs are over 240 million base pairs. Does anyone know how reads can get that long? Maybe a polymerase in some of the sequencing elongation steps managed to extend a DNA molecule that far just by chance.

Thanks.

NGS SAM comperative-genomics alignment • 1.3k views

ADD COMMENT • link updated 3.3 years ago by d-cameron ★ 2.9k • written 3.4 years ago by langziv ▴ 70

Hello,

I would guess that for those reads with such huge TLEN values the mapping is incorrect. What's the mapping quality for them?

fin swimmer

ADD REPLY • link 3.4 years ago by finswimmer 16k

score 1 · Answer 1 · 2021-07-09

Some TLENs are over 240 million base pairs. Does anyone know how can reads get that long?

Neither read is 240Mb in length (it'll be 100 or 150 or whatever length you sequenced to). The 240Mb is the inferred template length, based on where the two reads in the read pair align. If the reads in a read pair align 240Mb from each other, TLEN is going to be 240Mb^. The actual sequenced DNA fragment (template in SAM specification terminology) is certainly not 240Mb in length. Possible reasons are:

Incorrect alignment

The alignment of one or both reads could be incorrect. Large genomes have many repetitive regions and most aligners, when faced with many possible alignment locations, will randomly choose one. Aligners will typically report a mapping quality score (MAPQ) to indicate their level of confidence in the alignment. A low MAPQ indicates the aligner has low confidence that the alignment is correct.

E.g.: a fragment from a telomere could have one read aligned to each telomere of chr1. This would result in a TLEN of around 249Mb.

E.g.: the two reads from an L1HS LINE element could be aligned to different L1HS repeats in the reference.
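To check whether the huge TLENs go together with low mapping quality, one option is to pull out the records with a large absolute TLEN and look at their MAPQ. The following is only a minimal sketch in plain Python: the function name `large_tlen_records`, the 1 Mb cutoff, and the two example records are made up for illustration and are not from the original file.

```python
def large_tlen_records(sam_lines, min_abs_tlen=1_000_000):
    """Yield (qname, rname, pos, mapq, tlen) for records with |TLEN| >= min_abs_tlen.

    The 1 Mb default cutoff is an arbitrary assumption; pick one that is
    well above your library's expected fragment size.
    """
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory fields
            continue
        qname, rname = fields[0], fields[2]
        pos, mapq, tlen = int(fields[3]), int(fields[4]), int(fields[8])
        if abs(tlen) >= min_abs_tlen:     # TLEN is signed, so compare the magnitude
            yield qname, rname, pos, mapq, tlen


# Hypothetical records for illustration only.
records = [
    "@SQ\tSN:chr1\tLN:248956422",
    "pairA\t97\tchr1\t10000\t0\t150M\t=\t248940000\t248930150\t*\t*",
    "pairB\t99\tchr1\t5000000\t60\t150M\t=\t5000300\t450\t*\t*",
]

for rec in large_tlen_records(records):
    print(rec)  # ('pairA', 'chr1', 10000, 0, 248930150)
```

For a real file, the same function can be fed an open file handle instead of the list. If most of the flagged records have a MAPQ at or near 0, incorrect alignment in repeats is the likely explanation.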

Library preparation

Library preparation can generate a chimeric fragment, in which two DNA segments from different locations are joined to each other. This will result in the two reads aligning to different locations. If those locations are near the start and end of a large chromosome, you'll get a very large TLEN.

The rate of such errors depends on the sample library preparation used. For example, FFPE samples have a higher rate than fresh frozen.

Structural variation

The sample genome could contain a genome rearrangement causing two regions that are very far apart in the reference to be close together in the sample (e.g. a translocation from one end of a chromosome to the other). If the sequenced fragment straddles a breakpoint, the TLEN will be large, but the actual length of the sequenced fragment would be in line with the library fragment size distribution.

^ Whether TLEN includes the length of the read alignments depends on the aligner. Some do, some don't, and the SAM specifications themselves have changed, so there's never going to be consensus going forward.
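To make the footnote concrete, here is a rough sketch of two ways a template length could be derived from the alignment positions of a read pair: the span between the outermost aligned bases, and the plain distance between the two alignment starts. The function names and coordinates are hypothetical, and neither function is claimed to match what any particular aligner reports. Either way, two 150bp reads aligned about 240Mb apart give a value of about 240Mb.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_end(pos, cigar):
    """1-based inclusive reference coordinate of the last aligned base."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def outer_span(pos1, cigar1, pos2, cigar2):
    """Leftmost to rightmost aligned base of the pair (includes the read alignments)."""
    start = min(pos1, pos2)
    end = max(reference_end(pos1, cigar1), reference_end(pos2, cigar2))
    return end - start + 1


def start_to_start(pos1, pos2):
    """Distance between the two alignment start positions only."""
    return abs(pos2 - pos1)


# Hypothetical pair: two 150bp reads aligned to the same chromosome, far apart.
print(outer_span(10_000, "150M", 240_010_000, "150M"))  # 240000150
print(start_to_start(10_000, 240_010_000))              # 240000000
```

The two conventions differ by only a read length or so, which is negligible next to 240Mb, so the choice of convention cannot explain TLENs of this size.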