Some TLENs are over 240 million base pairs. Does anyone know how reads can get that long?
Neither read is 240Mb in length (it'll be 100 or 150 or whatever length you sequenced to). The 240Mb is the template length inferred from the alignments of the two reads in the read pair. If the two reads in a pair align 240Mb from each other, TLEN is going to be 240Mb^. The actual sequenced DNA fragment (the template, in SAM specification terminology) is certainly not 240Mb in length. Possible reasons are:
The read alignment for one or both reads could be incorrect. Large genomes have many repetitive regions, and most aligners, when faced with many equally good alignment locations, will randomly choose one. Aligners typically report a mapping quality score (MAPQ) to indicate their level of confidence in the alignment. A low MAPQ indicates the aligner has low confidence that the alignment is correct.
E.g.: a fragment from a telomere could have one read aligned to each telomere of chr1. This would result in a TLEN of around 249Mb.
E.g.: the two reads from an L1HS LINE element could be aligned to different L1HS repeats in the reference.
Library preparation can generate a chimeric fragment, in which two unrelated DNA segments are joined to each other. This will result in the two reads aligning to different locations. If those locations are near the start and end of a large chromosome, you'll get a very large TLEN.
The rate of such errors depends on the sample and the library preparation used. For example, FFPE samples have a higher rate than fresh frozen.
The sample genome could contain a genomic rearrangement that brings two regions that are very far apart in the reference close together in the sample (e.g. a translocation from one end of a chromosome to the other). If the sequenced fragment straddles a breakpoint, the TLEN will be large, but the actual length of the fragment will be in line with the library fragment size distribution.
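A rough way to start telling these cases apart is to look at MAPQ for the records with a huge TLEN. The following is only a minimal sketch in plain Python: the function name `triage_record`, both thresholds, and the three SAM records are made up for illustration, and it only checks the MAPQ of the read on that line, not its mate.

```python
# Illustrative thresholds; both values are assumptions to tune per library.
MAX_EXPECTED_TLEN = 1000  # rough upper bound for a short-read fragment
MIN_MAPQ = 20             # below this, treat the alignment as unreliable

def triage_record(sam_line):
    """Label one tab-separated SAM alignment line by TLEN and MAPQ."""
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])  # column 5: MAPQ of this read
    tlen = int(fields[8])  # column 9: signed TLEN
    if abs(tlen) <= MAX_EXPECTED_TLEN:
        return "expected"
    if mapq < MIN_MAPQ:
        return "large TLEN, low MAPQ"
    return "large TLEN, confident alignment"

# Hypothetical records (SEQ and QUAL omitted as "*").
records = [
    "pairA\t97\tchr1\t10500\t0\t100M\t=\t248940000\t248929600\t*\t*",
    "pairB\t99\tchr1\t5000000\t60\t100M\t=\t5000250\t350\t*\t*",
    "pairC\t97\tchr1\t1200000\t60\t100M\t=\t240000000\t238800100\t*\t*",
]
for rec in records:
    print(rec.split("\t")[0], "->", triage_record(rec))
```

A pair like pairA (MAPQ 0) is most likely a repeat misalignment. A pair like pairC, where the alignment is confident, is a candidate chimeric fragment or rearrangement and needs more evidence, such as other read pairs supporting the same breakpoint.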
^ Whether TLEN includes the length of the read alignments depends on the aligner. Some do, some don't, and the SAM specifications themselves have changed, so there's never going to be consensus going forward.
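To put numbers on that footnote, here is a small sketch of two ways a template length could be computed from a pair of alignments. The function names and coordinates are hypothetical (1-based, inclusive), and neither function is meant to represent what any particular aligner does.

```python
def tlen_outer(start1, end1, start2, end2):
    """Span from the leftmost to the rightmost mapped base (includes both reads)."""
    return max(end1, end2) - min(start1, start2) + 1

def tlen_start_to_start(start1, start2):
    """Distance between the two leftmost aligned positions (excludes read lengths)."""
    return abs(start2 - start1)

# Hypothetical pair of 150 bp reads aligned about 240Mb apart on the same chromosome.
read1 = (10_001, 10_150)
read2 = (240_010_001, 240_010_150)

print(tlen_outer(*read1, *read2))                # 240000150
print(tlen_start_to_start(read1[0], read2[0]))   # 240000000
```

Either way the value is about 240Mb. The convention only shifts it by roughly a read length, which matters when you are looking at insert sizes of a few hundred bp but not here.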