SAM file arrangement
12 weeks ago
aenna_p • 0

Hello,

I have a question regarding the length information of reads obtained from BAM files. I have converted BAM files into BED files and kept the read sequence. So, it looks something like this:

Chr 6791    7891    TCGAATATCAGGGTGCCCTCTGGCAAGGGCTTGCCCAGCGTACGTCAC    -
Chr 6966    7304    ATTGATGAGGGATGTGGGTGGATGGATGATGATGGAAATATGATATGC    +


I always assumed that columns 2 and 3 give the start and end positions of the read alignment, so column 3 - column 2 would be the read length. However, if I count the characters in the DNA string (column 4) with the nchar() function in R, I get a different value.
Can anyone explain what I am missing?

Thank you!

BAM BED • 303 views
12 weeks ago
ATpoint 55k

Alignment length != read length. Reads might get soft-clipped, and parts of the read might align elsewhere, depending on how the aligner handles clipping and non-primary alignments.
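To make the distinction concrete, here is a minimal Python sketch that derives both numbers from a CIGAR string. The function name `cigar_lengths` and the example CIGAR strings are made up for illustration; the only thing taken from the SAM specification is which operations consume read bases (M, I, S, =, X) and which consume reference bases (M, D, N, =, X).

```python
import re

# CIGAR operations that consume read (query) bases or reference bases,
# following the SAM specification.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string."""
    read_len = 0
    ref_span = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in QUERY_OPS:
            read_len += n
        if op in REF_OPS:
            ref_span += n
    return read_len, ref_span


# Hypothetical CIGAR strings, chosen only for illustration.
print(cigar_lengths("48M"))      # (48, 48): no clipping, lengths agree
print(cigar_lengths("5S40M3S"))  # (48, 40): soft clipping, read longer than alignment
```

In a BED file made from a BAM file, column 3 - column 2 corresponds to the second number (the span on the reference), while the sequence column corresponds to the first.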


Thank you! I do understand why read length may be larger than alignment length. But I still do not understand how sometimes the alignment length can be larger than the read length. Can you explain this further?

Alignment:  GATCGATCACTGACGTATCTAGGCGATCAGTCGTACGTATCACTA


Here is a simple example of a deletion in the read relative to the reference. It makes the alignment two bp longer than the read, because the start and end of the alignment on the reference define the coordinates.
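The same arithmetic can be checked with a short Python sketch. It treats the 45-base sequence shown above as the reference row and builds a hypothetical read row with a two-base gap; the gap position and the helper name `gapped_lengths` are assumptions for illustration, not part of the original example.

```python
def gapped_lengths(ref_row, read_row):
    """Return (read_length, reference_span) for two alignment rows
    of equal length in which '-' marks a gap."""
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have the same length")
    read_len = sum(1 for base in read_row if base != "-")
    ref_span = sum(1 for base in ref_row if base != "-")
    return read_len, ref_span


ref_row = "GATCGATCACTGACGTATCTAGGCGATCAGTCGTACGTATCACTA"
# Hypothetical read: identical to the reference except that
# reference bases 22-23 are missing (a 2 bp deletion).
read_row = ref_row[:21] + "--" + ref_row[23:]

read_len, ref_span = gapped_lengths(ref_row, read_row)
print(read_len, ref_span, ref_span - read_len)  # 43 45 2
```

The read contributes 43 bases, but the alignment covers 45 reference positions, so end - start in a BED record would be 45.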


That was very well simplified! Thanks!