Question

Discrepancies in Bam Read Lengths

1

Entering edit mode

9.2 years ago

jdimatteo ▴ 80

Hello,

Three sources suggest different read lengths -- how can I definitively find the read length?

A colleague suggested that this bam file has reads 40 base pairs long,
samtools view seems to suggest the reads are 44 base pairs long, and
Tablet seems to suggest most reads are 40 base pairs long but shows a couple less than 40 (e.g. 36, 38, and 39) base pairs long.

For samtools, I'm running the following command, based off of How Can I Know The Length Of Mapped Reads From Bam File?:

$ samtools view foo.bam | head -n 1000000 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1000000 44
$ samtools view foo.bam | head -n 1
HWI-ST673_0087:1:1103:11279:92479#NNNNNN/1    0    chrM    1    255    4M4I36M    *    0    0    GATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCT    bbbeeaceggggfeghiiiiiiiiiiihiiihiiiiihfhhhhi    AS:i:-23    XN:i:0    XM:i:1    XO:i:1    XG:i:4    NM:i:5    MD:Z:3C36    YT:Z:UU

With tablet, I see the following:

< image not found >

I assume that I'm just misinterpreting the output. I don't have any formal background in bioinformatics (I am a software developer). If anyone could help set me straight here, I'd appreciate it, or maybe suggest some good relevant reading. Do bam files always have uniform read lengths?

Thanks for the help!

Update: was looking at different file in tablet -- discrepancy resolved!

bam • 2.3k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by jdimatteo ▴ 80

1

Entering edit mode

9.2 years ago

GouthamAtla 12k

You should also consider the CIGAR string. 4M4I32M

ADD COMMENT • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks for pointing out that useful link -- I hadn't understood cigar before reading that!

So a cigar of "4M4I36M" for "GATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCT", means we have 4 matching the reference, then 4 that don't match, then 36 that match the reference.

So the read is 44 long total according to the CIGAR:

4M  4I  36M
GATG    ACAGGTCTATCACCCTATTAACCACTCACGGGAGCT
    GATC

ADD REPLY • link 9.2 years ago by jdimatteo ▴ 80

Ram · Accepted Answer · 2015-02-07

2

Entering edit mode

9.2 years ago

matted 7.8k

The sequence reported in the bam file is what I would trust for the read length. This may be different from the alignment length, due to clipping, insertions, or deletions (see here for more). The visualizers you use may only plot the aligned portion of the read, which could lead to the differences that you're seeing.

There are no rules about read lengths in bam files, since you can put anything into them and they may represent combinations of various experiments. My experience is that a single Illumina sequencing run will have constant read length, but if your files represent multiple runs or have preprocessed data (maybe via adapter or poly-A trimming), that assumption won't hold.

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by matted 7.8k

1

Entering edit mode

Thanks for the explanation. It turns out I was stupidly looking at the wrong file in Tablet!

But your answer and the others helped me better understand samtools view, cigar, and what to expect with regards to uniform read length -- thank you!

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by jdimatteo ▴ 80