Question: Discrepancies in Bam Read Lengths
gravatar for jdimatteo
5.2 years ago by
United States
jdimatteo40 wrote:


Three sources suggest different read lengths -- how can I definitively find the read length?

1. A colleague suggested that this bam file has reads 40 base pairs long,

2. samtools view seems to suggest the reads are 44 base pairs long, and

3. Tablet ( seems to suggest most reads are 40 base pairs long but shows a couple less than 40 (e.g. 36, 38, and 39) base pairs long.

For samtools, I'm running the following command, based off of How Can I Know The Length Of Mapped Reads From Bam File? :

$ samtools view foo.bam | head -n 1000000 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1000000 44
$ samtools view foo.bam | head -n 1
HWI-ST673_0087:1:1103:11279:92479#NNNNNN/1    0    chrM    1    255    4M4I36M    *    0    0    GATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCT    bbbeeaceggggfeghiiiiiiiiiiihiiihiiiiihfhhhhi    AS:i:-23    XN:i:0    XM:i:1    XO:i:1    XG:i:4    NM:i:5    MD:Z:3C36    YT:Z:UU

With tablet, I see the following:

I assume that I'm just misinterpreting the output. I don't have any formal background in bioinformatics (I am a software developer).  If anyone could help set me straight here, I'd appreciate it, or maybe suggest some good relevant reading.  Do bam files always have uniform read lengths?

Thanks for the help! 

Update: was looking at different file in tablet -- discrepancy resolved!

bam • 1.3k views
ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by jdimatteo40
gravatar for matted
5.2 years ago by
Boston, United States
matted7.2k wrote:

The sequence reported in the bam file is what I would trust for the read length.  This may be different from the alignment length, due to clipping, insertions, or deletions (see here for more).  The visualizers you use may only plot the aligned portion of the read, which could lead to the differences that you're seeing.

There are no rules about read lengths in bam files, since you can put anything into them and they may represent combinations of various experiments.  My experience is that a single Illumina sequencing run will have constant read length, but if your files represent multiple runs or have preprocessed data (maybe via adapter or poly-A trimming), that assumption won't hold.

ADD COMMENTlink written 5.2 years ago by matted7.2k

Thanks for the explanation.  It turns out I was stupidly looking at the wrong file in Tablet!

But your answer and the others helped me better understand samtools view, cigar, and what to expect with regards to uniform read length -- thank you!

ADD REPLYlink written 5.2 years ago by jdimatteo40
gravatar for geek_y
5.2 years ago by
geek_y10k wrote:

You should also consider the CIGAR string. 4M4I32M

ADD COMMENTlink written 5.2 years ago by geek_y10k

Thanks for pointing out that useful link -- I hadn't understood cigar before reading that!

So a cigar of "4M4I36M" for "GATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCT", means we have 4 matching the reference, then 4 that don't match, then 36 that match the reference.

So the read is 44 long total according to the CIGAR:

4M  4I  36M
ADD REPLYlink written 5.2 years ago by jdimatteo40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1830 users visited in the last hour