samtools and pysam different report of the insert size
Entering edit mode
7.2 years ago
Rad ▴ 800


I am having a very strange bug in a code I am writing and when I tried to go step by step and look where the problem comes from I found this :

I have paired end reads, I do the alignment with bowtie2, no limit for the read sizes, I create my sam files, my bam and use samtools to sort and index, until here I have no problem

Then I wanted to look at the insert size in a quick way so I did 


samtools view bamfile | cut -f 9 | sort -n | unique -c


and I have this distribution (this is just the header, it contains also high positive values


1 -1015
1 -1014
1 -1002
7 -1000
17 -999
9 -998
14 -997
21 -996
18 -995
13 -994


The bam file looks also Ok


M01783:49:000000000-A8MLU:1:1101:15827:12859    81    1    10003    1    76M    X    155260048    0    ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC    AS:i:-3    XS:i:-3    XN:i:0    XM:i:1    XO:i:0    XG:i:0    NM:i:1    MD:Z:23A52    YT:Z:UP
M01783:49:000000000-A8MLU:1:2104:27042:8289    81    1    10003    1    76M    X    155260048    0    ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC    AS:i:-3    XS:i:-3    XN:i:0    XM:i:1    XO:i:0    XG:i:0    NM:i:1    MD:Z:2C73    YT:Z:UP
M01783:49:000000000-A8MLU:1:2110:13157:18810    81    1    10003    1    76M    X    155260048    0    ACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    FC@7,C8GGGGGFFCC,,,<CDD@6<8FFE@CDFFFAGGFFF@C<6,FGGGECGGGGFFFCGGGCGGGGGF<ACC@    AS:i:-3    XS:i:-3    XN:i:0    XM:i:1    XO:i:0    XG:i:0    NM:i:1    MD:Z:4T71    YT:Z:UP


I was expecting to have the same data using pysam, so adapted this to my code


def isPaired(bamfile, alignments=1000):
    '''check if a *bamfile* contains paired end data

    The method reads at most the first *alignments* and returns
    True if any of the alignments are paired.
    samfile = pysam.Samfile(bamfile)
    n = 0
    for read in samfile:
        if read.is_paired:
        n += 1
        if n == alignments:
    return n != alignments

def estimateInsertSizeDistribution(bamfile, alignments=1000):
    estimate insert size from first alignments in bam file.
    returns mean and stddev of insert sizes.
    assert isPaired(bamfile), \
        'can only estimate insert size from' \
        'paired bam files'
    samfile = pysam.Samfile(bamfile)
    # only get positive to avoid double counting
    inserts = np.array(
        [read.tlen for read in samfile.head(alignments)
         if read.is_proper_pair and read.tlen > 0])
    return inserts


This is supposed to return these positive values that I saw with samtools command but I got only 0 (zeros) as insert size, even if I remove the condition for reads to be in proper pair, I still have zeros as insert size (9th column)

I also took a look at the reads that pysam generated, and I was surprised to get zeros at the 9th column everywhere.

Do you think it is a pysam bug ? or just something I am missing ?


and here you can compare the output from pysam where you can see clearly that the 9th column is 0 instead of 1 for the first read

>>> for read in samfile.head(2):
...   print read
M01783:49:000000000-A8MLU:1:1101:15827:12859    81    0    10002    1    [(0, 76)]    22    155260047    76    ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC    [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '23A52'), ('YT', 'UP')]
M01783:49:000000000-A8MLU:1:2104:27042:8289    81    0    10002    1    [(0, 76)]    22    155260047    76    ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC    [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '2C73'), ('YT', 'UP')]
insert bowtie2 samtools pysam • 5.3k views
Entering edit mode

You're right !!

Thanks a lot, that worked just fine

I appreciate


Entering edit mode
7.2 years ago

Increase alignments from 1000 to 50,000 in (bamfile, alignments=1000) and then you should see non zero insert-sizes. All the alignments in the bam output you have shown above have TLEN (~insert sizes) as 0. This happens when either one of the read is unmapped or both the reads map to different chromosomes. It is highly likely that lot of your alignments belong to one of these two cases and when you sample a small set (n=1000) of your alignments, you cant find any alignment with non zero insert sizes. Rerun the command you pasted above

samtools view bamfile | cut -f 9 | sort -n | unique -c

and you will see lot of alignments with '0' insert sizes.  My answer is based on an assumption and I may be wrong.


Login before adding your answer.

Traffic: 2793 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6