samtools and pysam different report of the insert size
1
0
Entering edit mode
10.7 years ago
Rad ▴ 810

Hello

I am having a very strange bug in a code I am writing and when I tried to go step by step and look where the problem comes from I found this:

I have paired end reads, I do the alignment with bowtie2, no limit for the read sizes, I create my sam files, my bam and use samtools to sort and index, until here I have no problem

Then I wanted to look at the insert size in a quick way so I did

samtools view bamfile | cut -f 9 | sort -n | unique -c

and I have this distribution (this is just the header, it contains also high positive values

1 -1015
1 -1014
1 -1002
7 -1000
17 -999
9 -998
14 -997
21 -996
18 -995
13 -994

The bam file looks also Ok

M01783:49:000000000-A8MLU:1:1101:15827:12859    81    1    10003    1    76M    X    155260048    0    ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC    AS:i:-3    XS:i:-3    XN:i:0    XM:i:1    XO:i:0    XG:i:0    NM:i:1    MD:Z:23A52    YT:Z:UP
M01783:49:000000000-A8MLU:1:2104:27042:8289    81    1    10003    1    76M    X    155260048    0    ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC    AS:i:-3    XS:i:-3    XN:i:0    XM:i:1    XO:i:0    XG:i:0    NM:i:1    MD:Z:2C73    YT:Z:UP
M01783:49:000000000-A8MLU:1:2110:13157:18810    81    1    10003    1    76M    X    155260048    0    ACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    FC@7,C8GGGGGFFCC,,,<CDD@6<8FFE@CDFFFAGGFFF@C<6,FGGGECGGGGFFFCGGGCGGGGGF<ACC@    AS:i:-3    XS:i:-3    XN:i:0    XM:i:1    XO:i:0    XG:i:0    NM:i:1    MD:Z:4T71    YT:Z:UP
M01783:49:000000000-A8MLU:1:1102:5613:15723    133    1    10004    0    *    =    1000    0    GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG    CCCCCGGGGCFEGGGGGGGGGECFFGGEE

I was expecting to have the same data using pysam, so adapted this to my code

def isPaired(bamfile, alignments=1000):
    '''check if a *bamfile* contains paired end data

    The method reads at most the first *alignments* and returns
    True if any of the alignments are paired.
    '''
    samfile = pysam.Samfile(bamfile)
    n = 0
    for read in samfile:
        if read.is_paired:
            break
        n += 1
        if n == alignments:
            break
    samfile.close()
    return n != alignments

def estimateInsertSizeDistribution(bamfile, alignments=1000):
    '''
    estimate insert size from first alignments in bam file.
    returns mean and stddev of insert sizes.
    '''
    assert isPaired(bamfile), \
        'can only estimate insert size from' \
        'paired bam files'
    samfile = pysam.Samfile(bamfile)
    # only get positive to avoid double counting
    inserts = np.array(
        [read.tlen for read in samfile.head(alignments)
         if read.is_proper_pair and read.tlen > 0])
    return inserts

This is supposed to return these positive values that I saw with samtools command but I got only 0 (zeros) as insert size, even if I remove the condition for reads to be in proper pair, I still have zeros as insert size (9th column)

I also took a look at the reads that pysam generated, and I was surprised to get zeros at the 9th column everywhere.

Do you think it is a pysam bug ? or just something I am missing ?

and here you can compare the output from pysam where you can see clearly that the 9th column is 0 instead of 1 for the first read

>>> for read in samfile.head(2):
...   print read
...
M01783:49:000000000-A8MLU:1:1101:15827:12859    81    0    10002    1    [(0, 76)]    22    155260047    76    ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC    [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '23A52'), ('YT', 'UP')]
M01783:49:000000000-A8MLU:1:2104:27042:8289    81    0    10002    1    [(0, 76)]    22    155260047    76    ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC    [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '2C73'), ('YT', 'UP')]
bowtie2 samtools insert pysam • 6.7k views
ADD COMMENT
2
Entering edit mode
10.7 years ago

Increase alignments from 1000 to 50,000 in (bamfile, alignments=1000) and then you should see non zero insert-sizes. All the alignments in the bam output you have shown above have TLEN (~insert sizes) as 0. This happens when either one of the read is unmapped or both the reads map to different chromosomes. It is highly likely that lot of your alignments belong to one of these two cases and when you sample a small set (n=1000) of your alignments, you cant find any alignment with non zero insert sizes. Rerun the command you pasted above

samtools view bamfile | cut -f 9 | sort -n | unique -c

and you will see lot of alignments with '0' insert sizes. My answer is based on an assumption and I may be wrong.

ADD COMMENT
0
Entering edit mode

You're right !!

Thanks a lot, that worked just fine

I appreciate

ADD REPLY

Login before adding your answer.

Traffic: 2837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6