Hello
I am having a very strange bug in a code I am writing and when I tried to go step by step and look where the problem comes from I found this:
I have paired end reads, I do the alignment with bowtie2, no limit for the read sizes, I create my sam files, my bam and use samtools to sort and index, until here I have no problem
Then I wanted to look at the insert size in a quick way so I did
samtools view bamfile | cut -f 9 | sort -n | unique -c
and I have this distribution (this is just the header, it contains also high positive values
1 -1015
1 -1014
1 -1002
7 -1000
17 -999
9 -998
14 -997
21 -996
18 -995
13 -994
The bam file looks also Ok
M01783:49:000000000-A8MLU:1:1101:15827:12859 81 1 10003 1 76M X 155260048 0 ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC AS:i:-3 XS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:23A52 YT:Z:UP
M01783:49:000000000-A8MLU:1:2104:27042:8289 81 1 10003 1 76M X 155260048 0 ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC AS:i:-3 XS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:2C73 YT:Z:UP
M01783:49:000000000-A8MLU:1:2110:13157:18810 81 1 10003 1 76M X 155260048 0 ACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC FC@7,C8GGGGGFFCC,,,<CDD@6<8FFE@CDFFFAGGFFF@C<6,FGGGECGGGGFFFCGGGCGGGGGF<ACC@ AS:i:-3 XS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:4T71 YT:Z:UP
M01783:49:000000000-A8MLU:1:1102:5613:15723 133 1 10004 0 * = 1000 0 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG CCCCCGGGGCFEGGGGGGGGGECFFGGEE
I was expecting to have the same data using pysam, so adapted this to my code
def isPaired(bamfile, alignments=1000):
'''check if a *bamfile* contains paired end data
The method reads at most the first *alignments* and returns
True if any of the alignments are paired.
'''
samfile = pysam.Samfile(bamfile)
n = 0
for read in samfile:
if read.is_paired:
break
n += 1
if n == alignments:
break
samfile.close()
return n != alignments
def estimateInsertSizeDistribution(bamfile, alignments=1000):
'''
estimate insert size from first alignments in bam file.
returns mean and stddev of insert sizes.
'''
assert isPaired(bamfile), \
'can only estimate insert size from' \
'paired bam files'
samfile = pysam.Samfile(bamfile)
# only get positive to avoid double counting
inserts = np.array(
[read.tlen for read in samfile.head(alignments)
if read.is_proper_pair and read.tlen > 0])
return inserts
This is supposed to return these positive values that I saw with samtools command but I got only 0 (zeros) as insert size, even if I remove the condition for reads to be in proper pair, I still have zeros as insert size (9th column)
I also took a look at the reads that pysam generated, and I was surprised to get zeros at the 9th column everywhere.
Do you think it is a pysam bug ? or just something I am missing ?
and here you can compare the output from pysam where you can see clearly that the 9th column is 0 instead of 1 for the first read
>>> for read in samfile.head(2):
... print read
...
M01783:49:000000000-A8MLU:1:1101:15827:12859 81 0 10002 1 [(0, 76)] 22 155260047 76 ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '23A52'), ('YT', 'UP')]
M01783:49:000000000-A8MLU:1:2104:27042:8289 81 0 10002 1 [(0, 76)] 22 155260047 76 ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '2C73'), ('YT', 'UP')]
You're right !!
Thanks a lot, that worked just fine
I appreciate