Hi,
I have a problem working with BAM files and pysam whenever I try to examine bases at a certain position within a sequencing read. What I want to do is look at all the reads covering a region (which works fine with pysam, using fetch()
) and compare the base of each read at a position, thus getting e.g. all read information about the 16,078,956. base pair in the genome. Something like Read.query_alignment_sequence[0]
gives me the first base in the aligned read, but getting the element at the position I want is a bit complicated - it now works by comparing coordinates.
Whenever a read contains an indel, comparing coordinates doesn't work anymore because "query_alignment_sequence" then is "shifted" and gets the positions wrong.
Is there any efficient way to work this out? If somebody has an idea, please let me know.
Thanks in advance!
Maybe you could give an example of what you're getting vs. what you expect to get. Pysam is just passing through what htslib is producing and htslib is very likely doing the correct thing.
Okay consider a short read example starting at base pair 23 and ending at the 30th base pair with a sequence
ACGTTATC
When I want to look at what base stands at position 29, I could do something like
Read.query_alignment_sequence[6]
and it would result inT
, correctly.The same read which contains deletions, something like
--GT--TC
, has "shifted" positions, so to get theT
at position 29 like before, I can't usequery_alignment_sequence
any more, because then, thequery_alignment_sequence
consists only ofGTTC
(and thus has no 6th position), or it gets "filled" up with the 4 deleted bases before or after the actual sequence. Nonetheless, the base isn't the 6th position any more. What I'd need would be something that replaces the actual bases in the sequence with0
orNone
. A sequence with[NONE NONE G T NONE NONE T C]
would lead to no more problems. But I still couldn't figure out how to do this.