Pysam fetching reads with indels: Returns wrong positions of bases
1
0
Entering edit mode
8.7 years ago
Lylthera • 0

Hi,

I have a problem working with BAM files and pysam whenever I try to examine bases at a certain position within a sequencing read. What I want to do is look at all the reads covering a region (which works fine with pysam, using fetch()) and compare the base of each read at a position, thus getting e.g. all read information about the 16,078,956. base pair in the genome. Something like Read.query_alignment_sequence[0] gives me the first base in the aligned read, but getting the element at the position I want is a bit complicated - it now works by comparing coordinates.

Whenever a read contains an indel, comparing coordinates doesn't work anymore because "query_alignment_sequence" then is "shifted" and gets the positions wrong.

Is there any efficient way to work this out? If somebody has an idea, please let me know.

Thanks in advance!

pysam indel alignment bam sequence • 4.9k views
ADD COMMENT
1
Entering edit mode

Maybe you could give an example of what you're getting vs. what you expect to get. Pysam is just passing through what htslib is producing and htslib is very likely doing the correct thing.

ADD REPLY
0
Entering edit mode

Okay consider a short read example starting at base pair 23 and ending at the 30th base pair with a sequence

ACGTTATC

When I want to look at what base stands at position 29, I could do something like Read.query_alignment_sequence[6] and it would result in T, correctly.

The same read which contains deletions, something like --GT--TC, has "shifted" positions, so to get the T at position 29 like before, I can't use query_alignment_sequence any more, because then, the query_alignment_sequence consists only of GTTC (and thus has no 6th position), or it gets "filled" up with the 4 deleted bases before or after the actual sequence. Nonetheless, the base isn't the 6th position any more. What I'd need would be something that replaces the actual bases in the sequence with 0 or None. A sequence with [NONE NONE G T NONE NONE T C] would lead to no more problems. But I still couldn't figure out how to do this.

ADD REPLY
3
Entering edit mode
8.7 years ago

Pysam is working correctly and is giving you the Nth base of the read as requested. If you want the sequence in a read to be adjusted for indels then you'll have to write a function to parse the cigar string and generate a list of bases as output.

ADD COMMENT
0
Entering edit mode

Sorry for the late reply but thanks a lot, this worked properly!

ADD REPLY

Login before adding your answer.

Traffic: 1489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6