pysam fetch: find desired position within short read
1
1
Entering edit mode
7.6 years ago
c.wymant ▴ 10

Using pysam fetch, I can find & iterate over all short reads that map to a given position/column (1000, say) in the reference genome:

BamFile = pysam.AlignmentFile(FileName, "rb")
position=1000
# do something

What's the most efficient way to find where the position I'm interested in (here 1000) is within each ShortRead? i.e. ShortRead.seq[0] will give me the first element of this short read, but which element is the position I want?
This line of code must get it right:

  ThisPositionInThisRead = ShortRead.get_reference_positions(full_length=True).index(position-1)


because it finds, for every position in the short read, the position in the reference genome that it maps to (or 'None' if it doesn't map) and retrieves the index of the one mapping to the position I want. It's clearly super inefficient however. The following line

  ThisPositionInThisRead = position - ShortRead.reference_start + ShortRead.query_alignment_start - 1

almost always gets it right, by comparing coordinates, but can get it wrong if this short read has an indel not present in the consensus.
Is there a simple (and correct) way of determining this number based on coordinates?

pysam fetch mapping short reads • 5.1k views
0
Entering edit mode
7.6 years ago

Have a look at the mpileup functions (I've not used it with pysam, just the base C API). It'll let you iterate over reads covering a given position and indicate for you which base in a read actually covers that position (it also indicates insertions/deletions).

An example of doing this with the C API (the function names are more or less the same, since pysam is a wrapper around the htslib C API) can be found in PileOMeth (within each pileup, the qpos value indicates the actual position within the alignment).