Using pysam fetch, I can find & iterate over all short reads that map to a given position/column (1000, say) in the reference genome:
BamFile = pysam.AlignmentFile(FileName, "rb") position=1000 ShortReadsMappingHere = BamFile.fetch("MyReference", position-1, position) for ShortRead in ShortReadsMappingHere: # do something
What's the most efficient way to find where the position I'm interested in (here 1000) is within each ShortRead? i.e. ShortRead.seq[0] will give me the first element of this short read, but which element is the position I want?
This line of code must get it right:
ThisPositionInThisRead = ShortRead.get_reference_positions(full_length=True).index(position-1)
because it finds, for every position in the short read, the position in the reference genome that it maps to (or 'None' if it doesn't map) and retrieves the index of the one mapping to the position I want. It's clearly super inefficient however. The following line
ThisPositionInThisRead = position - ShortRead.reference_start + ShortRead.query_alignment_start - 1
almost always gets it right, by comparing coordinates, but can get it wrong if this short read has an indel not present in the consensus.
Is there a simple (and correct) way of determining this number based on coordinates?