Question: Iterating through paired-end reads in samtools / pysam
0
gravatar for olavur
3 months ago by
olavur60
Tórshavn, Faroe Islands
olavur60 wrote:

I want to process both reads in a paired-end read simultaneously, but don't know how to do this efficiently.

I'm currently doing it the following way, using the pysam Python wrapper for for the htslib API (Samtools).

samfile = AlignmentFile(filename, 'rb')  # BAM file reader.
# Iterate through reads.
for read in samfile:
    # Check that the read has a pair that is mapped and not a duplicate.
    if read.is_paired and not read.mate_is_unmapped and not read.is_duplicate:
        # Get the other read in the pair.
        read_mate = samfile.mate(read)

This is really slow however.

From what I understand, the samfile.mate(read) call makes the BAM file reader jump to another place in the BAM. So it sounds like it would be a good idea to have two readers open and using one to iterate through the reads and the other to fetch the mate read, but that doesn't speed things up. I don't know whether paired-end reads are adjacent in a sorted and indexed BAM file, in that I case I would be able to take advantage of that, and do something along the lines of:

reads = samfile.fetch()
for i in range(number_of_pairs):
    read1 = next(reads)
    read2 = next(reads)

Anyone know how I can do this in a way that won't take forever to run?

sam bam python • 241 views
ADD COMMENTlink modified 29 days ago by Michi930 • written 3 months ago by olavur60
1

Anyone know how I can do this in a way that won't take forever to run?

sort your bam on query name

ADD REPLYlink written 3 months ago by Pierre Lindenbaum110k

I tried this, and from what I understand you can't index a BAM that isn't sorted by position. This makes life quite difficult, as many tools need the BAM to be indexed. Thanks for the suggestion. I'm not sure how to make it work.

ADD REPLYlink written 3 months ago by olavur60

okay, why do you need to fetch the mate ? what is the aim of your project ?

ADD REPLYlink written 3 months ago by Pierre Lindenbaum110k

I have linked-read data where the reads that are close together have the same barcode. I want to try to check that the paired reads have the same barcode, with the assumption that a barcode mismatch means there were spurious alignments. Among other things, I think this might help me get a better estimate of insert size, assuming that the spurious alignments give false predictions of insert size. I think I will put this project on hold as I feel I lack the understanding to do this properly, but nevertheless I think it would be useful to be able to obtain the mate pairs in an efficient and fast manner.

ADD REPLYlink written 3 months ago by olavur60
0
gravatar for Michi
29 days ago by
Michi930
Barcelona
Michi930 wrote:

I agree that it should be a bit simpler, given that paired end bam files are so common.

I used the following solution, which is only applicable if the .bam file has been previously sorted by read names samtools sort -n!

samfile = AlignmentFile(filename, 'rb')  # BAM file reader.
# Iterate through reads.
read1 = None
read2 = None

for read in samfile:

    if not read.is_paired or read.mate_is_unmapped or read.is_duplicate:
          continue

    if read.is_read2:
        read2 = read
    else:
        read1 = read
        read2 = None
        continue

     if not read1 is None and not read2 is None and read1.query_name == readf2.query_name:
          print("found a pair!")
          ## do your stuff
ADD COMMENTlink modified 29 days ago • written 29 days ago by Michi930
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1481 users visited in the last hour