How To Keep Only Matching Reads From Two Sam Files?
2
4
Entering edit mode
12.2 years ago
Applet ▴ 150

So I have two sam files and I want to delete from each one any reads that are not in the other one.

So I wrote some code using pysam to figure out which reads are in both and write those back out to new sam files.

However my new sam files don't have any headers. So the next time I try to open the files with pysam it aborts saying it can't find headers.

Any idea how to do this? I actually don't understand what the headers are for? Is there a way to get the correct headers into the files I'm creating?

Here's my code if it helps:

def open_as_pysam(self, filepath):
    try:
        return pysam.Samfile(filepath, 'r')
    except:
        return pysam.Samfile(filepath + '.gz','r')

def remove_non_matching_reads(self, sim_samfilepath, sec_samfilepath, delete_original_files=False):
    """Only keep reads that are in both files (by qname)."""
    sim_removed_count, sec_removed_count = 0,0
    sim_reads, sec_reads = set(), set()
    sim_samfile_in = self.open_as_pysam(sim_samfilepath)
    sec_samfile_in = self.open_as_pysam(sec_samfilepath)

    for read in sim_samfile_in.fetch():
        sim_reads.add(read.qname)
    for read in sec_samfile_in.fetch():
        sec_reads.add(read.qname)
    both = sim_reads.intersection(sec_reads)
    print "Found %s reads in sim file" % len(sim_reads)
    print "Found %s reads in sec file" % len(sec_reads)
    print "Found %s reads common to both files" % len(both)

    #write out only those in both
    out_sim_samfilepath = sim_samfilepath + '.noncommon.reads.removed.sam'
    out_sec_samfilepath = sec_samfilepath + '.noncommon.reads.removed.sam'
    sim_outfile = pysam.Samfile(out_sim_samfilepath, 'w', template=sim_samfile_in)
    sec_outfile = pysam.Samfile(out_sec_samfilepath, 'w', template=sec_samfile_in)

    #Close and re-open input files because I don't seem to have access to the seek 
    #method in the pysam files to reset the fetch.
    sim_samfile_in.close()
    sec_samfile_in.close()
    sim_samfile_in = self.open_as_pysam(sim_samfilepath)
    sec_samfile_in = self.open_as_pysam(sec_samfilepath)

    for read in sim_samfile_in.fetch():
        if read.qname in both:
            sim_outfile.write(read)
        else:
            sim_removed_count +=1
    for read in sec_samfile_in.fetch():
        if read.qname in both:
            sec_outfile.write(read)
        else:
            sec_removed_count +=1

    print "Removed %s simulans reads and %s seculans reads" % (sim_removed_count, sec_removed_count)
    sim_outfile.close()
    sec_outfile.close()

    if delete_original_files:
        os.remove(sim_samfilepath)
        os.remove(sec_samfilepath)

    return out_sim_samfilepath, out_sec_samfilepath
sam samtools • 3.7k views
ADD COMMENT
3
Entering edit mode
12.2 years ago
Michael 54k

samtools has a reheader argument:

$ samtools reheader 
Usage: samtools reheader <in.header.sam> <in.bam>

That should put the header of in.header.sam as the header into in.bam. If the header of the original files is correct, you might have success using this.

ADD COMMENT
0
Entering edit mode

You can also use picard CreateSequenceDictionary on the genome fasta and add that to the top of the sam file.

ADD REPLY
2
Entering edit mode
12.2 years ago
Fidel ★ 2.0k

Add 'h' to the line where you get the handler to open the file:

pysam.Samfile(out_sim_samfilepath, 'wh', template=sim_samfile_in)

I have not tried myself, but that's what the API says. See:

http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html

ADD COMMENT
0
Entering edit mode

That seems to work. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6