Tool: Filtering BAM files efficiently
0
gravatar for ddeemer
12 weeks ago by
ddeemer0
ddeemer0 wrote:

I do a lot of manipulation to BAM/SAM file formats and came across some questions regarding how to filter these results efficiently. Below, I have a python script that I use to filter based on the reference_name section of the BAM file. Note: I work in Metagenomics so many times I am aligning reads to a genome scaffold, which contains many different contigs. One application of this tool would be to filter out a BAM file to only contain contigs that were placed in a metagenome-assembled genome bin.

import sys
import argparse
import pysam


def readBinID(binid, keep_bin=False):
    contigList = [0] * 400000
    with open(binid) as b:
        line = b.readline().strip()
        while line:
            if keep_bin:
                bin_name = line.split('\t')[0]
                if bin_name in keep_bin:
                    contig = int(line.split('\t')[1].split('_')[1])
                    contigList[contig] = 1
            else:
                contig = int(line.split('\t')[1].split('_')[1])
                contigList[contig] = 1
            line = b.readline().strip()
    return contigList


def filterBam(inBam, outBam, contigs, keep_bin=False):
    contigList = readBinID(contigs, keep_bin)
    bam = pysam.AlignmentFile(inBam)
    obam = pysam.AlignmentFile(outBam, 'wb', template=bam)

    for b in bam.fetch(until_eof=True):
        try:
            index = int(b.reference_name.split('_')[1])
        except AttributeError:
            pass
        if contigList[index] == 1:
            obam.write(b)

    bam.close()
    obam.close()
    return 0


if __name__ == '__main__':
    from datetime import datetime
    start = datetime.now()
    parser = argparse.ArgumentParser(description="Parser")
    parser.add_argument("-i", "--Input",
                        help="Input .BAM file to filter.",
                        default=sys.stdin, required=False)
    parser.add_argument("-o", "--Output",
                        help="Output .BAM file to write to.",
                        required=True)
    parser.add_argument("-c", "--Contigs", help="File containing reference \
    identifiers and the associated bin (each on its own line & tab-separated)",
                        required=True)
    parser.add_argument("-k", "--Keep",
                        help="Keep only contigs in the specified bins.",
                        nargs="*", required=False, default=False)
    argument = parser.parse_args()
    filterBam(argument.Input, argument.Output,
              argument.Contigs, keep_bin=argument.Keep)
    msg = f"{datetime.now() - start}\n"
    with open('log.txt', 'a') as log:
        log.write(msg)

The idea here is that the first function, readBinID, just takes in a simple tab-delimited file containing all of the contigs of interest. This file has 2 fields: bin the contig belongs in, and the contig name. This could instead be easily manipulated to be a file containing a list of all the read ids, for example, and then the filterBam function would just need to change from 'b.reference_name' to 'b.query_name' I believe. There's functionality in this for selecting all contigs in the file or specifying based on being in a certain bin. What's most import here is initializing a list full of 0s and finding a way to code each contig so you know where each is in the list. With the programs I work with, my contigs are easily labeled from 1 - last contig, so pretty direct.

For a 2Gb bam file containing reads mapped to ~300,000 different references, it takes around 1-2 minutes at most to filter with 1 CPU (memory requirements are super low).

If anyone has any ideas on how to speed this up I'd greatly appreciate it. Hope this may help someone.

DGD

tool bam sam alignment genome • 188 views
ADD COMMENTlink written 12 weeks ago by ddeemer0
1

how is it different from samtools view -M -L in.bed ?

ADD REPLYlink written 12 weeks ago by Pierre Lindenbaum134k

Yeah it is essentially the same concept and works just as well (maybe better, haven't tested?). I think the reason I use my solution stems from working in Metagenomics, where I deal with lots and lots of different metagenome-assembled genomes (MAGs); in order to keep track of which contigs make up which MAGs, I use a simple 2-field file containing: <contig\tmag\n> and don't need all the information in a bed file. Since I'm normally working with upwards of 1 million contigs, it makes the files slightly smaller. I understand I could still use a .bed file for this, but other programs have adopted this format as input for various things. Programs like Anvi'o have strict requirements on the information a BAM can contain and use the same 'MAG-identification' file format I do, so I've used this for that application. In general you can do a lot with a BAM file just by tweaking this script slightly, depending on what you want to do, so I decided to share.

ADD REPLYlink written 12 weeks ago by ddeemer0

the iteration of python is to much slow, so I think you can use some other language like perl or bash, to warp samtools to do that

ADD REPLYlink written 12 weeks ago by xiaoguang50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2138 users visited in the last hour
_