Question: Efficiently extract alignments with NH:i:1 field
1
gravatar for rubic
3.8 years ago by
rubic180
United States
rubic180 wrote:

Pretty basic question.

 

I have bam files in which the only information that indicates whether an alignment is unique is it's NH field. I'm interested in keeping only unique alignments, hence only alignments with NH:i:1 field. Can samtools do this? If not is there any other efficient way other than using grep?

 

Thanks

rna-seq tool alignment • 4.7k views
ADD COMMENTlink modified 2.6 years ago by abl07190 • written 3.8 years ago by rubic180

Grep is pretty darn efficient. I can't think of a faster way to look at each read than to look at each read. Fundamentally any alternative would do the same.

ADD REPLYlink written 3.8 years ago by karl.stamm3.4k

Would grep really be as efficient as using samtools view if that was an option?

ADD REPLYlink written 3.8 years ago by rubic180

I meant to run samtools view and pipe through grep.  If samtools has a filter flag for your tag of interest that would be a tiny bit faster. If your tag of interest is obscure then you've got to grep it. grep has a non-regex fixed mode that is very fast, but nothing is going to read a BAM faster than samtools view.

 

ADD REPLYlink written 3.8 years ago by karl.stamm3.4k

let's assume my aligner writes the sam output to standard output. Is there a one-liner that would pipe and grep and this directly into a bam file (including the header)?

 
For example, if to pipe my alignment output to bam I use:
>align <fastq> | samtools view -Sb - > <bam>
 
How can I change it to pipe only alignments with NH = 1?
 
ADD REPLYlink written 3.8 years ago by rubic180
1

This should work, but there are probably faster ways...

align <fastq | grep -P '^@|NH:i:1\b' | samtools view -Sb - > <bam>
ADD REPLYlink written 3.8 years ago by thackl2.6k

Thanks thackl. It is indeed pretty slow. If anyone is following and knows a samtools command which is faster for achieving this that'll be great.

ADD REPLYlink written 3.8 years ago by rubic180
2

If you want bam2bam the following grep is about  10x faster than my previous idea. Roughly 1 minute / GB bam on my system. Depending on what type of alignments are in your bam, the samtools filter flags (no unmapped, only primary, no supplement - these cannot be unique mappings anyway) might speed up things as well.

(samtools view -H in.bam; samtools view -F 2308 in.bam | grep -w 'NH:i:1') | samtools view -bS - > out.bam
ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by thackl2.6k
1
Pardon my ignorance in linux. Is the full command you're suggesting this:
align <fastq> | samtools view -Sb (samtools view -H in.bam; samtools view -F 2308 in.bam | grep -w 'NH:i:1') | samtools view -bS - > out.bam 
ADD REPLYlink written 3.8 years ago by rubic180
2
gravatar for Friederike
3.8 years ago by
Friederike3.6k
United States
Friederike3.6k wrote:

No idea about how this compares speed-wise, but python is a bit more flexible and easier to use when filtering flags, I find. The pysam packages uses a wrapper around samtools to read in and write out BAM files.

I wrote a script once for checking all kinds of flags, tried to strip it down to what you might be interested in (haven't tested it though)

#!/usr/bin/env python

import sys
import argparse
import getopt
import pysam
import os

def get_args():
    parser=argparse.ArgumentParser(description='filter BAM files')
    parser.add_argument('--BAMfile', '-in', type=str, required=True, help="BAM file")
    parser.add_argument('--outfile', '-out', type=str, required=True, help = 'Name for output file')

    args=parser.parse_args()
    return args

def remove_multiAlignments(ReadTagsEntry, KeepInfo):
    '''Checks for NH and XS tag to remove reads that aligned more than once'''
    
    keepRead = KeepInfo
    
    if 'XS' in ReadTagsEntry:
        keepRead=False
    elif 'NH' in ReadTagsEntry and ReadTagsEntry[1] > 1:
        keepRead=False
    
    return(keepRead)

def main():
    args = get_args()
    
    infile = pysam.Samfile(args.BAMfile, "rb")
    out = pysam.Samfile(args.outfile , "wb", template = infile )
    
    keep = True

    for DNAread in infile.fetch():
        intags = DNAread.tags
         
         for entry in intags:
            keep = remove_multiAlignments(entry, keep)

        if keep:
            out.write(DNAread)

        keep = True

    out.close()

if __name__ == '__main__':
main()
ADD COMMENTlink written 3.8 years ago by Friederike3.6k
0
gravatar for 5heikki
3.8 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:
I don't remember the exact flags but basically samtools view -bS yourFile | awk '$X != ""' where x corresponds to the column that shouldn't be empty..
ADD COMMENTlink written 3.8 years ago by 5heikki8.4k
0
gravatar for abl0719
2.6 years ago by
abl07190
abl07190 wrote:

The filter function from Bamtools (https://github.com/pezmaster31/bamtools) may be a better choice here. Since the bam file is no decompressed, it should be faster than grep.

ADD COMMENTlink written 2.6 years ago by abl07190

Good idea, but actually, on my computer, bamtools is about 3 times slower than grep:

$ time (bamtools filter -in example.bam -tag NM:1 > /dev/null )

real    0m36.750s
user    0m36.512s
sys 0m0.216s

$ time (samtools view -h example.bam | grep '^@|NM:i:0' > /dev/null)

real    0m12.197s
user    0m14.432s
sys 0m0.788s

(I ran the commands multiple times, so the BAM file is already in the cache memories of whichever component of the computer its operating system.)

I am using the bamtools binary from Debian 8 (Jessie).

$ bamtools --version

bamtools 2.3.0
Part of BamTools API and toolkit
Primary authors: Derek Barnett, Erik Garrison, Michael Stromberg
(c) 2009-2012 Marth Lab, Biology Dept., Boston College

$ dpkg -l bamtools
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                      Version           Architecture      Description
+++-=========================-=================-=================-========================================================
ii  bamtools                  2.3.0+dfsg-2      amd64             toolkit for manipulating BAM (genome alignment) files
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Charles Plessy2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1199 users visited in the last hour