Question: Bam : Extract Only Alignment With A Specific Length
1
gravatar for Nicolas Rosewick
7.0 years ago by
Belgium, Brussels
Nicolas Rosewick6.6k wrote:

Hi,

Is it possible to extract from a BAM file, only alignments with a specifi lentgh (ex: length > 60 ) ?

Thanks,

N.

alignment extraction length rna bam • 8.4k views
ADD COMMENTlink written 7.0 years ago by Nicolas Rosewick6.6k
7
gravatar for Marvin
7.0 years ago by
Marvin800
Marvin800 wrote:

Assuming "length of alignment" is measured on the reference, this will do it for SAM:

perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60'

Combine with samtools as needed, e.g:

samtools view -h foo.bam | perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60 or /^@/' | samtools view -bS - > bar.bam
ADD COMMENTlink written 7.0 years ago by Marvin800
5
gravatar for Pierre Lindenbaum
7.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

Here is a solution using the picard API:

import java.io.File;

import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;

public class Biostar12433 {

    public static void main(String[] args) throws Exception
        {
        if(args.length!=1) return;
        final int LENGTH=60;
        SAMFileReader inputSam=new SAMFileReader(new File(args[0]));
        inputSam.setValidationStringency(ValidationStringency.SILENT);
        SAMRecordIterator iter=inputSam.iterator();
        SAMFileWriterFactory sf=new SAMFileWriterFactory();
        sf.setCreateIndex(false);
        SAMFileWriter sw=sf.makeSAMWriter(inputSam.getFileHeader(),false,System.out);

        while(iter.hasNext())
            {
            SAMRecord rec=iter.next();
            if(rec.getReadUnmappedFlag()) continue;
            int length=rec.getAlignmentEnd()-rec.getAlignmentStart();
            if(length<LENGTH)
                {
                continue;
                }
            sw.addAlignment(rec);
            }
        iter.close();
        inputSam.close();
        sw.close();
        }

}

Compilation:

javac -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433.java

Execution:

 java -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433 file.bam > result.sam

Here I've used rec.getAlignmentEnd()-rec.getAlignmentStart(); but you could also use getUnclippedStart/getUnclippedEnd. See http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html

ADD COMMENTlink written 7.0 years ago by Pierre Lindenbaum112k

nice! I am still in the world of parsing out the CIGAR strings, need to change that

ADD REPLYlink written 7.0 years ago by Istvan Albert ♦♦ 77k

6.5 years later: use samjdk : http://lindenb.github.io/jvarkit/SamJdk.html . Something like:

java -jar dist/samjdk.jar -e 'return record.getReadUnmappedFlag() || rec.getEnd()-rec.getStart()>=60;' in.bam
ADD REPLYlink modified 10 weeks ago • written 6 months ago by Pierre Lindenbaum112k
5
gravatar for Sven Bilke
7.0 years ago by
Sven Bilke70
Bethesda, MD, USA
Sven Bilke70 wrote:

Here is yet another version, in C, for a change. It implements different length functions and it should be fairly easy to accommodate to whatever your needs are.

#include <stdio.h>
#include <samtools/sam.h>


int len_query(const bam1_t *b) {
  // This one includes soft-clipped nucleotides
  return  bam_cigar2qlen(&b->core, bam1_cigar(b));
}

int len_genome(const bam1_t *b) {
  // length in genome-cooredinates
  return  bam_calend(&b->core, bam1_cigar(b)) - b->core.pos;
}

int len_align(const bam1_t *b)  {
  // pedestrian way, fill in whatever choice you like, currently caculates number of (non-insert, non-delete) 
  // aligned (match & missmatch) nt's of the query
  int i;
  uint32_t *cigar = bam1_cigar(b);
  int l  = 0;
  for(i=0; i < b->core.n_cigar; ++i) {
    const int opl  = cigar[i] >> BAM_CIGAR_SHIFT;
    const int op   = cigar[i] & BAM_CIGAR_MASK;
    if(op == BAM_CMATCH)   l+= opl;
  }
  return l;
}


int main(int argc, char **argv) {
  int (*len_func)(const bam1_t *);
  if(argc < 4) {
    fprintf(stderr, "Insuficient arguments, Usage: %s len mode in.bam out.bam\n", argv[0]);
    exit(-1);
  } 
  switch(*argv[2]) {
  case 'Q': 
    len_func = len_query;
    break;

  case 'A': 
    len_func = len_align;
    break;

  case 'G': 
    len_func = len_genome;
    break;

  default:
    fprintf(stderr, "unsupported mode, known are _Q_uery-length, _A_ligned length and _G_enomic length\n");
    exit(-1);
    break;
  }

  int len=atoi(argv[1]);

  samfile_t *I = samopen(argv[3], "rb", 0);
  if(!I) {
    printf("Can not open sam input %s\n", argv[3]);
    exit(-1);
  }
  samfile_t *O = samopen(argv[4], "wb", I->header);
  if(!O) {
    printf("Can not open sam output %s\n", argv[4]);
    exit(-1);
  }

  bam1_t *b = bam_init1();

  while (samread(I, b) >= 0) {
    int l = len_func(b);
    if(len != len_func(b)) continue;
    samwrite(O, b);
  }
  samclose(I);
  samclose(O);
  bam_destroy1(b);
  exit(0);
}
ADD COMMENTlink written 7.0 years ago by Sven Bilke70

--Hi,

nice usefull code, is there a way to add filters on chromosome and range such as: %s len mode chr range in.bam out.bam

thx

ADD REPLYlink written 10 months ago by lmanchon0
3
gravatar for Ido Tamir
7.0 years ago by
Ido Tamir4.9k
Austria
Ido Tamir4.9k wrote:

Same as Pierres in Scala.

save as filter.scala

start with:

sh filter.scala in.bam out.bam 60

Because the problem is not really complex, it does not show off scalas benefits, apart from being able to call it as a script.

#!/bin/sh
cp='sam-1.50.jar;picard-1.50.jar'
exec scala -classpath $cp "$0" "$@"
!#

import scala.collection.JavaConverters._
import net.sf.samtools._
import net.sf.picard.filter._
import net.sf.picard.io.IoUtil
import net.sf.samtools.util.CloserUtil
import net.sf.samtools.SAMFileReader.ValidationStringency
import java.io.File

class SamFilteredCopy(inName: String, outName: String, filterOut: SAMRecord => Boolean) {
     val input = new SAMFileReader(new File(inName))
     val output = if(outName == "stdout"){
         new SAMFileWriterFactory().makeSAMWriter(input.getFileHeader, false, System.out)
     }else{
         val outFile = new java.io.File(outName)
         IoUtil.assertFileIsWritable(outFile)
         new SAMFileWriterFactory().makeSAMOrBAMWriter(input.getFileHeader, false, outFile)
     }
     input.setValidationStringency(ValidationStringency.SILENT)
     for(record <- input.iterator.asScala if filterOut(record)){
         output.addAlignment(record)            
     }  
     CloserUtil.close(input)
     CloserUtil.close(output)
}

def filter(length: Int)(record: SAMRecord): Boolean = {
    record.getAlignmentEnd - record.getAlignmentStart + 1 > length 
}

if(args.length < 3){
    error("need three arguments: inName, outName (stdout), size")
}
new SamFilteredCopy(args(0),args(1),filter(args(2).toInt) _)
ADD COMMENTlink written 7.0 years ago by Ido Tamir4.9k

Love the scala use in bioinformatics. Wish it was more prevalent!

ADD REPLYlink written 6.2 years ago by mylons130
1
gravatar for Ketil
7.0 years ago by
Ketil3.9k
Germany
Ketil3.9k wrote:

I think this depends on the aligner you use. E.g. bwa will report (AFAICS) the full sequence, and use soft clipping at the ends, while gsMapper will use hard clipping, and report only the matched sequence.

I think you will have to parse the CIGAR field and count (sum up) the number of M's.

ADD COMMENTlink written 7.0 years ago by Ketil3.9k

I used bowtie with -S option and then samtools to convert sam to bam

ADD REPLYlink written 7.0 years ago by Nicolas Rosewick6.6k

You can't just count the M operators - at very least you must also include the X and = characters (mismatch and match), which are a more explicit alternative to M (match or mismatch). In fact you probably should count the I operations too.

It might be simpler to take the read's sequence and deduct any soft trimming (the number of S operations in the CIGAR string)?

ADD REPLYlink written 7.0 years ago by Peter5.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1534 users visited in the last hour