Question: Picard Matequery Slows Process To A Crawl
0
gravatar for coryprzy
5.7 years ago by
coryprzy0
coryprzy0 wrote:

I'm looking to iterate through an indexed BAM file using picard and perform various tests on both a read and it's mate. For some I would need the full SAMRecord for the mate so I can't just use the getMate() methods of the record. I can read through the file using the iterator but once I add in a line that creates the mate record the program slows to a crawl. I've tried the methods SAMFileReader methods queryMate(), query(). The rest of the query() methods are some variant of those (queryContained, queryAlignmentStart, queryOverlapping) and all end up calling the same path.

I've traced down that one of the instances of speed loss ocurs when the BAMFileSpan is created in BAMFileReader.createIndexIterator(). Also it appears that no matter what indexing regions are created, an iterator is created that must re-traverse the whole file rather than looking at the file by an offset as would be the case with samtools mpileup.

Is there a way to resolve this? Currently putting one line into my read loop changes the normal read time of about 20 seconds to not completing within hours.

A bare bones version of the code is below. As is, it should run very quickly, even still on the order of minutes for large bam files, while if you comment out either of the methods for finding the mate record, it does not finish under reasonable time scope.

import java.util.Date;

import net.sf.samtools.SAMFileHeader;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;
import net.sf.samtools.Cigar;

import org.apache.commons.io.FilenameUtils;

public class main {

public static void main(String[] args) {

    Date startDate = new Date();
    System.out.println("Start Time is " + startDate.toString());

    String baseName = FilenameUtils.getBaseName(args[0]);
    String inputPath = FilenameUtils.getFullPath(args[0]);
    // Create 2 samfile readers from the BAM file
    // note that this needs to be done twice because you cannot query while using the iterator
    SAMFileReader inputSam = null;
    SAMFileReader inputSamForQuery = null;
    try
    {
        inputSam=new SAMFileReader(new File(args[0]), new File(args[0] + ".bai"));
        inputSamForQuery = new SAMFileReader(new File(args[0]), new File(args[0] + ".bai"));
    }
    catch (NullPointerException e1)
    {
        System.out.println("bamFileIn not a valid bam file or lacking *.bai index file");
    }

    inputSam.setValidationStringency(ValidationStringency.SILENT);
    SAMRecordIterator iter=inputSam.iterator();
    SAMFileWriterFactory sf=new SAMFileWriterFactory();
    sf.setCreateIndex(false);

    SAMFileHeader inHeader = inputSam.getFileHeader();

    long counter = 0;

    while(iter.hasNext())
    {
        SAMRecord rec=iter.next();
        int matePos = rec.getMateAlignmentStart();
        int mateRefInd = rec.getMateReferenceIndex();
        String mateName = rec.getMateReferenceName();

        // SAMRecord mateRec = inputSamForQuery.queryMate(rec);
            // ..........or..............
        // SAMRecord mateRec = null;           
        // SAMRecordIterator iterMate = inputSamForQuery.query(mateName, matePos, 0, true);

        // while (iterMate.hasNext())
        // {
        //    mateRec = iterMate.next();
        //    if (mateRec.getReadName().equals(name))
        //    {
        //        break;
        //    }
        // }
        // iterMate.close();             

        Cigar recCigar = rec.getCigar();
        String recCigStr = recCigar.toString();
        // Cigar mateCigar = mateRec.getCigar();
        // String mateCigStr = mateCigar.toString();
        int recPos = rec.getAlignmentStart();
        // int matePos = mateRec.getAlignmentStart();
    }

    iter.close();
    inputSam.close();

    Date endDate = new Date();
    System.out.println("End time is " + endDate.toString());        
}
}
paired-end picard samtools • 1.7k views
ADD COMMENTlink modified 5.7 years ago by Pierre Lindenbaum119k • written 5.7 years ago by coryprzy0
0
gravatar for Pierre Lindenbaum
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

you don't need the variables baseName, inputPath, the package FilenameUtils

you don't need

new SAMFileReader(new File(args[0]), new File(args[0] + ".bai"));

the following is safer and enough:

new SAMFileReader(new File(args[0]));

you should exit your program after catch (NullPointerException e1)

if you want't the read and the mate, , I would suggest to order your BAM by NAME: the consecutive reads will be the 'reads' and the 'mate'. You can sort the BAM using picard/SortSam.jar or in your program using the SortingCollection class.

ADD COMMENTlink written 5.7 years ago by Pierre Lindenbaum119k

In the line specified, I was trying to force the user to create an indexed *.bai file before running. The correction you gave will work, but if the file isn't there, it just won't include indexing. For this case sorting the file first will work as a work-around since I'm only ever looking at the read and mate pair. But, am I to assume from the response then that using picard you cannot simply look at region and grab the reads without the overhead processing of iterating back through the file to get there?

ADD REPLYlink written 5.7 years ago by coryprzy0

As long as your trying to get the mate for the properly paired reads (mate is on the same chromosome, not too far away), your could try to buffer the reads. See my code for Reverse read direction in a bam file https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar76892.java

ADD REPLYlink written 5.7 years ago by Pierre Lindenbaum119k

Unfortunately, not necessarily; one of the conditions I'm to be looking for are discordant pairs with at least some cases where the mate is on a different chromosome. I considered two buffers, one similar to what you implemented, and one that stores the SAMRecord for those that aren't 'close enough' with detail about the mate such that every new record searches the list to see if they're a previous mate, prior to executing anything else. But not only is that convoluted but at risk of processing issues as well if the list grows large.

ADD REPLYlink written 5.7 years ago by coryprzy0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1717 users visited in the last hour