Question: Remove all reads failing in ValidateSamFile
3 months ago by danny
danny wrote:

My bam has a lot of bad reads that cause it to fail the GATK. I would like to remove them. How can I programmatically remove the reads identified by ValidateSamFile as causing errors?

sam bam gatk • 211 views
3 months ago by dariober (Glasgow - UK)
dariober wrote:

Assuming you are happy to discard the failed reads rather than correct them, you could set the MAX_OUTPUT option to a large value so as to get a list of all the failed records. If I'm not mistaken, you get the record position in the file, like this (example from here):

ERROR: Record 1, Read name 20FU...
ERROR: Record 3, Read name 20FU...
ERROR: Record 6, Read name 20GA...

Then pass through the file again and discard the failing records. This may require writing a little script that parses the output of ValidateSamFile to get the record numbers to discard (1, 3, 6, ... in the example above) and then reads and writes the bam file excluding the records at those positions. (Maybe there is an off-the-shelf tool for all this...)
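A minimal sketch of such a script, in plain Python with no extra libraries. It assumes the report lines look like the example above ("ERROR: Record N, ...") and that record numbers count alignment records from 1 in file order; both are assumptions I have not checked against every Picard version. The function names are made up for illustration. Report lines that don't match the pattern are ignored.

```python
import re

# Assumed report line shape, taken from the example above:
#   ERROR: Record 3, Read name 20FU...
RECORD_RE = re.compile(r"^ERROR: Record (\d+),")


def failed_record_numbers(report_lines):
    """Collect the record numbers reported on ERROR lines."""
    failed = set()
    for line in report_lines:
        match = RECORD_RE.match(line)
        if match:
            failed.add(int(match.group(1)))
    return failed


def drop_records(sam_lines, failed):
    """Yield SAM text lines, skipping alignment records listed in `failed`.

    Header lines start with '@' and are passed through without being
    counted. Alignment records are counted from 1 (an assumption about
    how the validation report numbers them).
    """
    record_number = 0
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        record_number += 1
        if record_number not in failed:
            yield line
```

To use it, you could read the saved report with failed_record_numbers(open("report.txt")), then call sys.stdout.writelines(drop_records(sys.stdin, failed)) and run the script between "samtools view -h input.bam" and "samtools view -b -o cleaned.bam -". The file names here are placeholders.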

If you have paired-end reads, discarding only one read of a pair leaves its mate orphaned, which in turn may leave the bam file still invalid. I'm not sure whether samtools fixmate can fix that.
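One way around the orphaned-mate problem might be to filter by read name instead of by position, so that every record sharing a failed name (the read, its mate, any secondary alignments) is dropped together. Again just a sketch in plain Python: it assumes the report lines have the form "ERROR: Record N, Read name NAME, message", that read names contain no commas or whitespace, and the function names are hypothetical.

```python
import re

# Assumed report line shape: "ERROR: Record <n>, Read name <name>, <message>"
NAME_RE = re.compile(r"^ERROR: Record \d+, Read name ([^,\s]+)")


def failed_read_names(report_lines):
    """Collect the read names mentioned on ERROR lines of the report."""
    names = set()
    for line in report_lines:
        match = NAME_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def drop_reads_by_name(sam_lines, names):
    """Yield SAM text lines, skipping every alignment whose QNAME is in `names`.

    Header lines (starting with '@') are always kept. QNAME is the first
    tab-separated field of an alignment line.
    """
    for line in sam_lines:
        if not line.startswith("@") and line.split("\t", 1)[0] in names:
            continue
        yield line
```

This discards more data than the position-based approach, since a valid mate is thrown away along with its failing partner, but the remaining pairs stay complete.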

But again, in practice it may be easier and better to recreate the bam files without broken records in the first place...

3 months ago by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087)
Pierre Lindenbaum wrote:

Using samjdk, which keeps only the records for which the expression returns true: http://lindenb.github.io/jvarkit/SamJdk.html

  java -jar  samjdk.jar -e 'List<SAMValidationError> errors = record.isValid(false);return (errors==null || errors.isEmpty());' input.bam

Or you can ask GATK to be lenient with validation errors. I think the option is -S LENIENT.


Hi Pierre, this is great - can you give an example of what <SAMValidationError> is supposed to look like? And can this tool also remove the mate of a read that is failing?

Also, with regards to another question, could one use this tool to remove reads where the read ID occurs more than twice? I have some legacy bams with bad formatting I am trying to work with. Thanks!

written 3 months ago by danny

"can you give an example of what <SAMValidationError> is supposed to look like?"

<SAMValidationError> is not a placeholder but a concrete Java class: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/SAMValidationError.java

"Also, with regards to another question,"

Ask this as a new question. Search Biostars first to see whether it has been asked before.

written 3 months ago by Pierre Lindenbaum