Question: How to systematically check if a bam file is truncated
gravatar for jonessara770
3.2 years ago by
jonessara770200 wrote:


I have downloaded some specific regions from public database bam files. I would like to check if downloaded bam files are ok or truncated before transforming them to fastq files.

I use samtools to check the bam files. As I have several bam files, I would like to systematically check them and output the ones that are truncated... is there any script for checking the bam files in a systematic way?

Thanks S

exomeseq • 11k views
ADD COMMENTlink modified 17 months ago by harish290 • written 3.2 years ago by jonessara770200

Picard ValidateSamFile (can be used with BAM).

ADD REPLYlink written 3.2 years ago by genomax85k

samtools index should throw an error in case the file is truncated.

ADD REPLYlink written 17 months ago by harish290
gravatar for Kramdi
2.7 years ago by
Kramdi160 wrote:

There is also samtools quickcheck, I think it's exactly what you need. They provide this very nice example where several bam files can be tested and the truncated ones are written into a file:

samtools quickcheck -v *.bam > bad_bams.fofn   && echo 'all ok' || echo 'some files failed check, see bad_bams.fofn'
ADD COMMENTlink written 2.7 years ago by Kramdi160

I was using python script to check eol. But this is so much better and cleaner. This is exactly what I needed.

ADD REPLYlink written 2.3 years ago by rushirajm0
gravatar for msimmer92
2.2 years ago by
msimmer92250 wrote:

I don´t know why the samtools quickcheck did not work in my server (it runs but I get nothing in return), but samtools flagstat .bam did. So I will post it here as an alternative way. This gives you some summary statistics about the reads in your file and if you see weird numbers you could spot some problems (for instance, seeing zero reads in all fields could be due to a failure in your process that produced an empty sam/bam file).

samtools flagstat filename.bam

Observation: this accepts both sam and bam for imput, so if you want to check sam files as well you can use it :) .

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by msimmer92250

I believe quickcheck is silent if everything is OK.

ADD REPLYlink written 2.2 years ago by WouterDeCoster44k

Oh, that is an important observation. I did not know. Thank you !

ADD REPLYlink written 2.2 years ago by msimmer92250
gravatar for lakhujanivijay
3.2 years ago by
lakhujanivijay5.1k wrote:

A truncated bam file will NOT produce this result when provided as an input to file command:

>file your.bam
your.bam: gzip compressed data, extra field
ADD COMMENTlink written 3.2 years ago by lakhujanivijay5.1k

Hi Vijay,

I don't think this is correct. A truncated BAM is still recognized by file command as compressed data (file command only determines the file's type, it does not check for EOF markers)

ADD REPLYlink written 2.7 years ago by Kramdi160

Kramdi, agree that this is not the standard way of checking the integrity of a bam file. But a preliminary way to check the compression (very trivial and generic). Many a times you end with a pseudo bam file which is not compressed yet has the extension ".bam". In that case, this will help.

Definitely samtools quickcheck is so far the best way

ADD REPLYlink written 2.3 years ago by lakhujanivijay5.1k
gravatar for msimmer92
17 months ago by
msimmer92250 wrote:

Probably you already checked your file and moved on, but I want to come back to this issue because I have read in more depth the samtools quickcheck manual and I have an objection.

Here you have what the manual says about how it works:

"""Quickly check that input files appear to be intact. Checks that beginning of the file contains a valid header (all formats) containing at least one target sequence and then seeks to the end of the file and checks that an end-of-file (EOF) is present and intact (BAM only).

Data in the middle of the file is not read since that would be much more time consuming, so please note that this command will not detect internal corruption, but is useful for testing that files are not truncated before performing more intensive tasks on them.""""

For example, I have a BAM file that gives no error with samtools quickcheck, but when I see it with samtools flagstat it tells me it's empty (no reads, no nothing, when it's supposed to have the information of 32939748 reads). This is an example of a case where samtools quickcheck can mislead you into thinking there's no problem with your BAM/SAM when there's actually one (you could eventually test it yourself to see what I mean). Probably it's because the BAM still has the parts that samtools quickcheck checks, therefore giving a positive result there.

With all that being said, I recommend Picard Tools's "ValidateSamFile" (works for BAM as well, helped me with my problem). I think it's more comprehensive than the other methods, since it was designed to tackle down this particular issue. In the old syntax this is (check the manual for the new, but this is what I used):

java -jar /path/to/picard.jar ValidateSamFile I= myfile.bam MODE=SUMMARY

Coming back to the example of my problem, ValidateSamFile told me there was a problem: ERROR:MISSING_READ_GROUP. I converted my BAM to SAM and this was the content:

@HD VN:1.3  SO:coordinate

In my script, where this file was generated, the reads were mapped, the sam was produced correctly but there was a mistake in the convertion to BAM. When I checked the BAM with quickcheck, it had the beginning and end-of-file, therefore it gave an answer as if the file was ok, which is wrong. Now, thanks to ValidateSamFile and samtools flagstat I know my script has a problem, but if I had only checked with quickcheck I would think the files are great and I would have continued with my pipeline without further questioning, until hitting myself with the wall, eventually. I also ran samtools quickcheck on this new sam file and also did not output anything (told me everything was ok). So far, then, samtools flagstat helped me visualize that there were no reads, while ValidateSamFile did a more thorough analysis of the errors (plus also gives you warnings, has a long list of things it checks). Good luck!

Edit: I saw that @genomax commented "use Picard ValidateSamFile" without making it an answer, so maybe it passes unattended to readers. I support that answer, but now at least I've registered a more thorough explanation of why that method is better than the other suggested ones.

ADD COMMENTlink modified 17 months ago • written 17 months ago by msimmer92250
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1065 users visited in the last hour