How to systematically check if a bam file is truncated
4
6
Entering edit mode
5.9 years ago
jonessara770 ▴ 240

Hi,

I have downloaded some specific regions from public database bam files. I would like to check if downloaded bam files are ok or truncated before transforming them to fastq files.

I use samtools to check the bam files. As I have several bam files, I would like to systematically check them and output the ones that are truncated... is there any script for checking the bam files in a systematic way?

Thanks S

Exomeseq • 24k views
4
Entering edit mode

Picard ValidateSamFile (can be used with BAM).

0
Entering edit mode

samtools index should throw an error in case the file is truncated.

23
Entering edit mode
5.4 years ago
Kramdi ▴ 250

There is also samtools quickcheck, I think it's exactly what you need. They provide this very nice example where several bam files can be tested and the truncated ones are written into a file:

samtools quickcheck -v *.bam > bad_bams.fofn   && echo 'all ok' || echo 'some files failed check, see bad_bams.fofn'

0
Entering edit mode

I was using python script to check eol. But this is so much better and cleaner. This is exactly what I needed.

1
Entering edit mode
4.9 years ago
msimmer92 ▴ 300

I don´t know why the samtools quickcheck did not work in my server (it runs but I get nothing in return), but samtools flagstat .bam did. So I will post it here as an alternative way. This gives you some summary statistics about the reads in your file and if you see weird numbers you could spot some problems (for instance, seeing zero reads in all fields could be due to a failure in your process that produced an empty sam/bam file).

samtools flagstat filename.bam


Observation: this accepts both sam and bam for imput, so if you want to check sam files as well you can use it :) .

2
Entering edit mode

I believe quickcheck is silent if everything is OK.

1
Entering edit mode

Oh, that is an important observation. I did not know. Thank you !

0
Entering edit mode
5.9 years ago

A truncated bam file will NOT produce this result when provided as an input to file command:

>file your.bam
your.bam: gzip compressed data, extra field

0
Entering edit mode

Hi Vijay,

I don't think this is correct. A truncated BAM is still recognized by file command as compressed data (file command only determines the file's type, it does not check for EOF markers)

0
Entering edit mode

Kramdi, agree that this is not the standard way of checking the integrity of a bam file. But a preliminary way to check the compression (very trivial and generic). Many a times you end with a pseudo bam file which is not compressed yet has the extension ".bam". In that case, this will help.

Definitely samtools quickcheck is so far the best way

0
Entering edit mode
4.1 years ago
msimmer92 ▴ 300

Probably you already checked your file and moved on, but I want to come back to this issue because I have read in more depth the samtools quickcheck manual and I have an objection.

Here you have what the manual says about how it works:

"""Quickly check that input files appear to be intact. Checks that beginning of the file contains a valid header (all formats) containing at least one target sequence and then seeks to the end of the file and checks that an end-of-file (EOF) is present and intact (BAM only).

Data in the middle of the file is not read since that would be much more time consuming, so please note that this command will not detect internal corruption, but is useful for testing that files are not truncated before performing more intensive tasks on them.""""

For example, I have a BAM file that gives no error with samtools quickcheck, but when I see it with samtools flagstat it tells me it's empty (no reads, no nothing, when it's supposed to have the information of 32939748 reads). This is an example of a case where samtools quickcheck can mislead you into thinking there's no problem with your BAM/SAM when there's actually one (you could eventually test it yourself to see what I mean). Probably it's because the BAM still has the parts that samtools quickcheck checks, therefore giving a positive result there.

With all that being said, I recommend Picard Tools's "ValidateSamFile" (works for BAM as well, helped me with my problem). https://software.broadinstitute.org/gatk/documentation/article.php?id=7571. I think it's more comprehensive than the other methods, since it was designed to tackle down this particular issue. In the old syntax this is (check the manual for the new, but this is what I used):

java -jar /path/to/picard.jar ValidateSamFile I= myfile.bam MODE=SUMMARY


Coming back to the example of my problem, ValidateSamFile told me there was a problem: ERROR:MISSING_READ_GROUP. I converted my BAM to SAM and this was the content:

@HD VN:1.3  SO:coordinate


In my script, where this file was generated, the reads were mapped, the sam was produced correctly but there was a mistake in the convertion to BAM. When I checked the BAM with quickcheck, it had the beginning and end-of-file, therefore it gave an answer as if the file was ok, which is wrong. Now, thanks to ValidateSamFile and samtools flagstat I know my script has a problem, but if I had only checked with quickcheck I would think the files are great and I would have continued with my pipeline without further questioning, until hitting myself with the wall, eventually. I also ran samtools quickcheck on this new sam file and also did not output anything (told me everything was ok). So far, then, samtools flagstat helped me visualize that there were no reads, while ValidateSamFile did a more thorough analysis of the errors (plus also gives you warnings, has a long list of things it checks). Good luck!

Edit: I saw that @genomax commented "use Picard ValidateSamFile" without making it an answer, so maybe it passes unattended to readers. I support that answer, but now at least I've registered a more thorough explanation of why that method is better than the other suggested ones.