6.1 years ago
micro32uvas ▴ 10
I ran a command to generate a SAM file from my FASTQ data, and someone turned my system off while it was running. A .sam file of more than 70 GB was created. How can I tell whether the SAM file was written completely?
The file isn't corrupt and opens, but I am afraid it might be incomplete, since the terminal was also closed.
Your help is desired here.
I naively assume that your aligner processes reads sequentially as they are present in the fastq file. If so, you could check if the last read in the fastq file is present in the sam file.
But to be entirely sure, the safest option would be to rerun it.
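A minimal sketch of that check in Python, assuming an uncompressed FASTQ with exactly four lines per record, no trailing blank lines, and an aligner that writes records in input order. The helper names (`tail_lines`, `fastq_read_name`, `last_read_matches`) are made up for illustration. Only the ends of both files are read, so it stays quick even on a 70 GB SAM.

```python
import os


def tail_lines(path, n, chunk_size=1 << 16):
    """Return the last n lines of a file by reading backwards from its end."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        data = b""
        # Once n + 1 newlines are buffered, the last n lines are complete.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            fh.seek(pos)
            data = fh.read(step) + data
    return [line.decode(errors="replace") for line in data.splitlines()[-n:]]


def fastq_read_name(header_line):
    """Extract the read name from a FASTQ header such as '@read7/1 extra'."""
    parts = header_line[1:].split()
    name = parts[0] if parts else ""
    # Assumption: a trailing /1 or /2 mate suffix is not kept in the SAM QNAME.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def last_read_matches(fastq_path, sam_path):
    """True if the last FASTQ read name equals the QNAME of the last SAM line."""
    fastq_tail = tail_lines(fastq_path, 4)
    sam_tail = tail_lines(sam_path, 1)
    if len(fastq_tail) < 4 or not fastq_tail[0].startswith("@") or not sam_tail:
        return False
    sam_qname = sam_tail[0].split("\t")[0]
    return fastq_read_name(fastq_tail[0]) == sam_qname
```

A match is a good sign but not proof: the final SAM line could still be cut off partway through its fields, so rerunning remains the safe choice.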
Also: if possible, avoid creating a .sam file and directly create a (sorted) .bam. That will save time, memory, and intermediate files.
Another advantage of BAM files is that they have a 28-byte EOF signature at the end, so any truncation of a BAM file can be easily detected.
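To illustrate the EOF signature, here is a small sketch in Python that compares the last 28 bytes of a file with the empty BGZF block defined in the SAM/BAM specification. The function name `has_bam_eof` is hypothetical.

```python
import os

# The 28-byte empty BGZF block that a complete BAM file ends with.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bam_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        # Jump to 28 bytes before the end and compare the tail.
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        return fh.read() == BGZF_EOF
```

If samtools is installed, `samtools quickcheck` performs a similar test. Either way, the marker only shows that the writer closed the file properly; it does not validate every record.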
Only the SAM has been created so far. BAM conversion comes later in my pipeline.
In fact, that's the point. You should always create a (sorted) BAM directly by using pipes. BAM files are easier to handle, make corruption easier to detect, and take far less space.
One cannot take things for granted in some parts of the world.
But, if it was written to bam, OP would easily find out if it's truncated or not, right?
With a BAM file, of course, yes. But I am not sure if this was your question?!
It's driving me crazy, but yes, worst-case scenarios are always at the doorstep just when you expect everything to go smoothly. Biostars is a great blessing though!
That is a good suggestion. I checked, and it looks like the SAM is created sequentially: tailing both the FASTQ and SAM files reveals the same read headers. It appears to have run to completion, but I will definitely rerun it rather than assume it completed.
Thanks a ton!
Could you guide me to the command for creating a BAM file directly, instead of a SAM file? I wanted to do it in a single command with pipes, but it won't work that way.
In addition to the general answer from Santosh, an example for bwa (also doing sorting of bam):
Note the final - in the command. That's not a typo: it means that samtools has to read from stdin.
Does this only work in single thread mode?
Sorry, wrong comment earlier; deleted. Why would it only work in single-thread mode?
You can run both bwa and samtools with multiple threads in this command.
Adjust accordingly to your available infrastructure.
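As a sketch of what such a multi-threaded pipe can look like when driven from Python: the file names and thread count below are placeholders, the function names are made up, and the flags assume `bwa mem` plus a samtools 1.x release whose `sort` accepts `-o`. It is not necessarily the exact command posted in this thread.

```python
import subprocess


def build_commands(reference, fastq, out_bam, threads=4):
    """Build the argument lists for 'bwa mem ... | samtools sort ... -'."""
    bwa_cmd = ["bwa", "mem", "-t", str(threads), reference, fastq]
    # The trailing "-" tells samtools sort to read its input from stdin.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bwa_cmd, sort_cmd


def run_pipeline(reference, fastq, out_bam, threads=4):
    """Stream bwa output straight into samtools sort, with no SAM on disk."""
    bwa_cmd, sort_cmd = build_commands(reference, fastq, out_bam, threads)
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort_cmd, stdin=bwa.stdout)
    bwa.stdout.close()  # so bwa sees a broken pipe if samtools exits early
    sort_rc = sorter.wait()
    bwa_rc = bwa.wait()
    if bwa_rc != 0 or sort_rc != 0:
        raise RuntimeError(f"pipeline failed: bwa={bwa_rc}, samtools={sort_rc}")
```

Calling `run_pipeline("ref.fa", "reads.fastq", "sorted.bam", threads=8)` corresponds to the shell form `bwa mem -t 8 ref.fa reads.fastq | samtools sort -@ 8 -o sorted.bam -`. Checking both exit codes matters here, because a pipe can otherwise hide a failure in the aligner.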
With a construct like this, I don't know whether the sort waits until the alignments are complete, since the sorting does not happen in memory and essentially random alignment data is streaming in from multiple threads. Perhaps there is some programming trick I do not know about that makes this more efficient as a pipe.
I sort my BAM files after they are created.
Not the case. Sorting several partially sorted chunks is cheaper than sorting completely random data.
Sorting starts as soon as it has some data (maybe there is a minimum threshold). You can check this by running the above command and monitoring the processes with top. You will see the sorting threads top the memory/CPU consumption many times during the whole alignment process.
I'm not sure about how efficient this is, but it avoids intermediate files. Would be interesting to benchmark...
I mostly use bbmap, which writes a BAM file directly, so there are no intermediate files in that sense.
I am going to test to see if there is actually a huge benefit to piping. The benefit, if any, will only be there when you have a correspondingly high-performance storage subsystem. Doing this on a single-CPU machine with local disks is going to easily saturate the PCI-E bus.
Would be very much interested to know the results of your test
A small test (2433680 reads, 60983363 bases) of single-end miRNA sequence data aligned to GRCh38 genome using BBMap (v. 37.22) and Samtools (v. 1.4.0).
Oh quite nice improvement.
Comment with an implied
Just trying to make sure :)
Well it's 5% time profit, so probably not too bad on bigger datasets (assuming it's a linear gain). But for me the main reason is having no lingering intermediate files.
With BBMap there are no intermediate files (unless you are counting the intermediate files produced by samtools during sort).
Here is another test with a bigger dataset. Genomic sequencing (21802852 reads, 3179327811 bases) aligning to 20 bacterial genomes.
Your unsorted bam is an intermediate file, no?
Slightly unrelated: is there a benchmark comparison of bbmap vs bwa by chance for speed, memory, accuracy?
I don't think of it that way but you are correct.
So, there is not much difference. However, it's difficult to make a direct comparison here because your BBMap run is producing BAM, which is a compressed file format. Sorting it requires decompressing, then sorting, then re-compressing. If a program produces SAM, which is already uncompressed, you would be better off sorting it directly and converting to sorted BAM. That way you save the time spent on 1) compressing/converting the initial SAM to BAM and 2) decompressing the BAM for sorting.
We are on the verge of hijacking this thread so this is my last addition.
BBMap can produce SAM files (by just changing out=output.bam) and to address your point I added the stat for that method (BBMap SAM into samtools sort) in the post above. It takes an additional 25 seconds by that method.
Here's what I have been working on: But somehow this isn't working well for producing the BAM file
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.
No viewing? No samtools view command? In that case we are not indexing either. Additionally, the BAM created here would be sorted, right?
The output of bwa is in SAM format, which goes directly into samtools sort and is written out as BAM. The resulting BAM can be indexed (but not using pipes!).
It didn't work. These are the errors:
sort: invalid option -- 'O'
open: No such file or directory
[bam_sort_core] fail to open file BAM
[M::bwa_idx_load_from_disk] read 0 ALT contigs
I guess your samtools version is outdated.
Generally, there is no need for -O BAM for bam (default). From samtools manual:
If a process did not complete cleanly, you should consider re-running it. Time saved now may come back to haunt you later.
Isn't there any way to check the process? Any command? Anything?
You could check to see how far the alignments got (based on the reads that are in the SAM file). Remove any corrupt records at the end of the file (you may have to use sed creatively, since you can't open a 70G file in an editor). You could then start a new alignment job with the subset of reads that were not originally aligned, and finally merge the two alignment files.
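As an alternative to sed for the "remove corrupt records" step, here is a rough Python sketch that finds where the last complete SAM record ends without loading the whole file. It assumes that only the final line can be damaged, and it treats an alignment line as complete when it ends in a newline and has the 11 mandatory tab-separated SAM fields. The function name `last_complete_offset` is made up for illustration.

```python
import os


def last_complete_offset(path, chunk_size=1 << 16):
    """Return the byte offset just past the last complete SAM line."""
    pos = os.path.getsize(path)
    data = b""
    with open(path, "rb") as fh:
        # Read backwards until the last newline-terminated line is fully buffered.
        while pos > 0 and data.count(b"\n") < 2:
            step = min(chunk_size, pos)
            pos -= step
            fh.seek(pos)
            data = fh.read(step) + data
    end = data.rfind(b"\n") + 1  # drops any unterminated tail
    if end == 0:
        return 0  # no complete line at all
    keep = pos + end
    last_line = data[:end - 1].rsplit(b"\n", 1)[-1]
    # Header lines start with "@"; alignment lines need at least 10 tabs.
    if not last_line.startswith(b"@") and last_line.count(b"\t") < 10:
        keep -= len(last_line) + 1
    return keep
```

If the returned offset is smaller than the file size, a copy of the SAM could be cut at that point (for example with `os.truncate`) before aligning the remaining reads and merging the results.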