Question: Is my file created completely
0
gravatar for micro32uvas
2.6 years ago by
micro32uvas10
Pakistan
micro32uvas10 wrote:

Hello,

I ran a command for the generation of SAM file from my fastq data. Some one just turned my system off. Its >70GB .sam file created. How can i Know that the sam file is created completely.

the file isn't corrupt and opens up. but I am afraid it might not be completed, since the terminal is also closed.

Your help is desired here...

Regards,

bwa mem sam file • 791 views
ADD COMMENTlink modified 2.6 years ago by h.mon28k • written 2.6 years ago by micro32uvas10
3

I naively assume that your aligner processes reads sequentially as they are present in the fastq file. If so, you could check if the last read in the fastq file is present in the sam file.

But to be entirely sure safest would be to rerun it.

Also: if possible, avoid the creation of a .sam file and directly create a (sorted) .bam That will save in time/memory/intermediate files.

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k
1

Another advantage of BAM Files is that they have got 28-byte EOF-signature at the end. So any truncation of bamfiles can be easily detected.

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k

Its just sam created so far. I've pipelined bam into it

ADD REPLYlink written 2.6 years ago by micro32uvas10

Its just sam created so far

In fact, that's the point. You should always create (sorted) bam directly by using pipes. Bam files are more easy to handle, easy to detect corruption and take way much lesser space.

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k

Some one just turned my system off.

One cannot take things for granted in some parts of the world.

ADD REPLYlink written 2.6 years ago by genomax75k

But, if it was written to bam, OP would easily find out if it's truncated or not, right?

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

With Bamfile, of course yes. But am not sure if this was your Q?!

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k

Its driving me crazy, but yes worst case scenerios are always there at the door step just exactly when you set to go all smooth. Biostars is a great blessing though!

ADD REPLYlink written 2.6 years ago by micro32uvas10

That is one good suggestion. I checked it and looks like sam is created sequenctially and tailing the both fastq and sam file reveals same headers. Apparently it seems as it it ran completely. But I would definitely rerun it rather assume about it completion.

Thanks a ton!

ADD REPLYlink written 2.6 years ago by micro32uvas10

Hi Wouter,

Could you guide me to the command for Bam file creation directly, instead of sam file. I wantedto do it directly in a single command with pipelines but it wont work that way.

ADD REPLYlink written 2.6 years ago by micro32uvas10
1
what_ever_is_producing_sam | samtools view -bS > my.bam
ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k
1

In addition to the general answer from Santosh, an example for bwa (also doing sorting of bam):

bwa mem genome.fa reads.fastq | samtools sort -O BAM -o output.bam -

Note the final - in the command. That's not a typo. That means that samtools has to read from stdin.

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

Does this only work in single thread mode?

ADD REPLYlink written 2.6 years ago by genomax75k

Sorry wrong comment earlier. Deleted.. Why it should not work in single-threads?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Santosh Anand5.0k

You can run both bwa and samtools with multiple threads in this command.

bwa mem -t 48 genome.fa reads.fastq | samtools sort -@ 20 -O BAM -o output.bam -

Adjust accordingly to your available infrastructure.

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

With a construct like this I don't know if the sort does not start until the alignments are complete since sorting is not happening in memory and essentially random alignment data is streaming in from multiple threads. Perhaps there is some programming trick that makes this more efficient to use as a pipe that I do not know about.

I sort my BAM files after they are created.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k
1

I don't know if the sort does not start until the alignments are complete

Not the case. Sorting several partially sorted data is cheaper than completely random data.

Sorting start as soon as it has some data (may be there is a minimum threshold). This you can check by running the above command and monitoring the process by top. You will see that sorting threads top the memory/cpu consumption many times during the whole alignment process.

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k

I'm not sure about how efficient this is, but it avoids intermediate files. Would be interesting to benchmark...

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k
2

I mostly use bbmap which writes BAM file directly so there are no intermediate files in that sense.

I am going to test to see if there is actually a huge benefit to piping. The benefit, if any, will only be there when you have a corresponding high performance storage subsystem. Doing this on a single CPU machine with local disks is going to easily saturate PCI-E bus.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k

Would be very much interested to know the results of your test

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k
1

A small test (2433680 reads, 60983363 bases) of single-end miRNA sequence data aligned to GRCh38 genome using BBMap (v. 37.22) and Samtools (v. 1.4.0).

        1. BBMap to generate a BAM (16 threads) 
        real    10m34.955s
        user    158m34.678s
        sys     0m39.217s

        2. Sort the BAM file (16 threads)
        real    0m3.928s
        user    0m13.162s
        sys     0m0.525s

        3. BBMap generated BAM piped into Samools sort (16 threads each)
        real    9m53.137s
        user    156m5.185s
        sys     0m32.050s

        4. BBMap generated SAM piped into Samools sort (16 threads each)
        real    11m15.153s
        user    170m3.938s
        sys     0m35.918s
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k

Oh quite nice improvement.

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

Comment with an implied /s?

Just trying to make sure :)

ADD REPLYlink written 2.6 years ago by genomax75k

Well it's 5% time profit, so probably not too bad on bigger datasets (assuming it's a linear gain). But for me the main reason is having no lingering intermediate files.

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k
1

With BBMap there are no intermediate files (unless you are counting the intermediate files produced by samtools during sort).

Here is another test with a bigger dataset. Genomic sequencing (21802852 reads, 3179327811 bases) aligning to 20 bacterial genomes.

1. BBMap to generate a BAM (16 threads)
real    6m13.342s
user    106m57.934s
sys     0m21.866s

2. Sort the BAM file (16 threads)
real    1m9.774s
user    5m50.328s
sys     0m20.268s

3. BBMap generated BAM piped into Samools sort (16 threads each)
real    6m51.262s
user    102m12.606s
sys     0m32.568s

4. BBMap generated SAM piped into Samools sort (16 threads each)
real    7m15.386s
user    105m30.430s
sys     0m41.786s
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k

Your unsorted bam is an intermediate file, no?

Slightly unrelated: is there a benchmark comparison of bbmap vs bwa by chance for speed, memory, accuracy?

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

I don't think of it that way but you are correct.

ADD REPLYlink written 2.6 years ago by genomax75k

So, there is not much difference. However, it's difficult to make a direct comparison here because your BBmap is producing Bam, which is a compressed file-format. Sorting it needs decompressing, then sorting, then re-compressing. If a program produces sam, that is already uncompressed, you would be better off direct sorting it, and converting to sorted bam. In this way, you will save time which is for 1) compressing/converting the initial sam to bam 2) decompressing the bam for sorting.

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k
1

We are on the verge of hijacking this thread so this is my last addition.

BBMap can produce SAM files (by just changing out=output.sam instead of out=output.bam) and to address your point I added the stat for that method (BBMap SAM in to samtools sort) in the post above. It takes an additional 25 seconds by that method.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k

Here's what I have been working on: But somehow this isnt working well on bam filling

#!/bin/bash
bwa mem -t 48 -M ref.fa data_1.fastq data_2.fastq >mini-data.sam
samtools view -b -S -F 2308 -t 48 data.sam >data_unsorted.bam
samtools sort data_unsorted.bam data| samtools index data.bam
ADD REPLYlink modified 2.6 years ago by WouterDeCoster42k • written 2.6 years ago by micro32uvas10

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

No viewing? no samtools view command? In that case we are not indexing either. Additionally the bam created here, would be sorted right?

ADD REPLYlink written 2.6 years ago by micro32uvas10

The output of bwa is sam format, which goes directly into samtools sort and gets outputted as bam. The resulting bam can be indexed (but not using pipes!).

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k
**bwa mem -t 48 genome.fa reads.fastq | samtools sort -@ 20 -O BAM -o output.bam -**

It didnt work, following are the errors: sort: invalid option -- 'O' open: No such file or directory [bam_sort_core] fail to open file BAM [M::bwa_idx_load_from_disk] read 0 ALT contigs

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by micro32uvas10

I guess your samtools version is outdated.

ADD REPLYlink written 2.6 years ago by WouterDeCoster42k

Generally, there is no need for -O BAM for bam (default). From samtools manual:

By default, samtools tries to select a format based on the -o filename extension; if output is to standard output or no format can be deduced, bam is selected.

ADD REPLYlink written 2.6 years ago by Santosh Anand5.0k
1

If a process did not complete cleanly you should consider re-running it again. Time saved now may come back to haunt you later.

ADD REPLYlink written 2.6 years ago by genomax75k

there isn't anyway to check the process??? any command? anything?

ADD REPLYlink written 2.6 years ago by micro32uvas10

You could check to see how far the alignments got (based on reads that are in the SAM file). Remove any corrupt records at the end of the file (you may have to use sed creatively since you can't open a 70G file in an editor). You could then start a new alignment job with the subset of reads that were not originally aligned. Then finally merge the two alignment files.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k
0
gravatar for h.mon
2.6 years ago by
h.mon28k
Brazil
h.mon28k wrote:

You may try BamHash. It may be faster and/or safer to just map again, though.

ADD COMMENTlink written 2.6 years ago by h.mon28k

There was no BAM file in this case and it is not clear if BamHash can be used with SAM files (though it can be used with FASTQ files).

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax75k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 716 users visited in the last hour