Memory use in indexing
4.3 years ago

Hello,

I indexed the human GRCh38 assembly using BWA with the command

bwa index -a bwtsw <file>


I used a computer cluster, and at the end the job reported a memory use of ~75 GB. My PC has 64 GB of memory.

Does this mean that if I had run the indexing at home, the PC would have crashed? (BWA aborts the computation when it cannot allocate enough memory.)

Is there a way to handle the memory usage so that the process can be carried out at home with the available resources?
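[Editor's note: one way to keep such a run from dragging the whole machine down is to cap the memory the process may allocate, so it fails fast with a clean error instead of swapping. A minimal sketch, assuming bash on Linux; the file name grch38.fa and the 60 GiB cap are placeholders:]

```shell
# Run a command under a virtual-memory cap: if it tries to allocate more,
# malloc fails and the program exits with an allocation error instead of
# pushing the whole machine into swap.
run_capped() {
  (
    ulimit -v $((60 * 1024 * 1024))  # soft limit in KiB: here 60 GiB
    "$@"
  )
}

# Example (attempted only if bwa is installed):
command -v bwa >/dev/null && run_capped bwa index -a bwtsw grch38.fa || true
```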

Thank you

Assembly software error

With 64 GB of RAM you can index the human genome with BWA 15 times over.

And by the way, a computer does not crash for this reason: if you do not have enough memory to run a program, the program will simply fail, and the computer will be fine.

From here:

Memory Requirement

With the bwtsw algorithm, 5 GB of memory is required for indexing the complete human genome sequences. For short reads, the aln command uses ~3.2 GB of memory and the sampe command uses ~5.4 GB.


So why did the cluster report a total use of 75 GB? I previously tried on my PC with 48 GB, but BWA stopped, saying that it could not allocate several GB of memory...


Which FASTA file are you indexing? What is its name, and where did you find it?


What is the size of your reference file?


It is 1 GB zipped and 59.1 GB expanded.
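[Editor's note: for anyone wanting to reproduce such numbers, a small helper; zcat and awk-free tooling here is standard on Ubuntu, and the example file name is a placeholder:]

```shell
# Print the compressed (on-disk) and uncompressed byte counts of a gzipped
# file, one per line.
fasta_sizes() {
  wc -c < "$1"       # compressed size in bytes
  zcat "$1" | wc -c  # uncompressed size in bytes
}

# e.g. fasta_sizes Homo_sapiens.GRCh38.dna.toplevel.fa.gz
```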


Something seems odd. Are you the only user on this machine with 64 GB of RAM?


Yes. Actually, I have not installed the full 64 GB yet; I still have 48 GB, and I am not sure whether the additional 16 GB would actually help. I launched the process again and memory use had risen to 29 GB at 10,000 iterations (GNOME plus Chrome at rest consume about 3 GB; the rest was BWA), but the next phase crashed:

...
[bwt_gen] Finished constructing BWT in 10271 iterations.
[bwa_index] 88277.30 seconds elapse.
[bwa_index] Update BWT... [bwt_bwtupdate_core] Failed to allocate 51074893080 bytes at bwtindex.c line 158: Cannot allocate memory


I interpret the error as meaning that the machine needs 51 GB of memory; that is why I wanted to reach 64 GB. But the multi-core computer used ~72 GB, which is certainly far above 5 GB.
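[Editor's note: the figure in the error message can be converted by hand: 51,074,893,080 bytes is about 47.6 GiB, which would indeed not fit next to the ~29 GB BWA was already holding on a 48 GB machine. A quick check in the shell:]

```shell
# Convert the failed allocation size from the error message into GiB
# (integer division; bash arithmetic is 64-bit).
bytes=51074893080
echo "$(( bytes / 1024 / 1024 / 1024 )) GiB"   # prints: 47 GiB
```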


I finally increased the computer's memory to 62.9 GiB and launched bwa index -bwtsw <file>, but this time I got another type of error:

[bwa_index] Pack FASTA... 639.11 sec
[bwa_index] Construct BWT for the packed sequence...
Floating point exception (core dumped)


What might have gone wrong this time?


I don't know how you manage memory on your cluster, but this error is still either a memory exception or a division by zero. Do you have more information about your error? A complete log file?


I am not managing it; I am simply running the same commands, following the instructions from the manual. The rest is done by the operating system (Ubuntu in this case). How would I get the log file? Those three lines were all that BWA printed out...


You are missing -a before bwtsw in your indexing command. What happens if you do bwa index -a bwtsw your_file?


Good spot. I copied the command from the original post, where I had forgotten the -a; I have re-edited the post.

Anyway, the outcome is the same:

...
[bwt_gen] Finished constructing BWT in 10271 iterations.
[bwa_index] 87812.46 seconds elapse.
[bwa_index] Update BWT...
[bwt_bwtupdate_core] Failed to allocate 51074893080 bytes at bwtindex.c line 158: Cannot allocate memory


and yet I now have 6.7538E+10 bytes available! What is in bwtindex.c?
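[Editor's note: one thing worth checking before re-running is how much memory is actually available at that moment, not just installed; on Linux, MemAvailable in /proc/meminfo is the realistic figure, since MemFree alone understates what a new allocation can use:]

```shell
# Installed versus actually-available memory (Linux; values in kB).
grep -E '^(MemTotal|MemAvailable):' /proc/meminfo
```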


Line 158 of bwtindex.c will be a line of C code that allocates memory. I don't really know why your computer is gobbling up memory like this for a task like that...


marongiu.luigi : On a different note, bwa seems to accept gzipped sequence files. I am running an indexing operation with 50 GB of RAM to see what happens.

It is odd, since hg38 should not be that much bigger than hg19 in terms of size.


You could try picking one chromosome out of your input file and indexing it, to see how your computer reacts to a smaller input file.
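[Editor's note: to follow this suggestion without extra tools, a single record can be pulled out of the multi-FASTA with awk. A sketch; the record name "21" assumes Ensembl-style headers such as ">21 dna:chromosome ...":]

```shell
# Extract one FASTA record by name: print from the matching header line
# until the next header. $1 is the FASTA file, $2 the sequence name.
extract_seq() {
  awk -v name="$2" '/^>/ { p = (substr($1, 2) == name) } p' "$1"
}

# e.g. extract_seq grch38.fa 21 > chr21.fa
```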


I can index smaller genomes with ease; I even indexed hg19. The problem arose with hg38.


BTW, a friend of mine also tried indexing GRCh38 as a check, and on his machine too BWA ate all of his forty-something GB of RAM. So we have two independent operators and two completely different machines with the same issue.


It does not work even with bwa index <file>:

...
[BWTIncConstructFromPacked] 10270 iterations done. 102149168778 characters processed.
[bwt_gen] Finished constructing BWT in 10271 iterations.
[bwa_index] 87677.42 seconds elapse.
[bwa_index] Update BWT... [bwt_bwtupdate_core] Failed to allocate 51074893080 bytes at bwtindex.c line 158: Cannot allocate memory


Could you give us a link to your file, the complete command you used, and the bwa version, please?


I already gave the location of the file: ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz. I chose this file based on the post remarking that, given the choice, one should use the Ensembl annotation. I renamed the file grch38.fa to shorten it. The BWA version is 0.7.17-r1188. The commands are either bwa index -a bwtsw grch38.fa or bwa index grch38.fa.


Did you unzip and rename or just rename?


Unzipped and renamed. Dr Heng Li also sent me this link; I am trying this new version right now. I'll post the result...


I can confirm that the reference you shared above also caused my indexing to fail (16 GB RAM). The GRCh38 reference from the blog post linked above seems to be building fine.


Unbelievable, with the new file it was a piece of cake:

[BWTIncConstructFromPacked] 680 iterations done. 6184133946 characters processed.
[bwt_gen] Finished constructing BWT in 688 iterations.
[bwa_index] 2308.57 seconds elapse.
[bwa_index] Update BWT... 20.25 sec
[bwa_index] Pack forward-only FASTA... 18.81 sec
[bwa_index] Construct SA from BWT and Occ... 1292.45 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index GRCh38.fa
[main] Real time: 3717.699 sec; CPU: 3668.031 sec


It took more or less half an hour and less than 10 GB of RAM overall. This new file from NCBI has 6,184,133,946 characters instead of the 102,149,168,778 of the Ensembl one. Is it true that, nevertheless, they carry the same information?


Have you read Heng Li's blog post? It explains what is included and what is not. Alternatively, check for yourself which contigs/chromosomes are included in a reference FASTA:

zgrep '^>' genome.fa.gz
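[Editor's note: to get a count rather than the full listing, the same idea can be wrapped up; the file name is a placeholder:]

```shell
# Count the sequences in a gzipped FASTA (one '>' header per sequence).
count_contigs() { zcat "$1" | grep -c '^>'; }

# e.g. count_contigs genome.fa.gz
```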


Yes, I did; I just wanted confirmation, since it seems incredible to me that 50 GB of information could actually be shrunk to 3 GB. Anyway, case closed: the indexing with BWA was finally done.


Further on this, is there also a VCF file associated with the NCBI reference genome? I tried ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz because it comes from the same source (NCBI), but I had problems with the headers while running GATK: A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.
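[Editor's note: a first step in diagnosing that GATK error is to compare the contig names on the two sides; they must match exactly, and a "1" versus "chr1" naming difference is enough to produce "no overlapping contigs". A sketch, assuming standard ##contig=<ID=...,length=...> lines in the VCF header; the paths are those from the posts above:]

```shell
# List contig IDs declared in a gzipped VCF header.
vcf_contigs() {
  zcat "$1" | awk -F'[=,]' '/^##contig/ { print $3 }'
}

# List sequence names in a reference FASTA (first word of each header).
fasta_contigs() {
  grep '^>' "$1" | cut -d' ' -f1 | tr -d '>'
}

# Compare the two sorted lists, e.g.:
# diff <(vcf_contigs clinvar.vcf.gz | sort) <(fasta_contigs GRCh38.fa | sort)
```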