BWA not recognizing T2T-CHM13 index file
1
0
Entering edit mode
3.2 years ago
abbey ▴ 210

I am re-aligning the same .fastq files to both hg38 and CHM13 to compare the alignments. My hg38 alignment runs fine but CHM13 is getting this error: [E::bwa_idx_load_from_disk] fail to locate the index files

Code:

/app/software/BWA/0.7.17-GCC-8.3.0/bin/bwa mem -t 4 -M -R @RG\tID:no_id\tLB:no_library\tPL:no_platform\tPU:no_unit\tSM:test /grp/reference/T2T-CHM13/RefSeq_v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna results/fastqs/test_fastq1.fq.gz results/fastqs/test_fastq2.fq.gz

We have tried reindexing with bwa index GCF_009914755.1_T2T-CHM13v2.0_genomic.fna multiple times and received the same error. A colleague in a different group has reported the same issues with using BWA to align to CHM13.

I assume this is a formatting issue but I don't know what the incompatibility is. My understanding is that .fa and .fna are equivalent formats, and BWA builds a .fai from our .fna without throwing any errors. CHM13 is 3.0 GB total so the chromosomes should not be over the 2 GB limit. Any suggestions on how to trace the source of this issue?

Note: I've also submitted an issue on the BWA github page. Please let me know if there's a more appropriate forum for me to ask this!

CHM13 BWA T2T alignment • 5.1k views
ADD COMMENT
1
Entering edit mode

bwa github is basically dead, maybe an ambitious user gives you an answer but don't wait for the developer.

Can you share the logs of the indexing and the output of ls /grp/reference/T2T-CHM13/RefSeq_v2.0/.

The suffix of the file has no meaning, bwa will happily index a file named football.123uga if it is in proper fasta format, so that cannot be the issue. My best guess is that some memory shortage got the indexing job killed without that it got spotted while maybe some corrupted files have already been produced.

ADD REPLY
0
Entering edit mode

That's about what I expected re: github.

I don't think we have any logs saved from the indexing but I can rerun it tonight. Here's the reference directory:

athorpe@rhino02:/fh/fast/ha_g/grp/reference/T2T-CHM13/RefSeq_v2.0$ ls -lh
total 9.0G
-rw-r----- 1 rpatton ha_g_grp 3.0G Jul 13 14:17 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
-rw-rw---- 1 rpatton ha_g_grp   16 Jul 25 17:21 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.amb
-rw-rw---- 1 rpatton ha_g_grp 2.7K Jul 25 17:21 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.ann
-rw-rw---- 1 rpatton ha_g_grp 3.0G Jul 25 17:21 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.bwt
-rw-rw---- 1 rpatton ha_g_grp  915 Jul 14 15:22 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.fai
-rw-rw---- 1 rpatton ha_g_grp 744M Jul 25 17:21 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.pac
-rw-rw---- 1 rpatton ha_g_grp 1.5G Jul 25 17:42 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.sa
drwxrws--- 2 rpatton ha_g_grp   90 Jul 13 15:00 ncbi-genomes-2022-07-13

Edit: Accidentally posted as separate reply initially whoops

ADD REPLY
0
Entering edit mode

Just to be sure, in your comment the path is:

/fh/fast/ha_g/grp/reference/T2T-CHM13/RefSeq_v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna

and in the toplevel bwa mem command it was:

/grp/reference/T2T-CHM13/RefSeq_v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna

...is it a typo that /fh/fast/ha_g/ is missing?

ADD REPLY
0
Entering edit mode

Fair question. I initially truncated the path in my post for privacy reasons but then was too lazy to remove it for the full ls copy-paste.

We were originally running this all through our usual snakemake pipeline so path integrity is one issue we can disregard with a fair amount of confidence. Snakemake would throw an error for a file not existing before the alignment job was even submitted to our HPC.

ADD REPLY
0
Entering edit mode

Oh yes, a memory issue sounds plausible. I'm not the one who's been running the indexing so I'm not sure if they did it on our HPC or locally.

I'm making a fresh /dir/ and deleting the current .fai to eliminate possible silly mistakes. Will update with log files when I have them. Thanks AT :)

ADD REPLY
0
Entering edit mode

Update: I ran the indexing and didn't have any errors but it didn't output a new .fai file either. It did seemingly update/overwrite the other files though.

Maybe me deleting the old .fai file was a problem? I deleted it in case the issue was BWA not wanting to overwrite the existing .fai file (which was only 915 bytes so definitely not a full index file)

Here's ls -lh again:

total 5.7G
-rw-r----- 1 athorpe ha_g_grp 3.0G Aug  2 16:38 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
-rw-rw---- 1 athorpe ha_g_grp   16 Aug  2 17:50 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.amb
-rw-rw---- 1 athorpe ha_g_grp 2.7K Aug  2 17:50 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.ann
-rw-rw---- 1 athorpe ha_g_grp 3.0G Aug  2 17:48 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.bwt
-rw-rw---- 1 athorpe ha_g_grp 744M Aug  2 17:50 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.pac
-rw-rw---- 1 athorpe ha_g_grp 1.5G Aug  2 18:08 GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.sa
drwxrws--- 2 athorpe ha_g_grp    3 Aug  2 16:26 ncbi-genomes-2022-07-13

And here's the indexing log (I think all the version changes are for other modules in our pipeline but I'm including it in case it's relevant):

The following have been reloaded with a version change:
  1) FFTW/3.3.8-gompi-2019b => FFTW/3.3.8-gompi-2018b
  2) GCC/8.3.0 => GCC/7.3.0-2.30
  3) GCCcore/8.3.0 => GCCcore/7.3.0
  4) OpenBLAS/0.3.7-GCC-8.3.0 => OpenBLAS/0.3.1-GCC-7.3.0-2.30
  5) OpenMPI/3.1.4-GCC-8.3.0 => OpenMPI/3.1.1-GCC-7.3.0-2.30
  6) ScaLAPACK/2.0.2-gompi-2019b => ScaLAPACK/2.0.2-gompi-2018b-OpenBLAS-0.3.1
  7) XZ/5.2.4-GCCcore-8.3.0 => XZ/5.2.4-GCCcore-7.3.0
  8) binutils/2.32-GCCcore-8.3.0 => binutils/2.30-GCCcore-7.3.0
  9) foss/2019b => foss/2018b
 10) gompi/2019b => gompi/2018b
 11) hwloc/1.11.12-GCCcore-8.3.0 => hwloc/1.11.10-GCCcore-7.3.0
 12) libpciaccess/0.14-GCCcore-8.3.0 => libpciaccess/0.14-GCCcore-7.3.0
 13) libxml2/2.9.9-GCCcore-8.3.0 => libxml2/2.9.8-GCCcore-7.3.0
 14) numactl/2.0.12-GCCcore-8.3.0 => numactl/2.0.11-GCCcore-7.3.0
 15) zlib/1.2.11-GCCcore-8.3.0 => zlib/1.2.11-GCCcore-7.3.0

The following have been reloaded with a version change:
  1) GCCcore/7.3.0 => GCCcore/8.3.0
  2) XZ/5.2.4-GCCcore-7.3.0 => XZ/5.2.4-GCCcore-8.3.0
  3) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-8.3.0

To execute picard run: java -jar $EBROOTPICARD/picard.jar
[bwa_index] Pack FASTA... 23.75 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6234551002, availableWord=450685848
[BWTIncConstructFromPacked] 10 iterations done. 99999994 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999994 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999994 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999994 characters processed.
...
[BWTIncConstructFromPacked] 660 iterations done. 6158781338 characters processed.
[BWTIncConstructFromPacked] 670 iterations done. 6186419978 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6210981258 characters processed.
[BWTIncConstructFromPacked] 690 iterations done. 6232807322 characters processed.
[bwt_gen] Finished constructing BWT in 691 iterations.
[bwa_index] 2406.75 seconds elapse.
[bwa_index] Update BWT... 16.80 sec
[bwa_index] Pack forward-only FASTA... 18.11 sec
[bwa_index] Construct SA from BWT and Occ... 1030.24 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /fh/scratch/delete90/ha_g/TAN_T2T-CHM13_realignment/bwa_indexing_test/RefSeq_v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
[main] Real time: 3842.978 sec; CPU: 3495.671 sec
ADD REPLY
0
Entering edit mode

So has the error gone away after reindexing?

ADD REPLY
0
Entering edit mode

No, now I have no .fai file at all.

ADD REPLY
1
Entering edit mode

I am running indexing with the 2.0 CHM file. Will update here once the job is done.

ADD REPLY
0
Entering edit mode

Indexing completed. Here are the files produced

$ du -shc GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz*
890M    GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
512     GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz.amb
512     GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz.ann
3.0G    GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz.bwt
744M    GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz.pac
1.5G    GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz.sa
6.0G    total 

bwa mem was able to use these index files to create a SAM file. No problem like one you report. Running VN:0.7.17-r1188. Just did a simple bwa mem GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz read1.fq.gz read2.fq.gz for this test.

ADD REPLY
1
Entering edit mode

Same here. Worked for me without problem. The fai is not a bwa index file, it is from faidyx (samtools) so you don’t need that.

ADD REPLY
1
Entering edit mode

Update: I ran bwa mem on my test data last night to make sure my problem was solved. Everything went smoothly.

I mistakenly assumed that .fai was the index file for .fa (like .bam -> .bai) and was hoping that was the issue. The other files in the directory seem to be more or less unchanged, so I am still not sure what was wrong with the original indexing, or what specific error my other colleagues encountered. Maybe a clean dir and re-indexing was all we needed.

Thank you AT & Max for the advice & sanity checks :)

ADD REPLY
1
Entering edit mode
3.2 years ago
abbey ▴ 210

Conclusions: My problem was likely a memory error or other technical error during the initial indexing. Completely rerunning the indexing worked.

ADD COMMENT

Login before adding your answer.

Traffic: 2608 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6