Question

Is it possible to extract the reference sequence from the BWA index files?

3

Entering edit mode

8.2 years ago

Alex Richter ▴ 210

I have a series of BWA index files ( *.amb *.ann *.bwt, etc) but the original reference fasta sequences are not available. Assuming that it's a complete burrows-wheeler index, the complete reference should be encoded in those indices.

Does anyone know if/how the fasta can be re-extracted from them?

alignment bwa • 3.9k views

ADD COMMENT • link updated 9 months ago by evolighting ▴ 20 • written 8.2 years ago by Alex Richter ▴ 210

0

Entering edit mode

see Reference File From Bam

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

9 months ago

evolighting ▴ 20

It's been a long time, but someone asked me the same question recently, so maybe I can answer it.

Get fasta file from index could be done, as original sequeces is needed at process such as " Generate CIGAR".

bns_get_seq, bns_fetch_seq in bntseq.c is what you want;

And it could be done with scritpt,using the same logic as bwa;

check my test:

$ python3 pac2fasta.py test.fa

9324aced81ecf188650e516992749a48  test.fa.origin
9324aced81ecf188650e516992749a48  test.fa

https://github.com/evolighting/pac2fasta

ADD COMMENT • link 9 months ago by evolighting ▴ 20

Ram · Accepted Answer · 2016-02-16

6

Entering edit mode

8.2 years ago

matted 7.8k

This is an interesting question. It definitely should be possible, so I guess the question may be how to do it with existing tools and the least amount of pain.

I looked into it, and it seems that from the original bwtsw code (that bwa relies on, particularly for the indexing steps), the "packed" fasta file (pac) is the key file to use. This is supported by looking at the order of the indexing substeps of fa2pac, pac2bwt, and finally bwt2sa.

So as an informed guess I searched for "pac2fasta", and (surprisingly to me) found an existing utility in the TMAP package. I believe, based on some quick tests, that the binary format for their pac file is the same as bwa uses (and bwtsw), aside from some embedded version numbers. The only catch is that TMAP has a binary annotation file ($ref.tmap.anno) that stores the chromosome names and lengths, whereas bwa uses a plaintext annotation file ($ref.ann). So if this is really what you want to do, it looks like you'll have to hardcode things (that's what I did to test with a small single chromosome reference) or write some code to make a TMAP annotation file from a bam header or bwa .ann file (by working through tmap_refseq_write_header in tmap_refseq.c).

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by matted 7.8k

0

Entering edit mode

Thanks! Since I hopefully won't need to do this much, I'll do what you suggest, and munge in the annotations.

ADD REPLY • link 8.2 years ago by Alex Richter ▴ 210

0

Entering edit mode

Hi Alex, did you find a way to get this done? I also have the same problem. Thanks.

ADD REPLY • link 8.0 years ago by xieshaojun0621 ▴ 210

0

Entering edit mode

Hi! Did any of you succeed? I would like your help in doing so either. Thanks!

ADD REPLY • link 7.1 years ago by shiranos • 0

0

Entering edit mode

tmap .pac generate by tmap index seen to be double the size, which I think have both forward sequence and it's complementary sequence, as tmap pac2fasta print. tmap $ref.tmap.anno, I think is bwa .amb+.ann; but they do share the same way convert base to "concatenated 2-bit encoded sequence" as comment in bwa source code, bwa.c.

ADD REPLY • link 9 months ago by evolighting ▴ 20