Recreate reference fasta from bam file
2
1
Entering edit mode
4.1 years ago

To make a long story short, I have a bam file that is aligned to the human reference plus a specific contig (human gene + viral vector introduced into a mouse).

The fasta file that was used for alignment got inadvertently deleted, but I have good coverage across the contig. Is there a straightforward way to use the reads aligning to that contig to recreate the contig's fasta file? (All the info should be there!)

bam fasta reference • 2.1k views
0
Entering edit mode

Interesting question! I'd love to see what people think of this!

0
Entering edit mode

reformat.sh in=your.bam out=fasta.fa (from BBMap suite) will create a fasta file. I am not sure if it will be the original reference you wish to recover. You could test with a small(ish) example.

0
Entering edit mode

That gives each read in fasta format, not the entire contig.I guess that could be assembled in subsequent steps or something..

0
Entering edit mode

You are right. Sorry a mental lapse on my part.

Do you have the index for what ever aligner that was used? I wonder if there is a way to recreate the fasta from it.

0
Entering edit mode

Nope - the directory with the fasta and the indices was nuked. FWIW, I'm pretty sure that I can recreate this fastq by digging up the vector and gene sequences and spending an hour or two doing manual surgery. It just seems like there oughta be a better way!

2
Entering edit mode
4.1 years ago
microfuge ★ 1.9k

Although reassembly seems the preferred option I thought if Pilon which polishes a reference with standard illumina reads could work . Just for fun I tried following with an ecoli MG1655 bam file.

• Got the length and name of reference chromosome from bam header (in this case only one)
• Created a fake reference with the same name and same number of nucleotides (but all nucleotides were A ; I tried with N and it did not work)
• Then use pilon and gave the input bam and the fake reference (composed of all A nucleotides) ( I gave as a single end bam but paired should produce better results)

• Pilon generated a fasta (pilon.fasta default name)
• Ran blast2sequence (NCBI;Online) with actual ecoli MG1655 genome)
• Following were the results (99% coverage and 99% identity)
• Dot Plot (https://ibb.co/ifFzjS)

- Alignment details (https://ibb.co/hNCejS)

0
Entering edit mode

Thanks for that test. May be useful to @Chris.

Problem is one can't be 100% sure (which you have confirmed) that you recovered the exact fasta genome used originally. But this is close enough.

0
Entering edit mode

really nice solution!

1
Entering edit mode
4.1 years ago
h.mon 34k

The easiest I can think of is extract the reads from the region and reassemble, but this would recreate a "wrong" reference.

Another option would be to recreate the region using the CIGAR field. Of course, this has already been implemented by Pierre a long time ago - the surprise here is it isn't java. This implementation apparently works for simple CIGAR.

There is another thread discussing the issue ( Is it possible to reconstruct alignment from CIGAR and MD strings alone? ), where Istvam mentions a thread announcing sam2pariwise - maybe it will work for you? This thread indicates that the CIGAR + MD are required for unambiguous reference reconstruction.