Recreate reference fasta from bam file
2
1
Entering edit mode
4.1 years ago

To make a long story short, I have a bam file that is aligned to the human reference plus a specific contig (human gene + viral vector introduced into a mouse).

The fasta file that was used for alignment got inadvertently deleted, but I have good coverage across the contig. Is there a straightforward way to use the reads aligning to that contig to recreate the contig's fasta file? (All the info should be there!)

bam fasta reference • 2.1k views
ADD COMMENT
0
Entering edit mode

Interesting question! I'd love to see what people think of this!

ADD REPLY
0
Entering edit mode

reformat.sh in=your.bam out=fasta.fa (from BBMap suite) will create a fasta file. I am not sure if it will be the original reference you wish to recover. You could test with a small(ish) example.

ADD REPLY
0
Entering edit mode

That gives each read in fasta format, not the entire contig.I guess that could be assembled in subsequent steps or something..

ADD REPLY
0
Entering edit mode

You are right. Sorry a mental lapse on my part.

Do you have the index for what ever aligner that was used? I wonder if there is a way to recreate the fasta from it.

ADD REPLY
0
Entering edit mode

Nope - the directory with the fasta and the indices was nuked. FWIW, I'm pretty sure that I can recreate this fastq by digging up the vector and gene sequences and spending an hour or two doing manual surgery. It just seems like there oughta be a better way!

ADD REPLY
2
Entering edit mode
4.1 years ago
microfuge ★ 1.9k

Although reassembly seems the preferred option I thought if Pilon which polishes a reference with standard illumina reads could work . Just for fun I tried following with an ecoli MG1655 bam file.

  • Got the length and name of reference chromosome from bam header (in this case only one)
  • Created a fake reference with the same name and same number of nucleotides (but all nucleotides were A ; I tried with N and it did not work)
  • Then use pilon and gave the input bam and the fake reference (composed of all A nucleotides) ( I gave as a single end bam but paired should produce better results)

    • Pilon generated a fasta (pilon.fasta default name)
    • Ran blast2sequence (NCBI;Online) with actual ecoli MG1655 genome)
    • Following were the results (99% coverage and 99% identity)
    • Dot Plot (https://ibb.co/ifFzjS)

two - Alignment details (https://ibb.co/hNCejS) one

ADD COMMENT
0
Entering edit mode

Thanks for that test. May be useful to @Chris.

Problem is one can't be 100% sure (which you have confirmed) that you recovered the exact fasta genome used originally. But this is close enough.

ADD REPLY
0
Entering edit mode

really nice solution!

ADD REPLY
1
Entering edit mode
4.1 years ago
h.mon 34k

The easiest I can think of is extract the reads from the region and reassemble, but this would recreate a "wrong" reference.

Another option would be to recreate the region using the CIGAR field. Of course, this has already been implemented by Pierre a long time ago - the surprise here is it isn't java. This implementation apparently works for simple CIGAR.

There is another thread discussing the issue ( Is it possible to reconstruct alignment from CIGAR and MD strings alone? ), where Istvam mentions a thread announcing sam2pariwise - maybe it will work for you? This thread indicates that the CIGAR + MD are required for unambiguous reference reconstruction.

ADD COMMENT

Login before adding your answer.

Traffic: 1543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6