Question: Recreate reference fasta from bam file
1
gravatar for Chris Miller
3 months ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

To make a long story short, I have a bam file that is aligned to the human reference plus a specific contig (human gene + viral vector introduced into a mouse).

The fasta file that was used for alignment got inadvertently deleted, but I have good coverage across the contig. Is there a straightforward way to use the reads aligning to that contig to recreate the contig's fasta file? (All the info should be there!)

reference bam fasta • 376 views
ADD COMMENTlink modified 3 months ago by h.mon18k • written 3 months ago by Chris Miller20k

Interesting question! I'd love to see what people think of this!

ADD REPLYlink written 3 months ago by Ram17k

reformat.sh in=your.bam out=fasta.fa (from BBMap suite) will create a fasta file. I am not sure if it will be the original reference you wish to recover. You could test with a small(ish) example.

ADD REPLYlink written 3 months ago by genomax54k

That gives each read in fasta format, not the entire contig.I guess that could be assembled in subsequent steps or something..

ADD REPLYlink written 3 months ago by Chris Miller20k

You are right. Sorry a mental lapse on my part.

Do you have the index for what ever aligner that was used? I wonder if there is a way to recreate the fasta from it.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax54k

Nope - the directory with the fasta and the indices was nuked. FWIW, I'm pretty sure that I can recreate this fastq by digging up the vector and gene sequences and spending an hour or two doing manual surgery. It just seems like there oughta be a better way!

ADD REPLYlink written 3 months ago by Chris Miller20k
2
gravatar for microfuge
3 months ago by
microfuge930
microfuge930 wrote:

Although reassembly seems the preferred option I thought if Pilon which polishes a reference with standard illumina reads could work . Just for fun I tried following with an ecoli MG1655 bam file.

  • Got the length and name of reference chromosome from bam header (in this case only one)
  • Created a fake reference with the same name and same number of nucleotides (but all nucleotides were A ; I tried with N and it did not work)
  • Then use pilon and gave the input bam and the fake reference (composed of all A nucleotides) ( I gave as a single end bam but paired should produce better results)

    • Pilon generated a fasta (pilon.fasta default name)
    • Ran blast2sequence (NCBI;Online) with actual ecoli MG1655 genome)
    • Following were the results (99% coverage and 99% identity)
    • Dot Plot (https://ibb.co/ifFzjS)

two - Alignment details (https://ibb.co/hNCejS) one

ADD COMMENTlink modified 3 months ago by genomax54k • written 3 months ago by microfuge930

Thanks for that test. May be useful to @Chris.

Problem is one can't be 100% sure (which you have confirmed) that you recovered the exact fasta genome used originally. But this is close enough.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax54k

really nice solution!

ADD REPLYlink written 3 months ago by Chris Miller20k
1
gravatar for h.mon
3 months ago by
h.mon18k
Brazil
h.mon18k wrote:

The easiest I can think of is extract the reads from the region and reassemble, but this would recreate a "wrong" reference.

Another option would be to recreate the region using the CIGAR field. Of course, this has already been implemented by Pierre a long time ago - the surprise here is it isn't java. This implementation apparently works for simple CIGAR.

There is another thread discussing the issue ( Is it possible to reconstruct alignment from CIGAR and MD strings alone? ), where Istvam mentions a thread announcing sam2pariwise - maybe it will work for you? This thread indicates that the CIGAR + MD are required for unambiguous reference reconstruction.

ADD COMMENTlink modified 3 months ago • written 3 months ago by h.mon18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1619 users visited in the last hour