Question: Recreate reference fasta from bam file
1
gravatar for Chris Miller
18 days ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

To make a long story short, I have a bam file that is aligned to the human reference plus a specific contig (human gene + viral vector introduced into a mouse).

The fasta file that was used for alignment got inadvertently deleted, but I have good coverage across the contig. Is there a straightforward way to use the reads aligning to that contig to recreate the contig's fasta file? (All the info should be there!)

reference bam fasta • 260 views
ADD COMMENTlink modified 18 days ago by h.mon15k • written 18 days ago by Chris Miller19k

Interesting question! I'd love to see what people think of this!

ADD REPLYlink written 18 days ago by Ram15k

reformat.sh in=your.bam out=fasta.fa (from BBMap suite) will create a fasta file. I am not sure if it will be the original reference you wish to recover. You could test with a small(ish) example.

ADD REPLYlink written 18 days ago by genomax48k

That gives each read in fasta format, not the entire contig.I guess that could be assembled in subsequent steps or something..

ADD REPLYlink written 18 days ago by Chris Miller19k

You are right. Sorry a mental lapse on my part.

Do you have the index for what ever aligner that was used? I wonder if there is a way to recreate the fasta from it.

ADD REPLYlink modified 18 days ago • written 18 days ago by genomax48k

Nope - the directory with the fasta and the indices was nuked. FWIW, I'm pretty sure that I can recreate this fastq by digging up the vector and gene sequences and spending an hour or two doing manual surgery. It just seems like there oughta be a better way!

ADD REPLYlink written 18 days ago by Chris Miller19k
2
gravatar for microfuge
17 days ago by
microfuge800
microfuge800 wrote:

Although reassembly seems the preferred option I thought if Pilon which polishes a reference with standard illumina reads could work . Just for fun I tried following with an ecoli MG1655 bam file.

  • Got the length and name of reference chromosome from bam header (in this case only one)
  • Created a fake reference with the same name and same number of nucleotides (but all nucleotides were A ; I tried with N and it did not work)
  • Then use pilon and gave the input bam and the fake reference (composed of all A nucleotides) ( I gave as a single end bam but paired should produce better results)

    • Pilon generated a fasta (pilon.fasta default name)
    • Ran blast2sequence (NCBI;Online) with actual ecoli MG1655 genome)
    • Following were the results (99% coverage and 99% identity)
    • Dot Plot (https://ibb.co/ifFzjS)

two - Alignment details (https://ibb.co/hNCejS) one

ADD COMMENTlink modified 17 days ago by genomax48k • written 17 days ago by microfuge800

Thanks for that test. May be useful to @Chris.

Problem is one can't be 100% sure (which you have confirmed) that you recovered the exact fasta genome used originally. But this is close enough.

ADD REPLYlink modified 17 days ago • written 17 days ago by genomax48k

really nice solution!

ADD REPLYlink written 16 days ago by Chris Miller19k
1
gravatar for h.mon
18 days ago by
h.mon15k
Brazil
h.mon15k wrote:

The easiest I can think of is extract the reads from the region and reassemble, but this would recreate a "wrong" reference.

Another option would be to recreate the region using the CIGAR field. Of course, this has already been implemented by Pierre a long time ago - the surprise here is it isn't java. This implementation apparently works for simple CIGAR.

There is another thread discussing the issue ( Is it possible to reconstruct alignment from CIGAR and MD strings alone? ), where Istvam mentions a thread announcing sam2pariwise - maybe it will work for you? This thread indicates that the CIGAR + MD are required for unambiguous reference reconstruction.

ADD COMMENTlink modified 18 days ago • written 18 days ago by h.mon15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 652 users visited in the last hour