Convert Bam File To Fasta File
5
4
Entering edit mode
11.3 years ago

How can I convert bam file to fasta file using Biopython command? please share if some one have any idea.

conversion bam fasta biopython • 29k views
0
Entering edit mode

Hi Sazzad. Can you tell us a little more about your problem and why you are choosing to solve it using biopython?

0
Entering edit mode

It will be easy if you clear what you did so far!

14
Entering edit mode
11.3 years ago

You can do this with a combination of Biopython for writing the Fasta files and pysam for reading the BAM files:

import os
import sys

import pysam
from Bio import SeqIO, Seq, SeqRecord

def main(in_file):
out_file = "%s.fa" % os.path.splitext(in_file)[0]
with open(out_file, "w") as out_handle:
# Write records from the BAM file one at a time to the output file.
# Works lazily as BAM sequences are read so will handle large files.
SeqIO.write(bam_to_rec(in_file), out_handle, "fasta")

def bam_to_rec(in_file):
"""Generator to convert BAM files into Biopython SeqRecords.
"""
bam_file = pysam.Samfile(in_file, "rb")
seq = seq.reverse_complement()
rec = SeqRecord.SeqRecord(seq, read.qname, "", "")
yield rec

if __name__ == "__main__":
main(*sys.argv[1:])


Run it with:

% python bam_to_fasta.py your_file.bam


and the output will be in your_file.fa

0
Entering edit mode

This works well, thanks Brad. However I found a couple of reasons to consider Picard SamToFastq suggested by Steve Lianoglou instead.

SamToFastq will properly split paired-end reads into two fastq files.

SamToFastq seems to be much faster (about 1/5th the time in my informal testing).

0
Entering edit mode

One more thing to note: SamToFastq actually outputs FastQ with quality scores, while this solution outputs Fasta files (no qualities). I actually modified Brad's solution to output FastQ for my testing, since that is what I needed.

0
Entering edit mode

Lance, that makes good sense. Picard's SamToFastq is great; I even wrote a Galaxy wrapper for it in the community tool shed: http://community.g2.bx.psu.edu/tool/browse_tools?sort=name&f-state=approved&id=91f644c6d57806f8&webapp=community&operation=view_tool&f-Category.name=SAM

0
Entering edit mode

How to add sequencing quality and mapped or unmapped reads as parameters?

Thanks

11
Entering edit mode
11.3 years ago

Alternative approach using SAMtools (source: SeqAnswers)

samtools view filename.bam | \
awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta

3
Entering edit mode

The problem with this is that the original strand of the sequence is lost, as SAM always reports sequences on the REF strand. To recreate the original sequence that was aligned, one needs to interrogate the FLAG field and ask if the read was actually aligned to the "-" strand. If so, it needs to be RC'ed. See line 6 of Brad's "bam_to_rec" function above.

0
Entering edit mode

As always, the usefulness of the approach depends on your intended use. But thanks for pointing that out, I didn't think of it.

0
Entering edit mode

I created an account to up vote this!

thanks for the help.

0
Entering edit mode

In certain cases, re-generating fastq is possible with this small modification: samtools view filename.bam | awk '{OFS="\t"; print "@"$1"\n"$10"\n+\n"\$11}' >| filename.fastq I have used this successfully with bam files generated by Tophat (on unstranded data).

4
Entering edit mode
11.3 years ago

If you're willing to consider using something that's not Python-maden, Picard comes with the SamToFastq utility you can use from the command line that works on SAM and BAM files.

0
Entering edit mode
11.3 years ago
Ben Lange ▴ 190

BAM is not in the list but OpenBabel seems like a pretty good converter application. Maybe someone can come up with a plugin if BAM is common enough.

0
Entering edit mode

but it is mostly designed to handle the chemical file formats.

0
Entering edit mode

I'm learning so let's see if I understand:

*BAM is an animated graphics file format. http://iesdp.gibberlings3.net/file_formats/ie_formats/bam_v1.htm

*Fasta is a bioinformatics specific format. http://en.wikipedia.org/wiki/FASTA_format

Doesn't it make more sense to go from Fasta to BAM than the other way around? If you had a GIF file it would be pretty tough to reverse that file back to a language that described what was going on in the image. Unless there's quite a bit of metadata built into the BAM file.

0
Entering edit mode

BAM - is compressed SAM.
SAM – Sequence Alignment/Map format, in which the results of the 1000 Genomes Project will be released.
FASTA – The FASTA file format, for sequence data. Sometimes also given as FNA or FAA (Fasta Nucleic Acid or Fasta Amino Acid).

Both seem to be oriented towards sequencing data. Why would FASTA get lumped in with chemical data but not SAM?

0
Entering edit mode
4.4 years ago

And in more recent versions of samtools, you could just use samtools fasta