Question: BAM format file to FASTA alignment file
gravatar for ben.ward
5.6 years ago by
United Kingdom
ben.ward40 wrote:


I've been given a reference genome and BAM file(s) which are reads aligned to that reference. This will be the first time I've done anything with BAM/SAM formats. I'd like to know how I would go about generating for each BAM file, the full sequence, using the reads in the BAM file and the reference they are aligned to, essentially resulting in a FASTA format alignment of the reference, and then each sequence represented by each BAM file. What are the considerations? For example at the top of my head, whether reads are paired end and their strandedness matters.

I'd like to know if this can be done in R / Bioconductor as I have some familiarity with it - mostly using Biostrings.



alignment sam bam R fasta • 6.3k views
ADD COMMENTlink modified 5.6 years ago by Devon Ryan96k • written 5.6 years ago by ben.ward40

I re-read your question and I think you actually want a fasta format alignment.  You actually probably don't want that.  You are new to BAM/SAM and so you might not realize yet that it would be pretty awful having it in a different alignment format.  Take a look at the actual file or the first several lines and just see what the file means before moving onto something more familiar like fasta.

e.g., samtools view file.bam | head -n 999 | column -t | less -S

ADD REPLYlink written 5.6 years ago by Lee Katz3.0k
gravatar for Devon Ryan
5.6 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

The simplest method for generating the consensus sequence is to simply use bcftools consensus. An example is given in the documentation. Obviously, you'll need to have called the variants. GATK has a similar tool.

ADD COMMENTlink written 5.6 years ago by Devon Ryan96k

In samtools v1.1, you can use samtools bam2fq.

Usage:   samtools bam2fq [-nO] [-s <outSE.fq>] <in.bam>

Options: -n        don't append /1 and /2 to the read name
         -O        output quality in the OQ tag if present
         -s FILE   write singleton reads to FILE [assume single-end]
ADD REPLYlink written 5.6 years ago by Lee Katz3.0k

I've interpreted OP as requesting the consensus sequence rather than reproducing fastq files.

ADD REPLYlink written 5.6 years ago by Devon Ryan96k

Bcftools consensus command does:

Create consensus sequence by applying VCF variants to a reference fasta file

so I think that it is far from being straightforward to use this tool for obtaining the consensus of mapped reads which does not contain bases from the reference sequence - which I believe is what OP was looking for. I'm now solving the same task (to reconstruct consensus of mapped reads and not reference sequence modified by VCF variants). I'm using samtools mpileup + bcftools:

samtools mpileup -ABuf reference.nt mapped.bam | bcftools call -cOz --pval-threshold 0.99 > mapped.vcf.gz
tabix mapped.vcf.gz 
cat reference.nt | bcftools consensus mapped.vcf.gz > mapped.fastq

... and the only consensus I can get still contains nucleotides from reference at some positions. In case you could suggest a specific commands or parameters how to use bcftools consensus to generate read consensus that would be really helpful.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by al-ash140
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1480 users visited in the last hour