Question

Obtaining The Consensus Sequence From A Bam File In Fasta

0

Entering edit mode

11.1 years ago

Anima Mundi ★ 2.9k

Hello everybody,

I am dealing with BAM files for the first time, and I had some problems to handle them. My goal is to obtain the whole consensus sequence (between sequencing folds of a genomic range from the same sample) in FASTA format. My first approach has been to find a "ready to go" script/one line command (i. e. I had a look here Convert bam file to fasta file ) to make the conversion, but I failed. This is what I got trying to use Samtools:

[main_samview] fail to open "filename.bam" for reading.
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "R_2:164876791-164886791.bam".

I tried to figure out how a BAM file is structured, but having a look with a text editor I saw that the only explicit information are the non-assembled reads, and it would take a while to obtain the consensus starting from there (and, I guess, there would be no point anymore to use a BAM file).

As a second choice, I tried to use some more graphically oriented tools, such Artemis. Until now, I had problems even to make one of this program work (but this are hopefully physiological issues which I can probably solve myself); furthermore, I did not find a way to extract a consensus from one of the samples pointed here http://www.sanger.ac.uk/resources/software/artemis/ngs/

What would you suggest? Is there an easy way to obtain what I need, while I get more confident with the format? I hope the question is clear enough.

bam fasta consensus msa genome • 6.5k views

ADD COMMENT • link updated 10.5 years ago by Biostar 20 • written 11.1 years ago by Anima Mundi ★ 2.9k

score 1 · Answer 1 · 2013-03-22

The post you linked to is about someone doing something else. They aren't trying to make a single consensus file; they are trying to make a multi-fasta of every individual read.

There is no plug-and-play simple script that will take any .bam and make a perfect consensus fasta, because doing that requires making judgement calls about all the reads that don't perfectly match your consensus, of which there will be a lot.

But you can use samtools/bcftools to make an all-points vcf from your .bam, and then the vcf2fq command in vcfutils will try to make a consensus taking into account the called SNPs.

score 0 · Answer 2 · 2013-03-22

0

Entering edit mode

11.1 years ago

Istvan Albert 100k

You are using a file in the wrong format.

Make sure that you understand the difference between text formats like SAM and binary formats like BAM. Once you have a correct file the advice that you got can be applied directly.

ADD COMMENT • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

So, if I understood well, my data are currently in SAM format. In fact, using Editra to display them shows tab-delimited fields which may correspond to the description fields of each read. Also, using the file command gives me this output: "ASCII text, with very long lines", so I am not dealing with binary files. Still, I don't see any header line and, by the way, trying to convert my files to BAM (samtools view -b -S myfile.sorted.bam) leads to this error: "[samopen] no @SQ lines in the header. [sam_read1] missing header? Abort!".

ADD REPLY • link 11.1 years ago by Anima Mundi ★ 2.9k

0

Entering edit mode

This was recently discussed right here How to extract unaligned sequences from BAM files obtainend from BWA

ADD REPLY • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

Ok, thank you. I think this particular issue is solved; still I have some troubles, I will open a new question in case I am not able to solve them out.

ADD REPLY • link 11.1 years ago by Anima Mundi ★ 2.9k