Tutorial:Generating consensus sequence from bam file
0
37
Entering edit mode
2.3 years ago

One of the recurring questions on biostars is "How can I create a consensus sequence from my bam file?" A variation of these question is "How to get fasta out of bam file?".

The rational behind this question is usually, to get the sequence for a given region of interest of my sample, so that one can do a multiple alignment between different samples/species.

The answer to this question is a simple two-step workflow:

1. Call variants
2. Include the variants into the reference sequence

Do the variant calling step with your favorite program/workflow, e.g. with bcftools:

$bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa -Oz -o output.vcf.gz  In the end, one needs a valid, sorted vcf file which is compressed with bgzip and indexed with tabix. I also recommend to normalize your vcf file (In the above command this is done by bcftools norm). If your vcf isn't already compressed you can do this with: $ bgzip -c output.vcf > output.vcf.gz


And indexing is done by:

$tabix output.vcf.gz  Now we are ready to create our consensus sequence. The tool of choice is bcftools consensus. If one like to create a complete new reference genome, based on the called variants, this is done by: $ bcftools consensus -f ref.fa output.vcf.gz > out.fa


If you are interested in only a given region:

$samtools faidx ref.fa 8:11870-11890 | bcftools consensus output.vcf.gz -o out.fa  You can also create consensus sequences for multiple regions, if you provide a bed file to samtools faidx: $ samtools faidx -r regions.bed ref.fa  | bcftools consensus output.vcf.gz -o out.fa


Have fun :)

fin swimmer

Edit by GenoMax: Added a missing - in first bcftools norm command.

bam consensus fasta Tutorial • 4.8k views
1
Entering edit mode

Thanks for this! I have a question about tweaking the commands to have the consensus sequence report a "wobble" base in cases of heterozygosity, for example an R would be reported where a SNP is heterozygous for G/A. Do you think this is possible to implement in your pipeline? I'm writing in the context of healthy somatic human genomes (diploids).

0
Entering edit mode

According to the manual this should be possible with the -I argument:

-I, --iupac-codes
output variants in the form of IUPAC ambiguity codes

1
Entering edit mode

This seems like a very long-winded solution and doesn't apply if you don't have a reference to hand.

However that said, if you don't then your aligned BAM was probably aligned against a de-novo assembly and somewhere there should be a consensus lurking around.

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file. I should probably write such a tool and add it to samtools as it's a recurring question and can be done MUCH faster than needing to use bcftools.

1
Entering edit mode

Hello jkbonfield ,

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file.

• How can you produce a mpileup if you don't have a reference?
• Trivial counting might work if you have perfect reads and simple regions. But finding out what is the real sequence for your sample is much more that just counting and hence you are doing variant calling.

I should probably write such a tool and add it to samtools.

I would always recommend if there is a widely used tool to use it instead of writing your own. It is very likely that one introduce much more errors by writing its own tool. If I'm right, you are one of the active contributors to samtools? So this might be an exception.

fin swimmer

1
Entering edit mode

"samtools mpileup foo.bam" will produce the columns showing what bases are aligned at each position. The reference column will be N, but that's irrelevant if you want the consensus (unless you're planning on using imputation to use reference in zero coverage regions, which can be useful in some situations).

As for not writing my own software - well that's what I do for a living (and yes much of it ends up in htslib and samtools). :-) There may well be tools out there already to generate consensus; bound to be infact, including my own hacky awk and perl one-liners, but having everything in the same package is nice for usability and discoverability.

1
Entering edit mode

Thank you for this! I am trying to get the mtGenome sequence for my samples to do a MSA, but maybe I'm not understanding what the output for this should look like. When I try to get the consensus for my chrM (mtGenome), the fasta file only contains 1 sequence. I thought it was supposed to get the mtGenome sequences of all my samples in a single fasta file, right? I don't know what I'm doing wrong :(

2
Entering edit mode

As the title suggests you get a consensus i.e. one sequence. When compared to the reference you started with this consensus will have SNP's where there are differences between your data and the reference.

You will need to generate a consensus independently for each sample if you have more than one sample.

0
Entering edit mode

Oh, since I never saw the -sample flag with the instructions I've seen around, I assumed that it was able to deal with a multi sample VCF without it and it would do for all the samples! Thanks for clarifying that.

2
Entering edit mode

It may be possible to do so but the example shown in tutorial is for a single sample VCF.

1
Entering edit mode

Hi @zjamal.msbi19rcms, thanks. I would like to double check what the Oz does in the bcftools norm -f ref.fa Oz -o output.vcf.gz part. It's saying "no such file Oz", so isn't it meant to be a flag? as in -Oz?

2
Entering edit mode

Use a - before Oz. Looks like it s missing in the command above. We will confirm and update. Update made.

0
Entering edit mode

HI GenoMax oddly now when using the -Oz I am getting the error:

[mpileup] 1 samples in 1 input files
[E::fai_build3] Failed to open the FASTA file -Oz
Failed to load the fai index: -Oz


Why does it think -Oz is fasta file?

EDIT: Ah, I think after the -f there needs to be the ref.fa, since the -f flag requires a FASTA file after it

1
Entering edit mode

Have you indexed your fasta reference file?

0
Entering edit mode

Hey GenoMax , it worked after I just put the ref.fa after the -Oz flag

0
Entering edit mode

First of all thank you very much!! I'm having trouble with indexing my fasta file and I'm sure I have done something wrong because when I run mpileup I get this message "[E::fai_fetch] fai_fetch failed : unexpected end of file" or "[fai_fetch_seq] The sequence "CHRII" " not found. I do see that a .fai file is created when I run the command. From what I see, the reference genome has to be in fasta format, right? Does anyone know what could be happening? Thanks for any advice!!

0
Entering edit mode

Are you referring to the consensus fasta file that was produced or original reference? unexpected end of file would suggest that you may have a corrupt/truncated file.

0
Entering edit mode

The original reference

0
Entering edit mode

You can create an index by doing samtools faidx original.fa. If that generates an error then your fasta file may have some formatting issues.

0
Entering edit mode

Yes, I think my fasta has a problem (maybe not the same line size?) because I have tried with a different reference sequence and everything works. Thank you very much!!

0
Entering edit mode

Hi, I've managed to solve this problem and everything works right now. However, no variants apply when I run mpileup and consensus when I know that the strain I use has them. I think I'm missing something about how all of this works.

I explain it: when I align my sequences from NGS using bowtie (and using, let's say "reference genome 1"), I get a bam file in which I call variants using mpileup (and using reference genome 1). When running consensus, I also use reference genome 1. When doing this, no variants apply. However, if I "mix" reference genomes (i.e. using "reference genome 2" (from a totally different strain) when aligning and "reference genome 1" by variant calling, more than 2000 variants apply).

I see no point in "mixing" genomes so that variants apply, but I also cannot understand why option 1 doesn't work. Is there something I'm missing? Thank you very much!!

0
Entering edit mode

Hi finsswimmer ,

Thank you for this workflow. However, I am still unclear for the steps I should follow and has been unable to find an answer.

Question: I would like to generate a sequence for a gene of interest (means consensus of reads) from 10 bam files, which will give me 10 fasta files or all in 1 file with header: gene_bam1, gene_bam2......etc.

Tried:

1. Installation of BamBam to use Bam2consensus, but did not work out as it had not been updated since 2016.
2. Still most people use samtools mpileup but while using it, this comes: Note that using "samtools mpileup" to generate BCF or VCF files is now deprecated. To output these formats, please use "bcftools mpileup" instead.
3. I am unsure on commands for bcftools as this http://samtools.github.io/bcftools/howtos/consensus-sequence.html has not been helpful in case of extracting gene info.

Overall, would you suggest the workflow for my question considering latest info/parameters.

Thanks.

0
Entering edit mode

The method described above will give you 10 fasta files since the method is for one sample.

What are your BAM files from? Entire genome (preferable) or just the gene (not recommended, if sequencing was for whole genome). While you could subset your BAM'ss to get gene region you are interested in it may be simpler to just do the consensus for the genome and then fish out the region you need from consensus.

0
Entering edit mode

@GenoMax BAM files are from RNASeq. I would like to extract the sequence of few genes, translate to proteins and compare among multiple samples.

0
Entering edit mode

I tried bcftools consensus and IGV "copy consensus" option. It seems IGV gives the required output while bcftools give consensus while considering both sample and reference, so almost same for all samples for a gene.