One of the recurring questions on biostars is "How can I create a consensus sequence from my bam file?" A variation of these question is "How to get fasta out of bam file?".
The rational behind this question is usually, to get the sequence for a given region of interest of my sample, so that one can do a multiple alignment between different samples/species.
The answer to this question is a simple two-step workflow:
- Call variants
- Include the variants into the reference sequence
Do the variant calling step with your favorite program/workflow, e.g. with bcftools
:
$ bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa -Oz -o output.vcf.gz
In the end, one needs a valid, sorted vcf
file which is compressed with bgzip
and indexed with tabix
.
I also recommend to normalize your vcf file (In the above command this is done by bcftools norm
).
If your vcf
isn't already compressed you can do this with:
$ bgzip -c output.vcf > output.vcf.gz
And indexing is done by:
$ tabix output.vcf.gz
Now we are ready to create our consensus sequence. The tool of choice is bcftools consensus.
If one like to create a complete new reference genome, based on the called variants, this is done by:
$ bcftools consensus -f ref.fa output.vcf.gz > out.fa
If you are interested in only a given region:
$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus output.vcf.gz -o out.fa
You can also create consensus sequences for multiple regions, if you provide a bed
file to samtools faidx
:
$ samtools faidx -r regions.bed ref.fa | bcftools consensus output.vcf.gz -o out.fa
Have fun :)
fin swimmer
Edit by GenoMax: Added a missing -
in first bcftools norm
command.
Thanks for this! I have a question about tweaking the commands to have the consensus sequence report a "wobble" base in cases of heterozygosity, for example an R would be reported where a SNP is heterozygous for G/A. Do you think this is possible to implement in your pipeline? I'm writing in the context of healthy somatic human genomes (diploids).
According to the manual this should be possible with the
-I
argument:This seems like a very long-winded solution and doesn't apply if you don't have a reference to hand.
However that said, if you don't then your aligned BAM was probably aligned against a de-novo assembly and somewhere there should be a consensus lurking around.
I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file. I should probably write such a tool and add it to samtools as it's a recurring question and can be done MUCH faster than needing to use bcftools.
Hello jkbonfield ,
I would always recommend if there is a widely used tool to use it instead of writing your own. It is very likely that one introduce much more errors by writing its own tool. If I'm right, you are one of the active contributors to samtools? So this might be an exception.
fin swimmer
"samtools mpileup foo.bam" will produce the columns showing what bases are aligned at each position. The reference column will be N, but that's irrelevant if you want the consensus (unless you're planning on using imputation to use reference in zero coverage regions, which can be useful in some situations).
As for not writing my own software - well that's what I do for a living (and yes much of it ends up in htslib and samtools). :-) There may well be tools out there already to generate consensus; bound to be infact, including my own hacky awk and perl one-liners, but having everything in the same package is nice for usability and discoverability.
Thank you for this! I am trying to get the mtGenome sequence for my samples to do a MSA, but maybe I'm not understanding what the output for this should look like. When I try to get the consensus for my chrM (mtGenome), the fasta file only contains 1 sequence. I thought it was supposed to get the mtGenome sequences of all my samples in a single fasta file, right? I don't know what I'm doing wrong :(
As the title suggests you get a
consensus
i.e. one sequence. When compared to the reference you started with this consensus will have SNP's where there are differences between your data and the reference.You will need to generate a consensus independently for each sample if you have more than one sample.
Oh, since I never saw the -sample flag with the instructions I've seen around, I assumed that it was able to deal with a multi sample VCF without it and it would do for all the samples! Thanks for clarifying that.
It may be possible to do so but the example shown in tutorial is for a single sample VCF.
Hi @zjamal.msbi19rcms, thanks. I would like to double check what the
Oz
does in thebcftools norm -f ref.fa Oz -o output.vcf.gz
part. It's saying "no such file Oz", so isn't it meant to be a flag? as in-Oz
?Use a
-
beforeOz
. Looks like it s missing in the command above.We will confirm and update.Update made.HI GenoMax oddly now when using the
-Oz
I am getting the error:Why does it think -Oz is fasta file?
EDIT: Ah, I think after the
-f
there needs to be theref.fa
, since the-f
flag requires a FASTA file after itHave you indexed your fasta reference file?
Hey GenoMax , it worked after I just put the ref.fa after the -Oz flag
First of all thank you very much!! I'm having trouble with indexing my fasta file and I'm sure I have done something wrong because when I run
mpileup
I get this message "[E::fai_fetch] fai_fetch failed : unexpected end of file" or "[fai_fetch_seq] The sequence "CHRII" " not found. I do see that a.fai
file is created when I run the command. From what I see, the reference genome has to be in fasta format, right? Does anyone know what could be happening? Thanks for any advice!!Are you referring to the consensus fasta file that was produced or original reference?
unexpected end of file
would suggest that you may have a corrupt/truncated file.The original reference
You can create an index by doing
samtools faidx original.fa
. If that generates an error then your fasta file may have some formatting issues.Yes, I think my fasta has a problem (maybe not the same line size?) because I have tried with a different reference sequence and everything works. Thank you very much!!
Hi, I've managed to solve this problem and everything works right now. However, no variants apply when I run
mpileup
andconsensus
when I know that the strain I use has them. I think I'm missing something about how all of this works.I explain it: when I align my sequences from NGS using
bowtie
(and using, let's say "reference genome 1"), I get a bam file in which I call variants usingmpileup
(and using reference genome 1). When runningconsensus
, I also use reference genome 1. When doing this, no variants apply. However, if I "mix" reference genomes (i.e. using "reference genome 2" (from a totally different strain) when aligning and "reference genome 1" by variant calling, more than 2000 variants apply).I see no point in "mixing" genomes so that variants apply, but I also cannot understand why option 1 doesn't work. Is there something I'm missing? Thank you very much!!
Hi finsswimmer ,
Thank you for this workflow. However, I am still unclear for the steps I should follow and has been unable to find an answer.
Question: I would like to generate a sequence for a gene of interest (means consensus of reads) from 10 bam files, which will give me 10 fasta files or all in 1 file with header: gene_bam1, gene_bam2......etc.
Tried:
Note that using "samtools mpileup" to generate BCF or VCF files is now deprecated. To output these formats, please use "bcftools mpileup" instead.
Overall, would you suggest the workflow for my question considering latest info/parameters.
Thanks.
The method described above will give you 10 fasta files since the method is for one sample.
What are your BAM files from? Entire genome (preferable) or just the gene (not recommended, if sequencing was for whole genome). While you could subset your BAM'ss to get gene region you are interested in it may be simpler to just do the consensus for the genome and then fish out the region you need from consensus.
@GenoMax BAM files are from RNASeq. I would like to extract the sequence of few genes, translate to proteins and compare among multiple samples.
I tried bcftools consensus and IGV "copy consensus" option. It seems IGV gives the required output while bcftools give consensus while considering both sample and reference, so almost same for all samples for a gene.