Tutorial:Generating consensus sequence from bam file
0
27
Entering edit mode
2.1 years ago

One of the recurring questions on biostars is "How can I create a consensus sequence from my bam file?" A variation of these question is "How to get fasta out of bam file?".

The rational behind this question is usually, to get the sequence for a given region of interest of my sample, so that one can do a multiple alignment between different samples/species.

The answer to this question is a simple two-step workflow:

  1. Call variants
  2. Include the variants into the reference sequence

Do the variant calling step with your favorite program/workflow, e.g. with bcftools:

$ bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa Oz -o output.vcf.gz

In the end, one needs a valid, sorted vcf file which is compressed with bgzip and indexed with tabix.

I also recommend to normalize your vcf file (In the above command this is done by bcftools norm).

If your vcf isn't already compressed you can do this with:

$ bgzip -c output.vcf > output.vcf.gz

And indexing is done by:

$ tabix output.vcf.gz

Now we are ready to create our consensus sequence. The tool of choice is bcftools consensus.

If one like to create a complete new reference genome, based on the called variants, this is done by:

$ bcftools consensus -f ref.fa output.vcf.gz > out.fa

If you are interested in only a given region:

$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus output.vcf.gz -o out.fa

You can also create consensus sequences for multiple regions, if you provide a bed file to samtools faidx:

$ samtools faidx -r regions.bed ref.fa  | bcftools consensus output.vcf.gz -o out.fa

Have fun :)

fin swimmer

bam fasta consensus Tutorial • 3.7k views
ADD COMMENT
0
Entering edit mode

Thanks for this! I have a question about tweaking the commands to have the consensus sequence report a "wobble" base in cases of heterozygosity, for example an R would be reported where a SNP is heterozygous for G/A. Do you think this is possible to implement in your pipeline? I'm writing in the context of healthy somatic human genomes (diploids).

ADD REPLY
0
Entering edit mode

According to the manual this should be possible with the -I argument:

-I, --iupac-codes
   output variants in the form of IUPAC ambiguity codes
ADD REPLY
0
Entering edit mode

This seems like a very long-winded solution and doesn't apply if you don't have a reference to hand.

However that said, if you don't then your aligned BAM was probably aligned against a de-novo assembly and somewhere there should be a consensus lurking around.

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file. I should probably write such a tool and add it to samtools as it's a recurring question and can be done MUCH faster than needing to use bcftools.

ADD REPLY
0
Entering edit mode

Hello jkbonfield ,

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file.

  • How can you produce a mpileup if you don't have a reference?
  • Trivial counting might work if you have perfect reads and simple regions. But finding out what is the real sequence for your sample is much more that just counting and hence you are doing variant calling.

I should probably write such a tool and add it to samtools.

I would always recommend if there is a widely used tool to use it instead of writing your own. It is very likely that one introduce much more errors by writing its own tool. If I'm right, you are one of the active contributors to samtools? So this might be an exception.

fin swimmer

ADD REPLY
0
Entering edit mode

"samtools mpileup foo.bam" will produce the columns showing what bases are aligned at each position. The reference column will be N, but that's irrelevant if you want the consensus (unless you're planning on using imputation to use reference in zero coverage regions, which can be useful in some situations).

As for not writing my own software - well that's what I do for a living (and yes much of it ends up in htslib and samtools). :-) There may well be tools out there already to generate consensus; bound to be infact, including my own hacky awk and perl one-liners, but having everything in the same package is nice for usability and discoverability.

ADD REPLY
0
Entering edit mode

Thank you for this! I am trying to get the mtGenome sequence for my samples to do a MSA, but maybe I'm not understanding what the output for this should look like. When I try to get the consensus for my chrM (mtGenome), the fasta file only contains 1 sequence. I thought it was supposed to get the mtGenome sequences of all my samples in a single fasta file, right? I don't know what I'm doing wrong :(

ADD REPLY
1
Entering edit mode

As the title suggests you get a consensus i.e. one sequence. When compared to the reference you started with this consensus will have SNP's where there are differences between your data and the reference.

You will need to generate a consensus independently for each sample if you have more than one sample.

ADD REPLY
0
Entering edit mode

Oh, since I never saw the -sample flag with the instructions I've seen around, I assumed that it was able to deal with a multi sample VCF without it and it would do for all the samples! Thanks for clarifying that.

ADD REPLY
1
Entering edit mode

It may be possible to do so but the example shown in tutorial is for a single sample VCF.

ADD REPLY

Login before adding your answer.

Traffic: 2876 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6