Tutorial: Generating consensus sequence from bam file
21
gravatar for finswimmer
8 months ago by
finswimmer12k
Germany
finswimmer12k wrote:

One of the recurring questions on biostars is "How can I create a consensus sequence from my bam file?" A variation of these question is "How to get fasta out of bam file?".

The rational behind this question is usually, to get the sequence for a given region of interest of my sample, so that one can do a multiple alignment between different samples/species.

The answer to this question is a simple two-step workflow:

  1. Call variants
  2. Include the variants into the reference sequence

Do the variant calling step with your favorite program/workflow, e.g. with bcftools:

$ bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa Oz -o output.vcf.gz

In the end, one needs a valid, sorted vcf file which is compressed with bgzip and indexed with tabix.

I also recommend to normalize your vcf file (In the above command this is done by bcftools norm).

If your vcf isn't already compressed you can do this with:

$ bgzip -c output.vcf > output.vcf.gz

And indexing is done by:

$ tabix output.vcf.gz

Now we are ready to create our consensus sequence. The tool of choice is bcftools consensus.

If one like to create a complete new reference genome, based on the called variants, this is done by:

$ bcftools consensus -f ref.fa output.vcf.gz > out.fa

If you are interested in only a given region:

$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus output.vcf.gz -o out.fa

You can also create consensus sequences for multiple regions, if you provide a bed file to samtools faidx:

$ samtools faidx -r regions.bed ref.fa  | bcftools consensus output.vcf.gz -o out.fa

Have fun :)

fin swimmer

bam tutorial consensus fasta • 939 views
ADD COMMENTlink modified 8 weeks ago • written 8 months ago by finswimmer12k

Thanks for this! I have a question about tweaking the commands to have the consensus sequence report a "wobble" base in cases of heterozygosity, for example an R would be reported where a SNP is heterozygous for G/A. Do you think this is possible to implement in your pipeline? I'm writing in the context of healthy somatic human genomes (diploids).

ADD REPLYlink written 9 weeks ago by joe120

According to the manual this should be possible with the -I argument:

-I, --iupac-codes
   output variants in the form of IUPAC ambiguity codes
ADD REPLYlink modified 8 weeks ago • written 9 weeks ago by finswimmer12k

This seems like a very long-winded solution and doesn't apply if you don't have a reference to hand.

However that said, if you don't then your aligned BAM was probably aligned against a de-novo assembly and somewhere there should be a consensus lurking around.

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file. I should probably write such a tool and add it to samtools as it's a recurring question and can be done MUCH faster than needing to use bcftools.

ADD REPLYlink written 8 weeks ago by jkbonfield200

Hello jkbonfield ,

I've generally done some trivial counting on the mpileup output, line by line, to call the consensus directly without going via vcf to start with and without needing a reference file.

  • How can you produce a mpileup if you don't have a reference?
  • Trivial counting might work if you have perfect reads and simple regions. But finding out what is the real sequence for your sample is much more that just counting and hence you are doing variant calling.

I should probably write such a tool and add it to samtools.

I would always recommend if there is a widely used tool to use it instead of writing your own. It is very likely that one introduce much more errors by writing its own tool. If I'm right, you are one of the active contributors to samtools? So this might be an exception.

fin swimmer

ADD REPLYlink written 8 weeks ago by finswimmer12k

"samtools mpileup foo.bam" will produce the columns showing what bases are aligned at each position. The reference column will be N, but that's irrelevant if you want the consensus (unless you're planning on using imputation to use reference in zero coverage regions, which can be useful in some situations).

As for not writing my own software - well that's what I do for a living (and yes much of it ends up in htslib and samtools). :-) There may well be tools out there already to generate consensus; bound to be infact, including my own hacky awk and perl one-liners, but having everything in the same package is nice for usability and discoverability.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by jkbonfield200
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1371 users visited in the last hour