Generating consensus sequence from bam file
9
7
Entering edit mode
5.0 years ago
chparada ▴ 70

Hi,

I am trying to generate consensus sequence from a bam file obtained after mapping SRA reads to a reference genome.

I used the following commands:

bwa mem ref.fasta SRR_1.fastq SRR_2.fastq > bwa.sam
samtools view -b -F 4 bwa.sam > bwa_aligned.bam
samtools index bwa_aligned.bam

I am not sure how to generate the consensus sequence that I have in mind. In case I don't explain this well. I made a diagram:

===========================================================>ref.fasta
- -- ---- ----      ----- --- --- -      --    ------ ----- 
------ --- ---     -------- --- --       ---- -      --- -->SRR_reads_mapping



==============  +  ================  +   ==================> consensus_sequence.fasta

Please let me know if you have any advice on this.

Cheers!!!

genome samtools bwa fasta • 32k views
ADD COMMENT
0
Entering edit mode

What do you mean by consensus sequence? The most frequent nucleotide in each position? The variants? There are plenty of tools that can give you the reading in each position, try bcftools mpileup for instance.

ADD REPLY
0
Entering edit mode

Thank you for your answers.

I meant obtaining the most frequent nucleotide in each position. From the mapped reads, I want to been able to obtain a consensus sequence in single fasta file. I am going to try bcftools mpileup.

ADD REPLY
0
Entering edit mode

see below my answer thanks

ADD REPLY
10
Entering edit mode
4.1 years ago
vsingh ▴ 100

1) Map short reads against reference gene sequence

# Create bowtie2 database
bowtie2-build REFERENCE.fasta REF_DB

# bowtie2 mapping
bowtie2 -x REF_DB -U SAMPLE.fastq --no-unal -S SAMPLE.sam

# samtools:  sort .sam file and convert to .bam file
samtools view -bS SAMPLE.sam | samtools sort - -o SAMPLE.bam

2) Get consensus sequence from .bam file

# Get consensus fastq file
samtools mpileup -uf REFERENCE.fasta SAMPLE.bam | bcftools call -c | vcfutils.pl vcf2fq > SAMPLE_cns.fastq

  # vcfutils.pl is part of bcftools

# Convert .fastq to .fasta and set bases of quality lower than 20 to N
seqtk seq -aQ64 -q20 -n N SAMPLE_cns.fastq > SAMPLE_cns.fasta
ADD COMMENT
5
Entering edit mode
5.0 years ago
nsmi8446 ▴ 170

You could call variants (using whatever variant calling software you like, GATK, freebayes etc.) from your .bam file and then use vcf-consensus (http://vcftools.sourceforge.net/perl_module.html#vcf-consensus) to build your consensus sequence. The code below should work:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

ADD COMMENT
1
Entering edit mode

I agree with nsmi8446. This is a nice concise way to solve your problem.

ADD REPLY
5
Entering edit mode
18 months ago
aswinssoman ▴ 50

The lastest version of samtools_V1.16 allows you to generate a consensus from a SAM, BAM or CRAM file based on the contents of the alignment records. The consensus is written either as FASTA, FASTQ, or a pileup oriented format. This is selected using the -f FORMAT option.

Eg:

samtools consensus -f fasta in.bam -o cons.fa 
ADD COMMENT
3
Entering edit mode
4.8 years ago
waldeyr ▴ 30
samtools mpileup -uf my_reference.fna my_file.bam | bcftools view -cg - | vcfutils.pl vcf2fq > my_consensus.fq
ADD COMMENT
2
Entering edit mode
4.8 years ago
trausch ★ 1.8k

Alfred has a consensus mode that extracts all reads at a given alignment position and then runs a multiple sequence alignment computation with consensus generation. It's primarily for long reads but I think it also works for short reads.

alfred consensus -t ill -f bam -p chr4:500500 input.bam
ADD COMMENT
2
Entering edit mode
4.8 years ago
chen ★ 2.5k

try gencore: https://github.com/OpenGene/gencore

Generate consensus reads to reduce sequencing noises and remove duplications

ADD COMMENT
2
Entering edit mode
4.8 years ago
guillaume.rbt ★ 1.0k

you can use GATK FastaAlternateReferenceMaker to generate the consensus sequence based on a SNP calling : https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

ADD COMMENT
2
Entering edit mode
8 months ago
DareDevil ★ 4.3k
samtools mpileup -uf reference.fasta sorted_aligned_reads.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fastq

Here's a breakdown of the command:

samtools mpileup: Generates a pileup of aligned reads at each position in the reference genome.

-u: Output in uncompressed BAM format.

-f reference.fasta: Specifies the reference genome in FASTA format.

sorted_aligned_reads.bam: The sorted and indexed BAM file.

bcftools call: Calls the consensus genotype for each position based on the pileup.

-c: Output the consensus genotype.

vcfutils.pl vcf2fq: Converts the consensus genotype in VCF format to FASTQ format, representing the consensus sequence.

consensus.fastq: The output file containing the consensus sequence in FASTQ format.

Please make sure to replace reference.fasta with the filename of your reference genome and sorted_aligned_reads.bam with the appropriate name of your sorted and indexed BAM file.

After running this script, you should obtain the consensus sequence in the consensus.fastq file. If you prefer a FASTA format instead of FASTQ, you can use tools like seqtk or fastq_to_fasta to convert the FASTQ file to FASTA format if needed.

ADD COMMENT
0
Entering edit mode

samtools mpileup is depriciated, so replace it with bcftools mpileup

ADD REPLY
0
Entering edit mode
8 months ago
jkbonfield ★ 1.2k

Since this was written, samtools now has a samtools consensus command. See the man page for controlling the coordinate system and whether indels are there.

It's much faster than the bcftools route, but potentially not as accurate in some scenarios (and more accurate in others, particularly with long read data). Something to try anyway.

ADD COMMENT
0
Entering edit mode

This is interesting, thank you. Do you have any further information on why it might be more accurate with long read data? And what roughly constitutes 'long' in this case? For example, we have reads derived from amplicon sequencing using ONT that are on average 1.2 kb in length, would you expect a possible improvement with samtools consensus compared to the other approach of SNP calling --> bcftools consensus?

ADD REPLY
1
Entering edit mode

A late reply I know. The main difference is the error models associated with long-read technologies - the actual read length doesn't matter really here. Consensus generation assuming most errors are substitution ones isn't so realiable when many newer technologies have indel as the primary error instead.

ADD REPLY

Login before adding your answer.

Traffic: 2367 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6