Confused about how to generate a consensus sequence after bwa
0
2
Entering edit mode
3.7 years ago
DNAngel ▴ 210

Hi all,

Alright so I'll edit my question here since my other post got closed. After generating consensus sequences with bcftools and vcfutils I am unsure how to get sequences without IUPAC ambiguity. My code that works in generating sequences:

samtools mpileup -Q 20 -q 20 -d 5000 -uf ref.fa sorted.bam | \
bcftools call -c | \
vcfutils.pl vcf2fq -d 10 -Q 20 -l 1 > output_consensus.fq
seqtk seq -a output_consensus.fq > output_consensus.fa


The alignment is filled with IUPAC codes but I need it to output the most called base per position or at least choose one of the bases if it happened that there was a 50/50 between bases. Using samtools and bcftools v 1.8.

Thanks!

bwa samtools mpileup • 4.6k views
3
Entering edit mode

Mapping reads to a reference is also called alignment. Therefore, mapped reads in a SAM file are aligned reads. A BAM file is a binary version of the SAM format. Typically, to save time and disk space, one does:

bwa mem (...) | samtools view -o aligned.bam


instead of bwa mem (...) > aligned.sam. To convert your SAM to BAM, do samtools view -o aligned.bam aligned.sam.

Another thing:

I'm only interested in a few candidate genes so I've grabbed the protein coding sequence from my reference species off Ensembl, and I've generated an index using bwa (to save time instead of mapping everything to the whole reference genome)

While true that this saves time, it is actually a pretty bad idea to align against a subset of the genome. The reason is that this may lead to false-positive alignments. The aligner always try to match every read to the reference and outputs the best hit. Given some of your reads are actually off-targets from the capturing (and this always happens during library prep), these might falsely be aligned to the exome rather than their true origin somewhere else in the genome. Therefore, it is good practice to align against the entire reference genome to avoid false-positives.

0
Entering edit mode

I understand that it would be better to do the whole alignment but this is part of a pipeline test (and my committee wants me to show both methods) so I am doing that. I will be doing the full alignment after once I even get enough space on a server. Right now on my end I need to align to just a few genes. I've tried many ways so far with samtools and bcftools, even trying to generate vcf files, but nothing has worked. Right now I just want to make a consensus sequence but I am at a loss at how to write my bcftools command.

So far the only thing that seemed to give the LEAST errors was:

samtools mpileup -vf ref_gene.fa sp1.aln.sorted.bam > sp1.vcf.gz
bcftools consensus sp1.vcf.gz > consensus.fa


But it says that it cannot recognize vcf.gz as a file which I don't understand why!

2
Entering edit mode

In case the content causes confusion in the future, here is the original content as it existed on 1/3/2019, 2:30 PM Eastern Time, before it was edited away:

Hi all,

I am working with data of 30 species that was generated from whole exome sequencing, I've run fastQC on the raw reads (single-end), trimmed them, ran fastqc again and the quality of all the species' reads look good. I'm only interested in a few candidate genes so I've grabbed the protein coding sequence from my reference species off Ensembl, and I've generated an index using bwa (to save time instead of mapping everything to the whole reference genome). I've completed bwa for mapping the reads to the reference gene, but now I am confused on how to actually convert all these reads per species into contigs that I can later make an alignment using MAFFT. I thought I needed to use the bwa aln method to make the alignment but that was incorrect as that is just another method for mapping the reads (I used bwa mem instead already).

I am reading up on mpileup from samtools, so I am trying: samtools mpileup -uf reference_gene.fa <aligned.bam> | bcftools some commands (haven't reached this stage yet)

But I don't understand what the aligned,bam is supposed to be. Are those my SAM reads that I generated after bwa mem? I don't think those are aligned. I also have not heard of aligning the mapped reads first before creating consensus reads. I may have just confused myself. Please help!

1
Entering edit mode

I'm only interested in a few candidate genes so I've grabbed the protein coding sequence from my reference species off Ensembl, and I've generated an index using bwa (to save time instead of mapping everything to the whole reference genome).

Why you shouldn't do this ATpoint already explained. What I like to add is, that you don't save this much time. bwa (and other mapper/aligner) using a data structure of the reference that makes it possible, that the time needed to find the correct position is independent of the reference size. Instead it depends only on the length of the read you want to align. This data structure have to get loaded into the memory and the beginning and that's the only point where a larger reference needs more time. In comparison to the whole runtime of the alignment this is very small.

Concerning your question, this is what I would do:

1. Align the reads for each species to the whole reference genome. Make sure you include ReadGroup information in this step.
2. Call variants for the regions of your interests (see example given in the manual of bcftools mpileup)
3. Create a consensus sequence (this means integrate the variants into the reference sequence) for each region of interest for each species using bcftools consensus

fin swimmer

0
Entering edit mode

I think first you can create consensus sequence and then align your reads using bwa.

0
Entering edit mode

That doesn't really make sense, can you elaborate on what you mean?

0
Entering edit mode

I've ran into a new issue, now I cannot create a consensus sequence at all and I don't understand why.

I have my indexed (from bwa) reference files and my sorted alignment bam files for 30 species in the same directory. I am trying to make consensus sequences of them all. I am even trying to do this now with vcf (even though I am not working on variant calling, I just want a consensus file).

My command:

samtools mpileup -uf ref_geneA.fa sp1.aln.sorted.bam | bcftools call -c - | vcfutils.pl vcf1fq > sp1_consensus.fq

The error message:

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[E::bcf_hdr_read] Invalid BCF2 magic string: only BCFv2.2 is supported
Failed to open -: could not parse header
Use of uninitialized value $l in numeric lt (<) at /usr/local/bin/vcfutils.pl line 566. Use of uninitialized value$l in numeric lt (<) at /usr/local/bin/vcfutils.pl line 566.


What is the uninitalized value $1? Isn't vcfutils.pl just converting the vcf into a fastq file which I can later convert to fasta? ADD REPLY 2 Entering edit mode Make sure you have the most recent version of bcftools (v1.9). $ bcftools mpileup -Ou -f ref.fa aln.bam | \
bcftools call -Ou -mv | \
bcftools norm -f ref.fa | \
bcftools consensus -f ref.fa -o consensus.fa

• bcftools mpileup -Ou -f ref.fa aln.bam collects metrics about each covered reference base. This information are used to find suspicious sites
• bcftools call -Ou -mv calls variants on suspicious sites
• bcftools norm -f ref.fa normalize variants
• bcftools consensus -f ref.fa -o consensus.fa creates a consensus.fa based on the variants found

fin swimmer

0
Entering edit mode

Hi, I know you mentioned to me earlier in another site that bcftools consensus will fill in gaps with bases from the reference...but I know from other scripts where colleagues have worked with paired-end data that this is not true. They were able to get the consensus sequence of their species1.bam files only, with gaps to show no coverage at a base from the reference. In my case (single-end data), my consensus sequence is merging the contigs from species1.bam (from the calls.vcf.gz file) and any gaps (which I know must exist and confirmed during tview) are filled with bases from the reference. There has to be a way to just merge contigs from the mpileup file of species1.bam without incorporating bases from the reference fasta file.

Edit: I am using bcftools 1.9 and samtools 1.9

2
Entering edit mode

Hello DNAngel,

are you the same user as here?

I is possible to create such a reference sequence. But not with bcftools alone. What you have to do is:

• Check the coverage on each reference base, e.g. with mosdepth
• extract all bases with no coverage zcat Sample.per-base.bed.gz|awk '\$4==0 {print}' > zero-coverage.bed
• masked the non covered region using bedtools maskfasta -fi input.fa -bed zero-coverage.bed -fo output.fa You can define any character you like for masking using the -mc parameter. The default is N
• Take output.fa as the reference is the bcftools consensus step

fin swimmer

1
Entering edit mode

Yes that is me! Thank you I will try this as well. Also, I think I may have gotten it with bcftools using call as you suggested but using -d 10 and -Q 20 settings to ensure high quality base calls are kept. And that (to me right now at least) makes the consensus only contain the high quality base calls from ALT. I will have to slowly go through both commands to ensure I am getting only my species bases.

Thank you fin swimmer for all the help :):):) saving my life here lol.

0
Entering edit mode

AFAIK, you can't use a modified reference genome with bcftools consensus. You'll get an error like "The fasta sequence does not match the REF allele at..." and the consensus output will die at that point (where the 'N' is in the reference). I

You CAN, however, generate the pileup with the masked reference (per sample), and then you're good on generating a consensus sequence using the same masked reference.

0
Entering edit mode

Hello DNAngel ,

vcfutils.pl vcf2fq is not the way to go, because it:

• creates a fastq file
• creates IUPAC codes and have no option to decide which ALT should be used

Please explain what's the problem with bcftools consensus and this additional description.

Thanks!

fin swimmer

0
Entering edit mode

Hi fin swimmer,

Sorry to reply to an old post but this so adequately fits my issue - I need to create a consensus sequence without IUPAC codes. But I can't use bcftools consensus because it will not generate a consensus sequence from a FASTA reference containing multiple sequences.

vcfutils.pl vcf2fq can deal with multi-sequence references but it uses IUPAC codes.

Is there any way to generate a consensus from a multi-sequence reference without IUPAC codes?

Thank you so much for your help!

0
Entering edit mode

Hello CC,

But I can't use bcftools consensus because it will not generate a consensus sequence from a FASTA reference containing multiple sequences.

can you add more information what you mean by this and how it looks like? A fasta file can contain multiple sequences, as long as they have different headers and bcftools should not have any problems with it.

fin swimmer

0
Entering edit mode

Sure, here was my bcftools consensus approach:

samtools mpileup -vf reference.fasta filename.sorted.bam | bcftools call -m -O z - > filename.vcf.gz

bcftools index filename.vcf.gz

bcftools consensus -f reference.fasta filename.vcf.gz > filename.consensus.fasta


However, when I do this using a multi-sequence reference fasta, only the first sequence comes out in the consensus.

I fixed this issue by using the vcfutils.pl vcf2fq approach, but then I have the IUPAC code issue...