Question: Confused about how to generate a consensus sequence after bwa
0
gravatar for DNAngel
10 weeks ago by
DNAngel20
DNAngel20 wrote:

Hi all,

Alright so I'll edit my question here since my other post got closed. After generating consensus sequences with bcftools and vcfutils I am unsure how to get sequences without IUPAC ambiguity. My code that works in generating sequences:

samtools mpileup -Q 20 -q 20 -d 5000 -uf ref.fa sorted.bam | \
bcftools call -c | \
vcfutils.pl vcf2fq -d 10 -Q 20 -l 1 > output_consensus.fq
seqtk seq -a output_consensus.fq > output_consensus.fa

The alignment is filled with IUPAC codes but I need it to output the most called base per position or at least choose one of the bases if it happened that there was a 50/50 between bases. Using samtools and bcftools v 1.8.

Thanks!

mpileup bwa samtools • 355 views
ADD COMMENTlink modified 7 weeks ago • written 10 weeks ago by DNAngel20
2

Mapping reads to a reference is also called alignment. Therefore, mapped reads in a SAM file are aligned reads. A BAM file is a binary version of the SAM format. Typically, to save time and disk space, one does:

bwa mem (...) | samtools view -o aligned.bam

instead of bwa mem (...) > aligned.sam. To convert your SAM to BAM, do samtools view -o aligned.bam aligned.sam.

Another thing:

I'm only interested in a few candidate genes so I've grabbed the protein coding sequence from my reference species off Ensembl, and I've generated an index using bwa (to save time instead of mapping everything to the whole reference genome)

While true that this saves time, it is actually a pretty bad idea to align against a subset of the genome. The reason is that this may lead to false-positive alignments. The aligner always try to match every read to the reference and outputs the best hit. Given some of your reads are actually off-targets from the capturing (and this always happens during library prep), these might falsely be aligned to the exome rather than their true origin somewhere else in the genome. Therefore, it is good practice to align against the entire reference genome to avoid false-positives.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by ATpoint13k

I understand that it would be better to do the whole alignment but this is part of a pipeline test (and my committee wants me to show both methods) so I am doing that. I will be doing the full alignment after once I even get enough space on a server. Right now on my end I need to align to just a few genes. I've tried many ways so far with samtools and bcftools, even trying to generate vcf files, but nothing has worked. Right now I just want to make a consensus sequence but I am at a loss at how to write my bcftools command.

So far the only thing that seemed to give the LEAST errors was:

samtools mpileup -vf ref_gene.fa sp1.aln.sorted.bam > sp1.vcf.gz
bcftools consensus sp1.vcf.gz > consensus.fa

But it says that it cannot recognize vcf.gz as a file which I don't understand why!

ADD REPLYlink written 10 weeks ago by DNAngel20
1

I'm only interested in a few candidate genes so I've grabbed the protein coding sequence from my reference species off Ensembl, and I've generated an index using bwa (to save time instead of mapping everything to the whole reference genome).

Why you shouldn't do this ATpoint already explained. What I like to add is, that you don't save this much time. bwa (and other mapper/aligner) using a data structure of the reference that makes it possible, that the time needed to find the correct position is independent of the reference size. Instead it depends only on the length of the read you want to align. This data structure have to get loaded into the memory and the beginning and that's the only point where a larger reference needs more time. In comparison to the whole runtime of the alignment this is very small.

Concerning your question, this is what I would do:

  1. Align the reads for each species to the whole reference genome. Make sure you include ReadGroup information in this step.
  2. Call variants for the regions of your interests (see example given in the manual of bcftools mpileup)
  3. Create a consensus sequence (this means integrate the variants into the reference sequence) for each region of interest for each species using bcftools consensus

fin swimmer

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by finswimmer10k
1

In case the content causes confusion in the future, here is the original content as it existed on 1/3/2019, 2:30 PM Eastern Time, before it was edited away:

Hi all,

I am working with data of 30 species that was generated from whole exome sequencing, I've run fastQC on the raw reads (single-end), trimmed them, ran fastqc again and the quality of all the species' reads look good. I'm only interested in a few candidate genes so I've grabbed the protein coding sequence from my reference species off Ensembl, and I've generated an index using bwa (to save time instead of mapping everything to the whole reference genome). I've completed bwa for mapping the reads to the reference gene, but now I am confused on how to actually convert all these reads per species into contigs that I can later make an alignment using MAFFT. I thought I needed to use the bwa aln method to make the alignment but that was incorrect as that is just another method for mapping the reads (I used bwa mem instead already).

I am reading up on mpileup from samtools, so I am trying: samtools mpileup -uf reference_gene.fa <aligned.bam> | bcftools some commands (haven't reached this stage yet)

But I don't understand what the aligned,bam is supposed to be. Are those my SAM reads that I generated after bwa mem? I don't think those are aligned. I also have not heard of aligning the mapped reads first before creating consensus reads. I may have just confused myself. Please help!

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by RamRS20k

I think first you can create consensus sequence and then align your reads using bwa.

ADD REPLYlink written 10 weeks ago by Ram130

That doesn't really make sense, can you elaborate on what you mean?

ADD REPLYlink written 10 weeks ago by WouterDeCoster37k

I've ran into a new issue, now I cannot create a consensus sequence at all and I don't understand why.

I have my indexed (from bwa) reference files and my sorted alignment bam files for 30 species in the same directory. I am trying to make consensus sequences of them all. I am even trying to do this now with vcf (even though I am not working on variant calling, I just want a consensus file).

My command:

samtools mpileup -uf ref_geneA.fa sp1.aln.sorted.bam | bcftools call -c - | vcfutils.pl vcf1fq > sp1_consensus.fq

The error message:

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[E::bcf_hdr_read] Invalid BCF2 magic string: only BCFv2.2 is supported
Failed to open -: could not parse header
Use of uninitialized value $l in numeric lt (<) at /usr/local/bin/vcfutils.pl line 566.
Use of uninitialized value $l in numeric lt (<) at /usr/local/bin/vcfutils.pl line 566.

What is the uninitalized value $1? Isn't vcfutils.pl just converting the vcf into a fastq file which I can later convert to fasta?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by DNAngel20

Make sure you have the most recent version of bcftools (v1.9).

$ bcftools mpileup -Ou -f ref.fa aln.bam | \
bcftools call -Ou -mv | \
bcftools norm -f ref.fa | \
bcftools consensus -f ref.fa -o consensus.fa
  • bcftools mpileup -Ou -f ref.fa aln.bam collects metrics about each covered reference base. This information are used to find suspicious sites
  • bcftools call -Ou -mv calls variants on suspicious sites
  • bcftools norm -f ref.fa normalize variants
  • bcftools consensus -f ref.fa -o consensus.fa creates a consensus.fa based on the variants found

fin swimmer

ADD REPLYlink written 10 weeks ago by finswimmer10k

Hi, I know you mentioned to me earlier in another site that bcftools consensus will fill in gaps with bases from the reference...but I know from other scripts where colleagues have worked with paired-end data that this is not true. They were able to get the consensus sequence of their species1.bam files only, with gaps to show no coverage at a base from the reference. In my case (single-end data), my consensus sequence is merging the contigs from species1.bam (from the calls.vcf.gz file) and any gaps (which I know must exist and confirmed during tview) are filled with bases from the reference. There has to be a way to just merge contigs from the mpileup file of species1.bam without incorporating bases from the reference fasta file.

Edit: I am using bcftools 1.9 and samtools 1.9

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by DNAngel20

Hello DNAngel,

are you the same user as here?

I is possible to create such a reference sequence. But not with bcftools alone. What you have to do is:

  • Check the coverage on each reference base, e.g. with mosdepth
  • extract all bases with no coverage zcat Sample.per-base.bed.gz|awk '$4==0 {print}' > zero-coverage.bed
  • masked the non covered region using bedtools maskfasta -fi input.fa -bed zero-coverage.bed -fo output.fa You can define any character you like for masking using the -mc parameter. The default is N
  • Take output.fa as the reference is the bcftools consensus step

fin swimmer

ADD REPLYlink written 9 weeks ago by finswimmer10k

Yes that is me! Thank you I will try this as well. Also, I think I may have gotten it with bcftools using call as you suggested but using -d 10 and -Q 20 settings to ensure high quality base calls are kept. And that (to me right now at least) makes the consensus only contain the high quality base calls from ALT. I will have to slowly go through both commands to ensure I am getting only my species bases.

Thank you fin swimmer for all the help :):):) saving my life here lol.

ADD REPLYlink written 9 weeks ago by DNAngel20

Hello DNAngel ,

please do not edit your posts in that way, that no one can see to which question already existing answers/comments belong.

vcfutils.pl vcf2fq is not the way to go, because it:

  • creates a fastq file
  • creates IUPAC codes and have no option to decide which ALT should be used

Please explain what's the problem with bcftools consensus and this additional description.

Thanks!

fin swimmer

ADD REPLYlink written 7 weeks ago by finswimmer10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 754 users visited in the last hour