Question: Tune criteria for extraction of consensus sequence from bam/sam alignment
1
gravatar for al-ash
17 months ago by
al-ash100
Japan/Okinawa/OIST
al-ash100 wrote:

I'd like to extract consensus of reads from a bam/sam file which would follow particular criteria:

  1. The portions of reference which have 0 coverage are represented in the consensus as N
  2. If there is at least single read mapping to a particular nucleotide, this nucleotide is retrieved in a consensus.
  3. If there is conflicting evidence from reads, the nucleotide with highest frequency (lets say 75%) is retrieved in consensus. If there is no nucleotide with >= 75% frequency, N is retrieved in the consensus.

Is there a tool which would enable this or similar level of fine-tuning of the consensus extraction from bam/sam?

What I tried: I used samtools, bcftools and vcfutils to get consensus:

samtools mpileup -uf reference.nt mapping.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fasta

However, the consensus I'm getting this way contains a substantial proportion of the reference sequence and I can't find parametres which could modify how the consensus is calculated.

Yet another approach - extracting consensus using IGV viewer (option Copy consensus sequence) gives me basically the consensus I'm looking for but there is apparently no way how to automatize this for hundreds of reference sequences which I have.

Note: I posted this question also as a comment to my original question Map eukaryotic genomic reads to transcript reference of closely related organism however I'm posting it here separately as it diverged from my original question.

sam bcftools bam vcfutils consensus • 1.1k views
ADD COMMENTlink modified 16 months ago • written 17 months ago by al-ash100

Hi al-ash. I am looking for similar answer for your 3rd question but instead of N I just want to have reference base only. Could you figure how to do that? Perhaps you can share if you had found some way. Thanks.

ADD REPLYlink written 16 months ago by Prakki Rama2.2k

Hi Prakki, that sounds a bit more suitable task for the combination of samtools and bcftools I gave above (after possibly tuning the parametres) - did you try to play with these tools?

Or you might give a try to e.g. FastaAlternateReferenceMaker (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php) which "replaces the reference bases at variation sites with the bases supplied in the corresponding callset records".

I do not know your exact requirements for the consensus but my main question was to make sure that actually no bases from reference are included in the consensus. On the other side, your requirements should be easier to fulfill by some of the tools above as they are designed to obtain this type of consensus you require.

ADD REPLYlink written 16 months ago by al-ash100

Hi al-ash. Thanks for comment. Yes. I tried samtools and bcftools. But, could you manage to set the cut off of 75% frequency? The nearest parameter I could come across is --max-maf 0.25 in the vcftools. Is that what you used as well for cutoff? In my case, I am okay with the reference included in consensus.

ADD REPLYlink written 16 months ago by Prakki Rama2.2k

Prakki, unfortunately I can't help you here since I was not testing the vcftools parametres. I did not yet manage to fine-tune the extraction of consensus, but as I wrote, that is mainly because I do not want to generate consensus which would have the nucleotide bases of reference at the position where no variant was called - I want to have there Ns whenever the identity of nucleotide at that position is well supported by read evidence. In your case, however, vcftools parametres, as you suggest, might solve your problem.

ADD REPLYlink written 16 months ago by al-ash100

That's fine al-ash. Thanks for your comments! I am trying to figure out!

ADD REPLYlink written 16 months ago by Prakki Rama2.2k
2
gravatar for al-ash
16 months ago by
al-ash100
Japan/Okinawa/OIST
al-ash100 wrote:

simple-consensus-per-read-group by Thomas Sibley is the tool I was looking for (https://github.com/MullinsLab/simple-consensus-per-read-group). It was recently updated so it can now correctly process BAMs with multiple scaffolds/contigs.

ADD COMMENTlink written 16 months ago by al-ash100

Thank you al-ash for your answer. I had some problems with pysam to run the above tool. For your case, you can generate the coordinates for those position in bed format and pass your reference to maskfasta from bedtools which will in turn convert them to N's.

ADD REPLYlink modified 15 months ago • written 15 months ago by Prakki Rama2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1159 users visited in the last hour