Hello,
I have mapped 250M illumina paired-end reads against a 5000bp long gene using BWA and created a consensus sequence of that gene using samtools mpileup
. The average depth coverage is low 5X. When I look at .bam file using samtools tview
, this is what I see:
341 351 361 371 381 391
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACCGAAGGAAAAGGCAGTAGTAGKTAGCTAGCA
accgaaggaaaaggcagtag****tagctagcagcaa
agtag*tagctagcagcaaccacc
TAGCTAGCAGCAACCACCA
What do 'N's stand for here? My reference gene? Or a sequence below is my reference gene?
Also, When samtools mpileup generate a consensus sequence, after looking carefully at my alignment in .bam file I noticed there are some regions that appear in my consensus sequence that has only 1 read for example and another region do not appear even if it also has only 1 read. Why is that? Does it depend on the quality of base/mapping reads? Also, is there a way to produce a consensus sequence even if there is only 1 read even if it is of low quality? May be there is a way to make a less stringent parameters for consensus. Any advice is appreciated. Also, ask me for any additional explanations if I was not clear enough.
For the sake of completeness, to view the actual reference sequence along with the alignments, one simply specifies the fasta file containing the reference sequence. For example,
samtools tview foo.bam foo.fa
.my reference gene do not contain 'N's. I opened the .bam file in tview together with a .fasta file containing my reference gene. I surely opened up my gene sequenced the checked the alignment, this is how I noticed that there are some regions that appear in my consensus sequence that has only 1 read for example and another region do not appear even if it also has only 1 read.
Please post the exact command that you used and also the version of samtools.
Version: 0.1.19-96b5f2294a
samtools tview my.bam my.fasta
I indexed my fasta file and see that there are nucleotides instead of N's, however still some N's are remaining, what does it mean? Also the sequence under is supposed to be my consensus?
I only ever see Ns in regions with no alignments (I've never looked into what in the code is producing this, though I presume it's using mpileup under the hood and that that is what's causing this).
Regarding the line below that, yes, it's supposed to be the consensus.
BTW, there are many more convenient viewers for BAM files these days.
thank you for this reply. I have another question, if you can reply would be great! When I look for a .bam file and see a consensus sequence under a reference genome, however, when I generate the same consensus with samtools mpileup, I dont see some parts of consensus which are present in .bam file, why is that? is there a way to solve this?
samtools mpileup is doing variant calling which is then used to generate the consensus. This is different from simply taking the base distribution and determining the consensus from it.
I see..that makes sense..do you think there is a way to extract the line that corresponds to consensus directly from .bam file?
Sure, just parse the output of samtools mpileup directly rather than using it for variant calling. You'll probably need to write a program to do this, since the method you're trying not to use is more useful than the one you're trying to use.
samtools mpileup -uf outputfilename.fas my_dedup.bam | bcftools view -cg - |./vcfutils.pl vcf2fq > cons.fq