BAM file interpretation and consensus creation
1
0
Entering edit mode
9.3 years ago
User000 ▴ 690

Hello,

I have mapped 250M illumina paired-end reads against a 5000bp long gene using BWA and created a consensus sequence of that gene using samtools mpileup. The average depth coverage is low 5X. When I look at .bam file using samtools tview, this is what I see:

    341       351       361       371       381       391
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                          ACCGAAGGAAAAGGCAGTAGTAGKTAGCTAGCA
                          accgaaggaaaaggcagtag****tagctagcagcaa
                                                       agtag*tagctagcagcaaccacc
                                                          TAGCTAGCAGCAACCACCA

What do 'N's stand for here? My reference gene? Or a sequence below is my reference gene?

Also, When samtools mpileup generate a consensus sequence, after looking carefully at my alignment in .bam file I noticed there are some regions that appear in my consensus sequence that has only 1 read for example and another region do not appear even if it also has only 1 read. Why is that? Does it depend on the quality of base/mapping reads? Also, is there a way to produce a consensus sequence even if there is only 1 read even if it is of low quality? May be there is a way to make a less stringent parameters for consensus. Any advice is appreciated. Also, ask me for any additional explanations if I was not clear enough.

samtools bam • 3.1k views
ADD COMMENT
0
Entering edit mode
9.3 years ago
kautilya ▴ 430

The N's stand for the sequence in your reference gene. You could open up your gene sequence and check the reported nucleotides for the positions you have mentioned.

ADD COMMENT
0
Entering edit mode

For the sake of completeness, to view the actual reference sequence along with the alignments, one simply specifies the fasta file containing the reference sequence. For example, samtools tview foo.bam foo.fa.

ADD REPLY
0
Entering edit mode

my reference gene do not contain 'N's. I opened the .bam file in tview together with a .fasta file containing my reference gene. I surely opened up my gene sequenced the checked the alignment, this is how I noticed that there are some regions that appear in my consensus sequence that has only 1 read for example and another region do not appear even if it also has only 1 read.

ADD REPLY
0
Entering edit mode

Please post the exact command that you used and also the version of samtools.

ADD REPLY
0
Entering edit mode

Version: 0.1.19-96b5f2294a

samtools tview my.bam my.fasta

I indexed my fasta file and see that there are nucleotides instead of N's, however still some N's are remaining, what does it mean? Also the sequence under is supposed to be my consensus?

ADD REPLY
1
Entering edit mode

I only ever see Ns in regions with no alignments (I've never looked into what in the code is producing this, though I presume it's using mpileup under the hood and that that is what's causing this).

Regarding the line below that, yes, it's supposed to be the consensus.

BTW, there are many more convenient viewers for BAM files these days.

ADD REPLY
0
Entering edit mode

thank you for this reply. I have another question, if you can reply would be great! When I look for a .bam file and see a consensus sequence under a reference genome, however, when I generate the same consensus with samtools mpileup, I dont see some parts of consensus which are present in .bam file, why is that? is there a way to solve this?

ADD REPLY
1
Entering edit mode

samtools mpileup is doing variant calling which is then used to generate the consensus. This is different from simply taking the base distribution and determining the consensus from it.

ADD REPLY
0
Entering edit mode

I see..that makes sense..do you think there is a way to extract the line that corresponds to consensus directly from .bam file?

ADD REPLY
0
Entering edit mode

Sure, just parse the output of samtools mpileup directly rather than using it for variant calling. You'll probably need to write a program to do this, since the method you're trying not to use is more useful than the one you're trying to use.

ADD REPLY
0
Entering edit mode

samtools mpileup -uf outputfilename.fas my_dedup.bam | bcftools view -cg - |./vcfutils.pl vcf2fq > cons.fq

ADD REPLY

Login before adding your answer.

Traffic: 1680 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6