Question: BAM file interpretation and consensus creation
0
gravatar for User000
2.7 years ago by
User000240
User000240 wrote:

Hello,

I have mapped 250M illumina paired-end reads against a 5000bp long gene using BWA and created a consensus sequence of that gene using samtools mpileup. The average depth coverage is low 5X. When I look at .bam file using samtools tview, this is what I see:

    341       351       361       371       381       391
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                          ACCGAAGGAAAAGGCAGTAGTAGKTAGCTAGCA
                          accgaaggaaaaggcagtag****tagctagcagcaa
                                                       agtag*tagctagcagcaaccacc
                                                          TAGCTAGCAGCAACCACCA    

 

What do 'N's stand for here? My reference gene? Or a sequence below is my reference gene?

Also, When samtools mpileup generate a consensus sequence, after looking carefully at my alignment in .bam file I noticed there are some regions that appear in my consensus sequence that has only 1 read for example and another region do not appear even if it also has only 1 read. Why is that? Does it depend on the quality of base/mapping reads? Also, is there a way to produce a consensus sequence even if there is only 1 read even if it is of low quality? May be there is a way to make a less stringent parameters for consensus. Any advice is appreciated. Also, ask me for any additional explanations if I was not clear enough.

samtools bam • 1.2k views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by User000240
0
gravatar for kautilya
2.7 years ago by
kautilya340
United States
kautilya340 wrote:

The N's stand for the sequence in your reference gene. You could open up your gene sequence and check the reported nucleotides for the positions you have mentioned.

ADD COMMENTlink written 2.7 years ago by kautilya340

For the sake of completeness, to view the actual reference sequence along with the alignments, one simply specifies the fasta file containing the reference sequence. For example, samtools tview foo.bam foo.fa.

ADD REPLYlink written 2.7 years ago by Devon Ryan70k

my reference gene do not contain 'N's. I opened the .bam file in tview together with a .fasta file containing my reference gene. I surely opened up my gene sequenced the checked the alignment, this is how I noticed that there are some regions that appear in my consensus sequence that has only 1 read for example and another region do not appear even if it also has only 1 read.

ADD REPLYlink written 2.7 years ago by User000240

Please post the exact command that you used and also the version of samtools.

ADD REPLYlink written 2.7 years ago by Devon Ryan70k

Version: 0.1.19-96b5f2294a

samtools tview my.bam my.fasta

I indexed my fasta file and see that there are nucleotides instead of N's, however still some N's are remaining, what does it mean? Also the sequence under is supposed to be my consensus?

ADD REPLYlink written 2.7 years ago by User000240
1

I only ever see Ns in regions with no alignments (I've never looked into what in the code is producing this, though I presume it's using mpileup under the hood and that that is what's causing this).

Regarding the line below that, yes, it's supposed to be the consensus.

BTW, there are many more convenient viewers for BAM files these days.

ADD REPLYlink written 2.7 years ago by Devon Ryan70k

thank you for this reply. I have another question, if you can reply would be great! When I look for a .bam file and see a consensus sequence under a reference genome, however, when I generate the same consensus with samtools mpileup, I dont see some parts of consensus which are present in .bam file, why is that? is there a way to solve this?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by User000240
1

samtools mpileup is doing variant calling which is then used to generate the consensus. This is different from simply taking the base distribution and determining the consensus from it.

ADD REPLYlink written 2.7 years ago by Devon Ryan70k

I see..that makes sense..do you think there is a way to extract the line that corresponds to consensus directly from .bam file?

ADD REPLYlink written 2.7 years ago by User000240

Sure, just parse the output of samtools mpileup directly rather than using it for variant calling. You'll probably need to write a program to do this, since the method you're trying not to use is more useful than the one you're trying to use.

ADD REPLYlink written 2.7 years ago by Devon Ryan70k

samtools mpileup -uf outputfilename.fas my_dedup.bam | bcftools view -cg - |./vcfutils.pl vcf2fq > cons.fq

ADD REPLYlink written 2.7 years ago by User000240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1397 users visited in the last hour