Question: What Is Mpileup Data And Where Is The Snp Data?
gravatar for Sortinghat
8.2 years ago by
Sortinghat10 wrote:

I know this is a very basic question, but I've been having difficulties discovering how to analyze my mpileup data.

Because I feel that I don't know how to analyze it, I am questioning whether I understand what it is. So this is a simple question, but is necessary:

What is contained in mpileup data? According to samtools man "In the pileup format (without -uor-g), each line represents a genomic position, consisting of chromosome name, coordinate, reference base, read bases, read qualities and alignment mapping qualities."

But I fail to see where my SNP information is. I am simply looking to determine the location of SNPs that were part of the input that was run through bowtie.

If I'm not being clear, please let me know and I will add more details on what I am specifically trying to do.

Thanks in advance! -SortingHat

EDIT: So it seems that if I run pileup (I know it is depricated) that my SNPs come through perfectly, but when I run mpileup I get a ton of garbage data which has made this whole process very difficult. How can I cut down mpileup to reflect what pileup represents?

samtools snp mpileup • 13k views
ADD COMMENTlink modified 4.3 years ago by Biostar ♦♦ 20 • written 8.2 years ago by Sortinghat10
gravatar for Rlong
8.2 years ago by
Rlong340 wrote:

It would be helpful to see the exact command you are using to run mpileup. If you are taking the raw output, you will not have a consensus called. In order to get consensus, you should try running mpileup as specified here:

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf

The old pileup command used to provide a consensus, but mpileup does not. As Chris suggested, you will probably want to do some post processing. The above link to the mpileup sourceforge page can get you started on that as well.

ADD COMMENTlink modified 5 months ago by RamRS25k • written 8.2 years ago by Rlong340

I used this: samtools mpileup inflA.sorted.bam iflA.sorted.mpileup

I assumed inflA.sorted.bam was my consensus sequence.

Thanks for the link, I'll definitely run through that information.

ADD REPLYlink written 8.2 years ago by Sortinghat10

Do you have any suggestions for post processing? Thanks again! -Matt

ADD REPLYlink written 8.2 years ago by Sortinghat10
gravatar for Chris Miller
8.2 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

The spec page for the pileup format should answer most of your questions. Every line ouput by mpileup is a potential SNP, and the information on which bases are present is encoded in the 5th column. For a realistic pipeline, post-processing is generally used to filter this list down to a high-quality and more manageable list.

ADD COMMENTlink written 8.2 years ago by Chris Miller21k

Ok that page definitely helps, thanks. And what would you suggest for post processing? Thanks! -Matt

ADD REPLYlink written 8.2 years ago by Sortinghat10

Part of my confusion comes from the fact that in the third column I have the letter N for every read. Also, according to the link posted by rlong below, mpielup does not hold SNP information, and bcftools does the actual calling. Am I mixing something up?

Thanks again for your help! -Matt

ADD REPLYlink written 8.2 years ago by Sortinghat10

If every column has an N, that's not right. That can have a couple causes. One is that there's a disconnect between what your reference sequences are called in the reference fasta, and what they are names in the .bam file. The other possibility is the step of making the references fasta index failed. mpileup will attampt to make one if one isn't already present. If it fails to do so, it will say so, but it will continue to do the mpileup anyway, with all N's instead of that column containing the reference letter.

ADD REPLYlink written 8.2 years ago by Swbarnes21.5k

How would the name change between the fasta and the bam file? And mpileup technically only needs in.bam, so I'm not sure when you are refering to making the reference fasta index failed. My basic pipeline is bowtie to map reads to a reference, output as sam, convert to bam, sort, and generate mpileup. Any specific step you think I'm losing data? Thanks for your help! -Matt

ADD REPLYlink written 8.2 years ago by Sortinghat10

Check to see if there are weird characters, or spaces, that the ailgner might have truncated. Or, if your reference is not formatted right (one reason for this is if your reference sequences are one giant text string, with no internal newlines) the faidx step will fail. So just run samtools faidx on its own, and check the names there with the names in the .sam

mpileup needs a reference sequence if you want your pileup to have something other than N's in the reference letter column, or if you want to call SNPs.

Did you mean to leave out sorting the .bam?

ADD REPLYlink written 8.2 years ago by Swbarnes21.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1827 users visited in the last hour