What Is Mpileup Data And Where Is The Snp Data?
2
1
Entering edit mode
12.4 years ago
Sortinghat ▴ 10

I know this is a very basic question, but I've been having difficulties discovering how to analyze my mpileup data.

Because I feel that I don't know how to analyze it, I am questioning whether I understand what it is. So this is a simple question, but is necessary:

What is contained in mpileup data? According to samtools man "In the pileup format (without -uor-g), each line represents a genomic position, consisting of chromosome name, coordinate, reference base, read bases, read qualities and alignment mapping qualities."

But I fail to see where my SNP information is. I am simply looking to determine the location of SNPs that were part of the input that was run through bowtie.

If I'm not being clear, please let me know and I will add more details on what I am specifically trying to do.

Thanks in advance! -SortingHat

EDIT: So it seems that if I run pileup (I know it is depricated) that my SNPs come through perfectly, but when I run mpileup I get a ton of garbage data which has made this whole process very difficult. How can I cut down mpileup to reflect what pileup represents?

snp samtools mpileup • 18k views
ADD COMMENT
2
Entering edit mode
12.4 years ago
Rlong ▴ 340

It would be helpful to see the exact command you are using to run mpileup. If you are taking the raw output, you will not have a consensus called. In order to get consensus, you should try running mpileup as specified here:

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf

The old pileup command used to provide a consensus, but mpileup does not. As Chris suggested, you will probably want to do some post processing. The above link to the mpileup sourceforge page can get you started on that as well.

ADD COMMENT
0
Entering edit mode

I used this: samtools mpileup inflA.sorted.bam iflA.sorted.mpileup

I assumed inflA.sorted.bam was my consensus sequence.

Thanks for the link, I'll definitely run through that information.

ADD REPLY
0
Entering edit mode

Do you have any suggestions for post processing? Thanks again! -Matt

ADD REPLY
1
Entering edit mode
12.4 years ago

The spec page for the pileup format should answer most of your questions. Every line ouput by mpileup is a potential SNP, and the information on which bases are present is encoded in the 5th column. For a realistic pipeline, post-processing is generally used to filter this list down to a high-quality and more manageable list.

ADD COMMENT
0
Entering edit mode

Ok that page definitely helps, thanks. And what would you suggest for post processing? Thanks! -Matt

ADD REPLY
0
Entering edit mode

Part of my confusion comes from the fact that in the third column I have the letter N for every read. Also, according to the link posted by rlong below, mpielup does not hold SNP information, and bcftools does the actual calling. Am I mixing something up?

Thanks again for your help! -Matt

ADD REPLY
0
Entering edit mode

If every column has an N, that's not right. That can have a couple causes. One is that there's a disconnect between what your reference sequences are called in the reference fasta, and what they are names in the .bam file. The other possibility is the step of making the references fasta index failed. mpileup will attampt to make one if one isn't already present. If it fails to do so, it will say so, but it will continue to do the mpileup anyway, with all N's instead of that column containing the reference letter.

ADD REPLY
0
Entering edit mode

How would the name change between the fasta and the bam file? And mpileup technically only needs in.bam, so I'm not sure when you are refering to making the reference fasta index failed. My basic pipeline is bowtie to map reads to a reference, output as sam, convert to bam, sort, and generate mpileup. Any specific step you think I'm losing data? Thanks for your help! -Matt

ADD REPLY
0
Entering edit mode

Check to see if there are weird characters, or spaces, that the ailgner might have truncated. Or, if your reference is not formatted right (one reason for this is if your reference sequences are one giant text string, with no internal newlines) the faidx step will fail. So just run samtools faidx on its own, and check the names there with the names in the .sam

mpileup needs a reference sequence if you want your pileup to have something other than N's in the reference letter column, or if you want to call SNPs.

Did you mean to leave out sorting the .bam?

ADD REPLY

Login before adding your answer.

Traffic: 2996 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6