Samtools Mpileup Does Not Contain Sample Name
2
0
Entering edit mode
10.2 years ago
komal.rathi ★ 4.1k

Hi everyone,

I am using samtools mpileup to generate a mpileup file from 64 bam files using a set of chromosomal positions in bed format. The mpileup file does not contain the names of the samples of the input bam files. Is there a way by which you could keep the names of the samples intact in the pileup output? I have multiple samples from normal and heart failure tissue and it would be helpful if I could figure out any snps/indels in heart failure vs control samples. I used the following command:

samtools mpileup -f hg19.fa -l test.bed -b bamfile_list.txt > test.mpileup

*test.bed contains the positions for which pileup should be generated.
*bamfile_list.txt is my list of bam files, one file per line.

Also, will the tool take each file in order of its occurence in the bamfile_list.txt and create the columns? If yes, I could know which column belongs to which sample in the output by looking at the order at bamfile_list.txt.

Thanks

samtools mpileup • 6.0k views
ADD COMMENT
2
Entering edit mode

Does your bam files have sample name or SM tag with the unique sample name for each sample ?

ADD REPLY
1
Entering edit mode
10.2 years ago

The column information comes from the bam files. You need a line in each BAM file header stating the sample name. Here is what a line looks like in one of my BAM files.

@RG     ID:I248_FCB068VABXX_L5_COLcalRAIDIAAPE  SM:I248_FCB068VABXX_L5_COLcalRAIDIAAPE

The column name in the VCF will be: I248_FCB068VABXX_L5_COLcalRAIDIAAPE

ADD COMMENT
0
Entering edit mode

I don't want to call variants (VCF), just want to generate a mpileup file.

ADD REPLY
0
Entering edit mode

Okay then David W 's answer is what you want. Keep in mind working with pileups is difficult. You might also find that calling variants is useful for other analyses.

ADD REPLY
0
Entering edit mode

I know you are right. But the problem is we have a list of chromosomal positions for which we want to analyse the count of As, Ts, Gs and Cs in 64 samples. We don't want to call variants because it will not output all the positions and just the variants. Also, it won't tell us the A,T,G,C counts at each position for each sample.

ADD REPLY
0
Entering edit mode

Yes, I understand what you are trying to do. I would take a look at Unified Genotype's annotation options. There are several options that appear to do what you want...

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_AlleleBalanceBySample.html#VariantAnnotations

BaseCounts might do the trick?

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_BaseCounts.html

ADD REPLY
0
Entering edit mode

Thanks! Although there is no documentation for BaseCounts and AlleleBalanceBySample is still in the experimental stage, I will give it a try.

ADD REPLY
0
Entering edit mode

Yeah the documentation is kinda a bummer. Let us know if that works..

ADD REPLY
0
Entering edit mode

Why can't you run mpileup on 64 BAM files separately and then merge the results later into one file. The way you are doing is putting all the information together in one mpileup file. Instead create 64 pileups and then count A,T,C,G frequency.

ADD REPLY
0
Entering edit mode

Hmm I had given it a thought but as mpileup takes a lot of time, I just wanted to look at other options that can efficiently handle the output of a single mpileup file.

ADD REPLY
0
Entering edit mode

Well it can be a tedious task. But you can write some script that will automatically submit the jobs to cluster. As you said that you are not interested in all the positions but a set of coordinates from the bed file. It should be quick.

ADD REPLY
1
Entering edit mode
10.2 years ago
David W 4.9k

As far as I know, the pileup format doesn't provide informaton on the samples from which reads come. You could create a series of pileups for each of your samples (one per BAM? Or using samtools view with the R option to create BAMs which include only a subset of readgroups).

But getting from BAMs/pileups to genotypes and variants isn't straightfoward. Depending on exatly what you are trying to do it migt make more sense to use a variant caller (like samtools mpileup with the -g option or gatk) to identify differences among your samples

ADD COMMENT

Login before adding your answer.

Traffic: 1746 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6