Question: Samtools Mpileup Does Not Contain Sample Name
0
gravatar for komal.rathi
5.2 years ago by
komal.rathi3.4k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.4k wrote:

Hi everyone,

I am using samtools mpileup to generate a mpileup file from 64 bam files using a set of chromosomal positions in bed format. The mpileup file does not contain the names of the samples of the input bam files. Is there a way by which you could keep the names of the samples intact in the pileup output? I have multiple samples from normal and heart failure tissue and it would be helpful if I could figure out any snps/indels in heart failure vs control samples. I used the following command:

samtools mpileup -f hg19.fa -l test.bed -b bamfile_list.txt > test.mpileup

*test.bed contains the positions for which pileup should be generated.
*bamfile_list.txt is my list of bam files, one file per line.

Also, will the tool take each file in order of its occurence in the bamfile_list.txt and create the columns? If yes, I could know which column belongs to which sample in the output by looking at the order at bamfile_list.txt.

Thanks

samtools mpileup • 3.5k views
ADD COMMENTlink modified 5.2 years ago by David W4.7k • written 5.2 years ago by komal.rathi3.4k
2

Does your bam files have sample name or SM tag with the unique sample name for each sample ?

ADD REPLYlink written 5.2 years ago by Ashutosh Pandey11k
1
gravatar for Zev.Kronenberg
5.2 years ago by
United States
Zev.Kronenberg11k wrote:

The column information comes from the bam files. You need a line in each BAM file header stating the sample name. Here is what a line looks like in one of my BAM files.

@RG     ID:I248_FCB068VABXX_L5_COLcalRAIDIAAPE  SM:I248_FCB068VABXX_L5_COLcalRAIDIAAPE

The column name in the VCF will be: I248_FCB068VABXX_L5_COLcalRAIDIAAPE

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Zev.Kronenberg11k

I don't want to call variants (VCF), just want to generate a mpileup file.

ADD REPLYlink written 5.2 years ago by komal.rathi3.4k

Okay then David W 's answer is what you want. Keep in mind working with pileups is difficult. You might also find that calling variants is useful for other analyses.

ADD REPLYlink written 5.2 years ago by Zev.Kronenberg11k

I know you are right. But the problem is we have a list of chromosomal positions for which we want to analyse the count of As, Ts, Gs and Cs in 64 samples. We don't want to call variants because it will not output all the positions and just the variants. Also, it won't tell us the A,T,G,C counts at each position for each sample.

ADD REPLYlink written 5.2 years ago by komal.rathi3.4k

Yes, I understand what you are trying to do. I would take a look at Unified Genotype's annotation options. There are several options that appear to do what you want...

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_AlleleBalanceBySample.html#VariantAnnotations

BaseCounts might do the trick?

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_BaseCounts.html

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by Zev.Kronenberg11k

Thanks! Although there is no documentation for BaseCounts and AlleleBalanceBySample is still in the experimental stage, I will give it a try.

ADD REPLYlink written 5.2 years ago by komal.rathi3.4k

Yeah the documentation is kinda a bummer. Let us know if that works..

ADD REPLYlink written 5.2 years ago by Zev.Kronenberg11k

Why can't you run mpileup on 64 BAM files separately and then merge the results later into one file. The way you are doing is putting all the information together in one mpileup file. Instead create 64 pileups and then count A,T,C,G frequency.

ADD REPLYlink written 5.2 years ago by Ashutosh Pandey11k

Hmm I had given it a thought but as mpileup takes a lot of time, I just wanted to look at other options that can efficiently handle the output of a single mpileup file.

ADD REPLYlink written 5.2 years ago by komal.rathi3.4k

Well it can be a tedious task. But you can write some script that will automatically submit the jobs to cluster. As you said that you are not interested in all the positions but a set of coordinates from the bed file. It should be quick.

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by Ashutosh Pandey11k
1
gravatar for David W
5.2 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

As far as I know, the pileup format doesn't provide informaton on the samples from which reads come. You could create a series of pileups for each of your samples (one per BAM? Or using samtools view with the R option to create BAMs which include only a subset of readgroups).

But getting from BAMs/pileups to genotypes and variants isn't straightfoward. Depending on exatly what you are trying to do it migt make more sense to use a variant caller (like samtools mpileup with the -g option or gatk) to identify differences among your samples

ADD COMMENTlink written 5.2 years ago by David W4.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1213 users visited in the last hour