Question

Per Sample Mpileup Output From Merged Bam File

1

Entering edit mode

10.9 years ago

thomasvangurp ▴ 20

I'd like to run mpileup on a bam file with two different samples. The samples are identified in the header in the following way:

@RG    ID:ApeKI_s2.sort    SM:ApeKI_s2.sort    LB:citrus_ApeKI    PL:ILLUMINA
@RG    ID:ApeKI_s12.sort    SM:ApeKI_s12.sort    LB:citrus_ApeKI    PL:ILLUMINA

I added the RG tag to all entries in the bam file using

samtools merge -rh header.sam merged.bam ApeKI_s12.sort.bam ApeKI_s2.sort.bam

The RG tags are present in the reads:

ApeKI_common_s2_5511537    99    scaffold00001    244    60    25S44M3D27M    =    244    83    CTGCTTCCTTTATGTTCGTGCATTCTTCTTACTCTGATTTAGATGGCGAAGGTTTTAAGCTAACTTTTTATGTCAATTTTTAAATGGGTTTCTAAT    FFFFHHHHHJJJJJJJJJHIJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJFHIJJJJJIJJJJHHHHHHEFFFFFFEEEDEEEEDDABDDDEEE    NM:i:7    AS:i:42    XS:i:0    RG:Z:ApeKI_s2.sort
ApeKI_common__2474514    147    scaffold00001    244    60    16S44M3D36M4S    =    244    -83    TTATGTTCGTGCATTCTTCTTACTCTGATTTAGATGGCGAAGGTTTTAAGCTAACTTTTTATGTCAATTTTTAAATGGGTTTCTAATACTGTTCAAGCTG    DDDDDDDDDDDDEEEEFFFFD@HHHHHHIJJIJJJJJJJJJJJJJJJIHGGJIIJJJIJIIIJIHJJJJJJJJJJIJJJJJJIJIIHFHHHHDFFFFC@C    NM:i:8    AS:i:46    XS:i:0    RG:Z:ApeKI_s12.sort

when I run samtools mpileup on merged.bam I get this:

[mpileup] 2 samples in 1 input files
<mpileup> Set max per-file depth to 4000
scaffold00001    244    N    38    ^[T^[T^[T^]T^]T^]T^]t^]t^[T^]t^]t^]T^]t^]T^]t^]t^]T^[T^]t^]t^]T^]t^]T^]T^]t^]t^]t^]t^]T^]t^]T^]T^]t^]t^]T^]t^]t^]T    JJJHIJFFJFDJDJFDJJFEJEJJEDFEJDHJFFJFEJ
scaffold00001    245    N    38    TTTTTTttTttTtTttTTttTtTTttttTtTTttTttT    JJJIGJFFJFDJDJFDIJFEJEJJFFFDJDJJFFIFCJ
scaffold00001    246    N    38    CCCCCCccCccCcCccCCccCcCCccccCcCCccCccC    JJJGJJFFJFDJDJFDJJFEJEJIFCDEJCJJFFJFDI
scaffold00001    247    N    38    TTTTTTttTttTtTttTTttTtTTttttTtTTttTttT    JJJDJJFFJFEJEIFEJJFDJEJICEEEJEJJFFJDEJ

So samtools recognizes that there are indeed 2 samples in my input file. I would however expect to get SNP information on both lines as in the following output obtained by running:

samtools mpileup ApeKI_s2.bam ApeKI_s12.bam |head
[mpileup] 2 samples in 2 input files
<mpileup> Set max per-file depth to 4000
scaffold00001    244    N    22    ^]T^]T^]T^]t^]T^]t^]T^]T^]t^]t^]t^]t^]T^]t^]T^]T^]t^]t^]T^]t^]t^]T    HIJEJEJJEDFEJDHJFFJFEJ    16    ^[T^[T^[T^]t^]t^[T^]t^]t^]T^]t^]T^]t^]t^]T^[T^]t    JJJFFJFDJDJFDJJF
scaffold00001    245    N    22    TTTtTtTTttttTtTTttTttT    IGJEJEJJFFFDJDJJFFIFCJ    16    TTTttTttTtTttTTt    JJJFFJFDJDJFDIJF
scaffold00001    246    N    22    CCCcCcCCccccCcCCccCccC    GJJEJEJIFCDEJCJJFFJFDI    16    CCCccCccCcCccCCc    JJJFFJFDJDJFDJJF
scaffold00001    247    N    22    TTTtTtTTttttTtTTttTttT    DJJDJEJICEEEJEJJFFJDEJ    16    TTTttTttTtTttTTt    JJJFFJFEJEIFEJJF

I'd like to be able to recognize which sample get which SNP calls. I thought the RG tags were supposed to fix the need for having hundreds of different individual bam files. I'm running

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)

I know that similar questions have been asked, but none of the answers have pointed me in the right direction. Thanks in advance for your advice!

mpileup samtools • 6.4k views

ADD COMMENT • link updated 4.5 years ago by wkretzsch • 0 • written 10.9 years ago by thomasvangurp ▴ 20

0

Entering edit mode

I just use samptools mpileup for single sample analaysis but not for multiple samples. Strongly suggest you to use GATK which has no such problem to recognize each of the sample for the downstream variant selections or filtering. It works very good so far with my more than 10 samples.

ADD REPLY • link 10.9 years ago by Tonyzeng ▴ 310

Ram · Answer 1 · 2013-12-05

2

Entering edit mode

10.9 years ago

Ying W ★ 4.3k

Cross posted on samtools-help

Only the BCF output groups samples. The text output does not.

Heng

ADD COMMENT • link 10.9 years ago by Ying W ★ 4.3k

0

Entering edit mode

Thanks A lot for posting this here Ying, sorry for crossposting. I still think it's useful as samtools-help is not as easily searchable as biostars.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.9 years ago by thomasvangurp ▴ 20

Ram · Answer 2 · 2013-12-05

0

Entering edit mode

10.9 years ago

Ashutosh Pandey 12k

RG ID tag is a lane specific tag and is used to distinguish reads from one lane of sequences from another. There is a fixed naming space or limited read ids that different lanes in sequencer use. Two reads, one from each different lane may have same read name. If you need to merge the aligned bam file for each lane, then there can be a problem as reads with same name from different lanes will clash with each other. In this case, people add RG ID that maintains the lane identity of that read. Earlier a sloppy way was to add the lane name to the read name.

Well the way you have used RG ID here makes sense but may not be correct and samtools mpileup function may not interpret in the way you want it to.

ADD COMMENT • link 10.9 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks, I thought that the SM tag in in the @RG header was meant to achieve calling variants from a single file. From http://samtools.sourceforge.net/mpileup.shtml:

"SAMtools acquires sample information from the SM tag in the @RG header lines. One alignment file can contain multiple samples; reads from one sample can also be distributed in different alignment files. SAMtools will regroup the reads anyway. In addition, if no @RG lines are present, each alignment file is taken as one sample."

From this I understand that given different SM tag mpileup should consider the samples these refer to to be different and report on these in identical fashion as would be done given separate bam file.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.9 years ago by thomasvangurp ▴ 20

0

Entering edit mode

I am sorry for the confusion. I didnt read your question carefully. What I explained about RG ID is correct but your problem has nothing to do with. Even though you have wrongly used RGID but you have used SM tag correctly and in this case you should get two outputs for each sample. You get two outputs when passed separately but only get one when you supply a merged BAM file. The good thing is that samtools identifies there are two samples in both case. May be the default is a combined output for the merged bam and separate output for separate bams. I looked at mpileup functions and there is -D (Output per-sample read depth) parameter. I am not sure if it will work but can you try it with your merged bam.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.9 years ago by Ashutosh Pandey 12k

score 0 · Answer 3 · 2020-05-05

0

Entering edit mode

4.5 years ago

wkretzsch • 0

I really needed this feature, so I wrote a tool for this: streaming_pileupy can be found at https://pypi.org/project/streaming-pileupy/

ADD COMMENT • link 4.5 years ago by wkretzsch • 0