I'd like to run mpileup on a bam file with two different samples. The samples are identified in the header in the following way:
@RG ID:ApeKI_s2.sort SM:ApeKI_s2.sort LB:citrus_ApeKI PL:ILLUMINA
@RG ID:ApeKI_s12.sort SM:ApeKI_s12.sort LB:citrus_ApeKI PL:ILLUMINA
I added the RG tag to all entries in the bam file using
samtools merge -rh header.sam merged.bam ApeKI_s12.sort.bam ApeKI_s2.sort.bam
The RG tags are present in the reads:
ApeKI_common_s2_5511537 99 scaffold00001 244 60 25S44M3D27M = 244 83 CTGCTTCCTTTATGTTCGTGCATTCTTCTTACTCTGATTTAGATGGCGAAGGTTTTAAGCTAACTTTTTATGTCAATTTTTAAATGGGTTTCTAAT FFFFHHHHHJJJJJJJJJHIJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJFHIJJJJJIJJJJHHHHHHEFFFFFFEEEDEEEEDDABDDDEEE NM:i:7 AS:i:42 XS:i:0 RG:Z:ApeKI_s2.sort
ApeKI_common__2474514 147 scaffold00001 244 60 16S44M3D36M4S = 244 -83 TTATGTTCGTGCATTCTTCTTACTCTGATTTAGATGGCGAAGGTTTTAAGCTAACTTTTTATGTCAATTTTTAAATGGGTTTCTAATACTGTTCAAGCTG DDDDDDDDDDDDEEEEFFFFD@HHHHHHIJJIJJJJJJJJJJJJJJJIHGGJIIJJJIJIIIJIHJJJJJJJJJJIJJJJJJIJIIHFHHHHDFFFFC@C NM:i:8 AS:i:46 XS:i:0 RG:Z:ApeKI_s12.sort
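A quick way to confirm that every read carries one of the two expected RG tags is to tally the tags in the SAM text. This is only a minimal sketch: `count_rg` is a hypothetical helper name, and it assumes the input is ordinary tab-delimited SAM on standard input (the records pasted above show spaces only because of how the forum renders them).

```shell
# Hypothetical helper: count reads per RG tag in SAM text read from stdin.
count_rg() {
    awk -F'\t' '
        /^@/ { next }                      # skip header lines
        {
            rg = "(none)"                  # label for reads without an RG tag
            for (i = 12; i <= NF; i++)     # optional tags start at column 12
                if ($i ~ /^RG:Z:/) rg = substr($i, 6)
            n[rg]++
        }
        END { for (rg in n) print rg "\t" n[rg] }
    ' | sort
}

# Usage (assumes samtools is on PATH):
#   samtools view merged.bam | count_rg
```

Any reads reported under `(none)` would have no read group and could not be assigned to a sample.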
When I run samtools mpileup on merged.bam I get this:
[mpileup] 2 samples in 1 input files
<mpileup> Set max per-file depth to 4000
scaffold00001 244 N 38 ^[T^[T^[T^]T^]T^]T^]t^]t^[T^]t^]t^]T^]t^]T^]t^]t^]T^[T^]t^]t^]T^]t^]T^]T^]t^]t^]t^]t^]T^]t^]T^]T^]t^]t^]T^]t^]t^]T JJJHIJFFJFDJDJFDJJFEJEJJEDFEJDHJFFJFEJ
scaffold00001 245 N 38 TTTTTTttTttTtTttTTttTtTTttttTtTTttTttT JJJIGJFFJFDJDJFDIJFEJEJJFFFDJDJJFFIFCJ
scaffold00001 246 N 38 CCCCCCccCccCcCccCCccCcCCccccCcCCccCccC JJJGJJFFJFDJDJFDJJFEJEJIFCDEJCJJFFJFDI
scaffold00001 247 N 38 TTTTTTttTttTtTttTTttTtTTttttTtTTttTttT JJJDJJFFJFEJEIFEJJFDJEJICEEEJEJJFFJDEJ
So samtools recognizes that there are indeed 2 samples in my input file. I would, however, expect to get the information for both samples on each line, as in the following output obtained by running:
samtools mpileup ApeKI_s2.bam ApeKI_s12.bam |head
[mpileup] 2 samples in 2 input files
<mpileup> Set max per-file depth to 4000
scaffold00001 244 N 22 ^]T^]T^]T^]t^]T^]t^]T^]T^]t^]t^]t^]t^]T^]t^]T^]T^]t^]t^]T^]t^]t^]T HIJEJEJJEDFEJDHJFFJFEJ 16 ^[T^[T^[T^]t^]t^[T^]t^]t^]T^]t^]T^]t^]t^]T^[T^]t JJJFFJFDJDJFDJJF
scaffold00001 245 N 22 TTTtTtTTttttTtTTttTttT IGJEJEJJFFFDJDJJFFIFCJ 16 TTTttTttTtTttTTt JJJFFJFDJDJFDIJF
scaffold00001 246 N 22 CCCcCcCCccccCcCCccCccC GJJEJEJIFCDEJCJJFFJFDI 16 CCCccCccCcCccCCc JJJFFJFDJDJFDJJF
scaffold00001 247 N 22 TTTtTtTTttttTtTTttTttT DJJDJEJICEEEJEJJFFJDEJ 16 TTTttTttTtTttTTt JJJFFJFEJEIFEJJF
I'd like to be able to recognize which sample gets which SNP calls. I thought the RG tags were supposed to remove the need for hundreds of different individual bam files. I'm running
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)
I know that similar questions have been asked, but none of the answers have pointed me in the right direction. Thanks in advance for your advice!
I only use samtools mpileup for single-sample analysis, not for multiple samples. I strongly suggest using GATK, which has no problem recognizing each sample for downstream variant selection or filtering. It has worked very well so far with my more than 10 samples.
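If you would rather stay with samtools, one thing that may be worth trying: as far as I know, the plain-text pileup prints one group of columns per input file, while the per-sample grouping by the SM field of @RG is applied when mpileup writes genotype likelihoods (-g or -u) for bcftools. The sketch below assumes the 0.1.x series of samtools and bcftools; `call_per_sample`, `ref.fa` and `calls.vcf` are hypothetical names, and I have not run this on your data.

```shell
# Hypothetical wrapper: call variants so that each sample (SM in @RG)
# gets its own genotype column in the VCF.
# Arguments: $1 = reference FASTA, $2 = merged BAM, $3 = output VCF.
call_per_sample() {
    # -u: uncompressed BCF output, -f: reference FASTA (samtools 0.1.x)
    # bcftools view -v: variant sites only, -c: call SNPs, -g: call genotypes
    samtools mpileup -uf "$1" "$2" | bcftools view -vcg - > "$3"
}

# Usage (ref.fa is an assumed reference file name):
#   call_per_sample ref.fa merged.bam calls.vcf
```

If the grouping works as described, the VCF header line should list ApeKI_s2.sort and ApeKI_s12.sort as separate sample columns.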