I have BAM files in which the @RG header line is the same for all samples, and I need to give each sample its own read group. Please note that these are sample-specific BAM files. First, I checked the existing read groups:
$ samtools view -H 4029_PPNI_WGS.bam | grep "^@RG"
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chrX LN:155270560
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr20        LN:63025520
@SQ     SN:chrY LN:59373566
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr22        LN:51304566
@SQ     SN:chr21        LN:48129895
@SQ     SN:chrM LN:16571
@RG     ID:DDGD PL:illumina     LB:HQ   SM:4029
All samples have the same RGID, which is DDGD.
I looked at Picard tools, and this is what they suggest for replacing read groups:
 java -jar picard.jar AddOrReplaceReadGroups \
       I=input.bam \
       O=output.bam \
       RGID=4 \
       RGLB=lib1 \
       RGPL=ILLUMINA \
       RGPU=unit1 \
       RGSM=20
If I run the above command, it assigns a single RGID to every read in the BAM file. What should my strategy be for assigning RGIDs in a BAM file correctly?
I do have the read names, but I am not sure how to derive RGIDs from them for this BAM.
an@virtual-workstation:/WGS/WGS$ samtools view 4029_PPNI_WGS.bam | head
HS2000-1111A_136:4:1303:15669:31420     99      chr1    10000   254     56M1I6M1I6M1I6M1I22M    =       10096   196     CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAAACCCTAAACCCTCAACCCTAACCCTAACCCTAACC    CCCFFFFFGHHHHJJJJJJIIIJJJJIJJJJJJIJJJJGGIJEDFHHIC9FGGJE>D;=DCA(77???################################  BC:Z:0  XD:Z:N55^1$6^1$6^1$6^1$22       SM:i:16 AS:i:420
HS2000-1111A_136:4:1207:4085:83323      163     chr1    10001   254     100M    =       10166   265     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAAC    @CCDDFFFHFDFFDHHIIIIJIIIEGHGIIJIJIIHGHGEIIEHDHEFEIGHGHGICC===EC@BDDDF9>=@=C@?B@CDBDBB?C,99>@>(222?C?  BC:Z:0  XD:Z:87C12      SM:i:2  AS:i:953
HS2000-1111A_136:6:2108:7980:36762      99      chr1    10001   65      100M    =       10251   350     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTCACCCTAACCCTAACCCTAACCCTAACCCTAAC    @@@FFFDAHHHBHEHIJJJFFHHGEEHGGGHIGGIIIECBDE;FBF;B(==;@F##############################################  BC:Z:0  XD:Z:64C2A32    SM:i:9  AS:i:65
HS2000-1111A_136:4:2208:20673:80720     99      chr1    10001   254     100M    =       10276   375     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC    @@?DDDDDBFFBDDAHEGE;?BBF@BFFGBDBDAFGD>?BDFB@;;?FDC;@AFA@DG9@9?;?B=>B;;AC>=CBBB?C???B299?BB8<9A?A<33<  BC:Z:0  XD:Z:100        SM:i:9  AS:i:503
HS2000-1111A_136:5:2202:3274:84881      99      chr1    10002   94      36M1I14M1I6M1I9M2I8M1I21M       =       10098   196     AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAACCCCAACCCCTAACCCCTAACCCCTCACCCCTACCCCCAAACCCCAACCCTAACC    CCCFFBDDFHHHGC@HHIIIIIIIIIIC)@?DCFG3DD*??G2?FDHH0;;;4@@1CC##########################################  AM:i:0  BC:Z:0  XD:Z:36^1$11T2^1$6^1$9^2$T1A5^1$A3T5T10 SM:i:0  AS:i:94
HS2000-1111A_136:5:2210:7977:80403      163     chr1    10002   254     100M    =       10443   541     AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCACCCCTAACC    CCCFFFFFHHHHHJJJJJFIIJIIIJJJIGJEIJJJJIJJJJGGJJJJIGFHHIIJJHIJHHHHFFFDFBCEDCDABD;(5(,555?B?###########  BC:Z:0  XD:Z:89T1A8     SM:i:12 AS:i:913
HS2000-1111A_136:4:1308:2945:46018      99      chr1    10004   254     100M    =       10156   252     CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    8?=DDFFFFDFFHIIIIIHIIIIIIIIIIGIIGC)B?B88?D@DH;BFHGC=CF;(.=@GFE;2?@B9;2;;>=;2;?229555(9((99ABB8<3(2?8  BC:Z:0  XD:Z:100        SM:i:7  AS:i:876
HS2000-1111A_136:5:2206:3416:11292      99      chr1    10005   254     100M    =       10105   200     CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA    CCCFFFFFHHHHHJJIJJJJJJIIJJIJJGHJGIJIEIIFIEGICHECFH@HIFIICGHEHF6?D>BF>6ACAB9;A?AA<AC5?A9(928?833+8?##  BC:Z:0  XD:Z:100        SM:i:17 AS:i:776
HS2000-1111A_136:6:1316:17054:3007      99      chr1    10005   254     100M    =       10287   382     CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA    CCCFFFFFHHHHHJJIJJJJIJJJJJJJIJJJIJJJJJIIIIJJGGJIJJGGGIJJGIGIHHGHEFFFBCCEEDBA?BBDBC?BDD?C(2?B1(9<AB##  BC:Z:0  XD:Z:100        SM:i:17 AS:i:906
HS2000-1111A_136:4:1115:14938:75430     99      chr1    10006   254     52M2I46M        =       10169   263     CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    CCCFFFFFHHHHHJJJJJJIIIJJJIIJJJJGGIIJJGHIJJJGGIIII@BGFHJDG<E7ABEBF);AB@;5=AB?A3<AB<?9ABB<?(2<?<C<BBD<  BC:Z:0  XD:Z:52^2$46    SM:i:16 AS:i:501
We cannot answer without knowing how those groups should be assigned (multiple libraries, multiple centers, multiple lanes, etc.).
Hi Pierre, I do have read information as shown in my question (I just updated). Is there a way I could use it?
I can see these could be used as RGIDs. What else do I need to use and how do I do it?
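Before picking IDs, it may help to see which prefixes actually occur in the whole file rather than in the first ten reads. Below is a minimal sketch, assuming the read names follow a <run>:<lane>:<tile>:<x>:<y> layout (as HS2000-1111A_136:4:1303:15669:31420 appears to). The function name list_lane_groups is made up for illustration.

```shell
# Hypothetical helper: read SAM records on stdin and print each distinct
# "<run>.<lane>" prefix with the number of reads that carry it.
# Assumes read names look like <run>:<lane>:<tile>:<x>:<y>.
list_lane_groups() {
    cut -f1 |
        awk -F: '{ n[$1 "." $2]++ } END { for (k in n) print k "\t" n[k] }' |
        sort
}

# Usage (assumes samtools is on PATH); this scans every read in the file:
# samtools view 4029_PPNI_WGS.bam | list_lane_groups
```

If more than one run or flowcell shows up in the output, lane numbers alone would not be unique, which is why the sketch keeps the run name in the key.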
You want to use lane numbers as "Read Groups"?
I think so, because they are unique. Right now I have a single RGID (the project name) for all 1000+ samples.
So you would actually be using something like this
That's right. But how do I add three read groups to one BAM? I am trying to use Picard's AddOrReplaceReadGroups.
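Since AddOrReplaceReadGroups puts every read into one group, per-lane groups in a single BAM would need the RG tag set read by read. One possible way is to stream the SAM text through awk. This is only a sketch under assumptions: the function name retag_by_lane is hypothetical, read names are assumed to be <run>:<lane>:<tile>:<x>:<y>, the ID format <run>.<lane> is an arbitrary choice, and the new @RG lines carry only ID, PL, LB and SM. Every lane present in the reads must be listed on the command line, otherwise some reads would point at an undeclared group.

```shell
# Hypothetical helper: rewrite SAM text on stdin so that each read carries
# an RG tag derived from its name. Arguments: sample, library, then the
# "<run>.<lane>" IDs to declare in the header.
retag_by_lane() {
    sample=$1; library=$2; shift 2
    awk -F'\t' -v OFS='\t' -v sm="$sample" -v lb="$library" -v groups="$*" '
        BEGIN { n = split(groups, g, " ") }
        /^@RG/ { next }                  # drop the old shared read group
        /^@/   { print; next }           # keep every other header line
        !emitted {                       # first read: write the new @RG lines
            for (i = 1; i <= n; i++)
                print "@RG", "ID:" g[i], "PL:illumina", "LB:" lb, "SM:" sm
            emitted = 1
        }
        {
            split($1, f, ":")            # f[1] = run, f[2] = lane
            out = $1
            for (i = 2; i <= NF; i++)    # copy fields, skipping any old RG tag
                if (i < 12 || $i !~ /^RG:Z:/) out = out OFS $i
            print out, "RG:Z:" f[1] "." f[2]
        }'
}

# Usage (assumes samtools is on PATH; the lane list is illustrative):
# samtools view -h 4029_PPNI_WGS.bam |
#     retag_by_lane 4029 HQ HS2000-1111A_136.4 HS2000-1111A_136.5 HS2000-1111A_136.6 |
#     samtools view -b -o 4029_PPNI_WGS.rg.bam -
```

Whether lanes are the right unit still depends on how the sequencing was actually organised, so it is worth checking the result with samtools view -H before using it downstream.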
@swbarnes2 posted about how to do that with this caveat.
You have read names but do you know which sample each read belongs to? Or the example you show above is just one sample?
If you have 1000 sample-specific files, add the read groups to each file and then merge.
Yes, it is just one sample as an example.
I have 1000 samples with the same problem, not 1000 BAMs for each sample.
For each sample you need, at a minimum, just one read group; that would allow you to merge the 1000 BAMs into one for variant calling.
So for sample 1 you can have
For sample 2
and so on
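As a rough sketch of the one-read-group-per-sample idea, the loop below prints (does not run) one Picard command per BAM. It assumes the files are named <sample>_PPNI_WGS.bam, as 4029_PPNI_WGS.bam is. The function name rg_command and the .rg.bam output suffix are made up for illustration, and RGLB and RGPU simply reuse the placeholder values from the Picard example in the question.

```shell
# Hypothetical helper: print the AddOrReplaceReadGroups command for one BAM.
# Assumes the sample name is the part of the file name before the first "_".
rg_command() {
    bam=$1
    sample=$(basename "$bam" .bam)
    sample=${sample%%_*}                 # e.g. 4029_PPNI_WGS -> 4029
    printf 'java -jar picard.jar AddOrReplaceReadGroups I=%s O=%s RGID=%s RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=%s\n' \
        "$bam" "${bam%.bam}.rg.bam" "$sample" "$sample"
}

# Print one command per sample BAM in the current directory.
for bam in *_PPNI_WGS.bam; do
    [ -e "$bam" ] || continue            # skip if the glob matched nothing
    rg_command "$bam"
done
```

Printing the commands first makes it easy to inspect them, or to hand them to a job scheduler, before anything is rewritten.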
Thank you. So I don't need three different RGIDs for one sample/BAM? Are you saying I can still merge 1000 samples for joint calling with only one (unique) RGID per sample? I thought the MarkDuplicates step requires all read groups to be defined properly within each BAM.
Are you completely sure that each lane should be a separate sample? Sure, samples could be arranged like that, I was suggesting that might be the case, but just because it might be like that doesn't mean it is.