How to assign read groups to bam files?
1
1
Entering edit mode
3.5 years ago
MAPK ★ 2.1k

I have BAM files with RG tag which is same for all samples. I need to add read groups to the BAM files for all samples. Please note these are sample specific bam files. So first, I checked the RGs:

$samtools view -H 4029_PPNI_WGS.bam | grep "^@RG"


@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chrX LN:155270560
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr20        LN:63025520
@SQ     SN:chrY LN:59373566
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr22        LN:51304566
@SQ     SN:chr21        LN:48129895
@SQ     SN:chrM LN:16571
@RG     ID:DDGD PL:illumina     LB:HQ   SM:4029

I have same RGID for all samples which is DDGD.

I was looking at picard tools and this is what they suggested to replace the RGIDs:

 java -jar picard.jar AddOrReplaceReadGroups \
       I=input.bam \
       O=output.bam \
       RGID=4 \
       RGLB=lib1 \
       RGPL=ILLUMINA \
       RGPU=unit1 \
       RGSM=20

If I run the above command, it assigns only one RGID to all read groups in a bam file. What should be my strategy to replace/assign RGIDs in a bam correctly?

I do have read information, but I am not sure how to assign rgID to this bam.

an@virtual-workstation:/WGS/WGS$ samtools view 4029_PPNI_WGS.bam | head
HS2000-1111A_136:4:1303:15669:31420     99      chr1    10000   254     56M1I6M1I6M1I6M1I22M    =       10096   196     CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAAACCCTAAACCCTCAACCCTAACCCTAACCCTAACC    CCCFFFFFGHHHHJJJJJJIIIJJJJIJJJJJJIJJJJGGIJEDFHHIC9FGGJE>D;=DCA(77???################################  BC:Z:0  XD:Z:N55^1$6^1$6^1$6^1$22       SM:i:16 AS:i:420
HS2000-1111A_136:4:1207:4085:83323      163     chr1    10001   254     100M    =       10166   265     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAAC    @CCDDFFFHFDFFDHHIIIIJIIIEGHGIIJIJIIHGHGEIIEHDHEFEIGHGHGICC===EC@BDDDF9>=@=C@?B@CDBDBB?C,99>@>(222?C?  BC:Z:0  XD:Z:87C12      SM:i:2  AS:i:953
HS2000-1111A_136:6:2108:7980:36762      99      chr1    10001   65      100M    =       10251   350     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTCACCCTAACCCTAACCCTAACCCTAACCCTAAC    @@@FFFDAHHHBHEHIJJJFFHHGEEHGGGHIGGIIIECBDE;FBF;B(==;@F##############################################  BC:Z:0  XD:Z:64C2A32    SM:i:9  AS:i:65
HS2000-1111A_136:4:2208:20673:80720     99      chr1    10001   254     100M    =       10276   375     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC    @@?DDDDDBFFBDDAHEGE;?BBF@BFFGBDBDAFGD>?BDFB@;;?FDC;@AFA@DG9@9?;?B=>B;;AC>=CBBB?C???B299?BB8<9A?A<33<  BC:Z:0  XD:Z:100        SM:i:9  AS:i:503
HS2000-1111A_136:5:2202:3274:84881      99      chr1    10002   94      36M1I14M1I6M1I9M2I8M1I21M       =       10098   196     AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAACCCCAACCCCTAACCCCTAACCCCTCACCCCTACCCCCAAACCCCAACCCTAACC    CCCFFBDDFHHHGC@HHIIIIIIIIIIC)@?DCFG3DD*??G2?FDHH0;;;4@@1CC##########################################  AM:i:0  BC:Z:0  XD:Z:36^1$11T2^1$6^1$9^2$T1A5^1$A3T5T10 SM:i:0  AS:i:94
HS2000-1111A_136:5:2210:7977:80403      163     chr1    10002   254     100M    =       10443   541     AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCACCCCTAACC    CCCFFFFFHHHHHJJJJJFIIJIIIJJJIGJEIJJJJIJJJJGGJJJJIGFHHIIJJHIJHHHHFFFDFBCEDCDABD;(5(,555?B?###########  BC:Z:0  XD:Z:89T1A8     SM:i:12 AS:i:913
HS2000-1111A_136:4:1308:2945:46018      99      chr1    10004   254     100M    =       10156   252     CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    8?=DDFFFFDFFHIIIIIHIIIIIIIIIIGIIGC)B?B88?D@DH;BFHGC=CF;(.=@GFE;2?@B9;2;;>=;2;?229555(9((99ABB8<3(2?8  BC:Z:0  XD:Z:100        SM:i:7  AS:i:876
HS2000-1111A_136:5:2206:3416:11292      99      chr1    10005   254     100M    =       10105   200     CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA    CCCFFFFFHHHHHJJIJJJJJJIIJJIJJGHJGIJIEIIFIEGICHECFH@HIFIICGHEHF6?D>BF>6ACAB9;A?AA<AC5?A9(928?833+8?##  BC:Z:0  XD:Z:100        SM:i:17 AS:i:776
HS2000-1111A_136:6:1316:17054:3007      99      chr1    10005   254     100M    =       10287   382     CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA    CCCFFFFFHHHHHJJIJJJJIJJJJJJJIJJJIJJJJJIIIIJJGGJIJJGGGIJJGIGIHHGHEFFFBCCEEDBA?BBDBC?BDD?C(2?B1(9<AB##  BC:Z:0  XD:Z:100        SM:i:17 AS:i:906
HS2000-1111A_136:4:1115:14938:75430     99      chr1    10006   254     52M2I46M        =       10169   263     CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    CCCFFFFFHHHHHJJJJJJIIIJJJIIJJJJGGIIJJGHIJJJGGIIII@BGFHJDG<E7ABEBF);AB@;5=AB?A3<AB<?9ABB<?(2<?<C<BBD<  BC:Z:0  XD:Z:52^2$46    SM:i:16 AS:i:501
BAM • 5.7k views
ADD COMMENT
0
Entering edit mode

we cannot answer if we don't know how those group should be assigned (multiple libraries, multiple centers, multiple lanes, etc...)

ADD REPLY
0
Entering edit mode

Hi Pierre, I do have read information as shown in my question (I just updated). Is there a way I could use it?

I can see these could be used as RGIDs. What else do I need to use and how do I do it?

HS2000-1111A_136:4
HS2000-1111A_136:5
HS2000-1111A_136:5
ADD REPLY
0
Entering edit mode

You want to use lane numbers as "Read Groups"?

ADD REPLY
0
Entering edit mode

I think so, because these are unique. What I have right now is one RGID (Project name) for all 1000+ samples

ADD REPLY
1
Entering edit mode

So you would actually be using something like this

@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     SM:SAMPLE1 
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     SM:SAMPLE1 
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     SM:SAMPLE10
ADD REPLY
0
Entering edit mode

That's right. But how do I add three read groups to one bam? I am trying to use picard's AddOrReplaceReadGroups.

ADD REPLY
0
Entering edit mode

@swbarnes2 posted about how to do that with this caveat.

You have read names but do you know which sample each read belongs to? Or the example you show above is just one sample?

If you have 1000 sample specific files, add the read groups to each file and then merge.

ADD REPLY
0
Entering edit mode

Yes it is for just one sample as an example.

ADD REPLY
0
Entering edit mode

I have 1000 samples with same problem. Not 1000 bams for each sample.

ADD REPLY
1
Entering edit mode

For each sample you just need one read group at a minimum which would allow you to merge the 1000 BAM's into one for variant calling?

So for sample 1 you can have

@RG     ID:FLOWCELL1     PL:ILLUMINA     SM:SAMPLE1

For sample 2

@RG     ID:FLOWCELL2      PL:ILLUMINA     SM:SAMPLE2

and so on

@RG     ID:FLOWCELL50      PL:ILLUMINA     SM:SAMPLE1000
ADD REPLY
0
Entering edit mode

Thank you. So I don't need three different RGIDs for one sample/bam? Are you saying I can still merge 1000 samples in a joint call by having only one (unique) RGID per sample? I thought markduplicate step requires all read groups defined properly within each bam.

ADD REPLY
0
Entering edit mode

Are you completely sure that each lane should be a separate sample? Sure, samples could be arranged like that, I was suggesting that might be the case, but just because it might be like that doesn't mean it is.

ADD REPLY
0
Entering edit mode
3.5 years ago

You will probably have to split the file into multiple separate bams, add the read groups, then merge them back together. The read names might tell you which read belongs to which sample.

EDIT: If things were not done right, the smartest thing to do is to stop and redo them correctly yourself. If you aren't completely sure that you can split the bams into sample-specific bams correctly, get the original fastqs, and start over.

ADD COMMENT
0
Entering edit mode

Hi @swbarnes, Is this a valid approach?

Split input bam file by flowcell lane:
# for lane 6
samtools view -H infile.bam > header.sam
(cat header.sam; samtools view input.bam | grep 'HS2000-1015_160:6') | samtools view -S  > lane6.bam

# then add read group
PICARD="/usr/local/genome/picard_latest/picard.jar"
 java -jar ${PICARD} AddOrReplaceReadGroups \
       I=lane6.bam \
       O=output_lane6.bam \
       RGID=${flowcell}.6 \
       RGLB=lib1 \
       RGPL=ILLUMINA \
       RGPU=unit1 \
       RGSM=20

# then merge all bam files
samtools merge output_lane6.bam output_lane7.bam output_lane8.bam

Also, Is there a way to quickly extract flowcell lane information from bam file?

ADD REPLY
0
Entering edit mode

You need the answer this important question posed by @swbarnes 24 days ago in this thread : C: How to assign read groups to bam files?

What is the relationship of samples and lanes? Have the samples not been demultiplexed (i.e. you don't have a BAM per sample). How can you tell the samples apart if you just have a giant BAM file per flowcell.

ADD REPLY
0
Entering edit mode

@genomax I do have a BAM per sample. I just need to add read groups to those bam files.

ADD REPLY
1
Entering edit mode

If you have a BAM per sample then you could simply use

@RG     ID:SAMPLE2      PL:ILLUMINA     SM:SAMPLE2

If a same sample was run on multiple lanes over different FC then you could use

@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB1      SM:SAMPLE1 
@RG     ID:FLOWCELL2.LANE5      PL:ILLUMINA     LB:LIB1      SM:SAMPLE1

If you have 2 BAM files for SAMPLE1 (from one library) that are lane specific for each file.

This is described on GATK read groups help page.

ADD REPLY
0
Entering edit mode

Yes, it's one bam per sample, but ran on multiple lanes of the same flowcell. I wasn't sure about the commands I was using.

ADD REPLY
1
Entering edit mode

If you have one bam per sample, what problem are you having? Just run the replace read groups command for each file one at a time with a loop. It is perfectly common for one sample to be run across multiple lanes.

ADD REPLY
0
Entering edit mode

@genomax I have a follow up question based on the link. I think 2 lanes on the same flowcell have less variation than 2 lanes on different flowcells, but the effect is still there. BroadInstitute thinks you get better error correction going by lane, so wouldn't you consider using RG per lane also for one bam per sample ? I am sure we can test this by doing the BQSR plots with AnalyzeCovariates.

ADD REPLY
0
Entering edit mode

You have evidence of lane differences within your samples, or you are just presuming there are differences? I don't think you should be worrying about lanes at all.

ADD REPLY
0
Entering edit mode

@swbarnes2 No I was just guessing if batch correction is needed even within the same flowcells. I have not tested it yet, but I was told to do RG per lane in my lab.

ADD REPLY

Login before adding your answer.

Traffic: 1835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6