Question: How to treat a BAM file with multiple read groups (RG)
18 months ago by
zamani_2012 wrote:

Hi everyone, I am pretty new to NGS data analysis. I downloaded a WES dataset in BAM format from the SRA database. I tried to mark duplicates in the BAM file using the MarkDuplicates command in Picard. However, I encountered an error in the terminal: "error parsing sam header. @rg line missing sm tag". I thought there was a problem with the @RG lines, so I ran "samtools view -H SRR1693634_NC_000005.9.sorted.bam | grep '@RG'" in the terminal to see the RG tags. The output is below:

@RG ID:FGC0630.4.ACTGAT
@RG ID:FGC0639.8.ACTGAT
@RG ID:FGC0639.7.ACTGAT
@RG ID:FGC0639.4.ACTGAT
@RG ID:FGC0639.6.ACTGAT
@RG ID:FGC0639.5.ACTGAT

Now I have two questions:

  1. What is the meaning of several read groups for a single sample? Are MarkDuplicates results for a sample with a single RG similar to those for the same sample with multiple RGs?

  2. The RGs in my sample lack some necessary information such as RGID, RGLB, RGPL, RGPU, RGSM. How can I obtain this information and then add it to the BAM file for each RG? As you know, AddOrReplaceReadGroups in Picard only handles a sample with a single RG.

18 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

What is the meaning of several read groups for a single sample?

https://software.broadinstitute.org/gatk/documentation/article?id=6472

Are MarkDuplicates results for a sample with a single RG similar to those for the same sample with multiple RGs?

It depends: imagine two reads from the same sample that come from two distinct DNA libraries. They cannot be considered duplicates of each other.
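To check whether the read groups in a BAM share a library, it can help to list the ID, LB and SM tags of each @RG line. The following is only a sketch: `list_read_groups` is a hypothetical helper name, `in.bam` is a placeholder, and it assumes the header fields are tab-separated as the SAM format requires. As far as I know, MarkDuplicates relies on the LB tag to tell libraries apart, so missing LB values are worth spotting.

```shell
# Print ID, LB and SM for each @RG header line read from stdin.
# Tags that are absent are printed as "." so gaps are easy to spot.
# list_read_groups is a hypothetical helper, not part of samtools or Picard.
list_read_groups() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
        /^@RG/ {
            id = lb = sm = "."
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                else if ($i ~ /^LB:/) lb = substr($i, 4)
                else if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            print id, lb, sm
        }'
}

# Usage sketch (in.bam is a placeholder file name):
#   samtools view -H in.bam | list_read_groups
```

For the header shown in the question, every row would show "." for LB and SM, since only ID is present.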

RGPL, RGPU, RGSM

Adding read group to bam files from multiplexed samples

Most of the time you only need SM (sample name), but it is always better to fill in all the information if you can. See https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

How can I obtain and then add these information to the bam file for each RG?

extract the sam header with samtools view -H

use sed to add the information to the header, something like:

sed '/^@RG/s/FGC0630.4.ACTGAT/FGC0630.4.ACTGAT\tSM:sample1/'

use samtools reheader to update the bam.
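Putting the three steps together, a minimal sketch could look like the following. It uses awk instead of sed so that all six @RG lines are handled in one pass (that is a substitution for convenience, not the sed command above). `add_sm_tag`, `sample1`, `in.bam`, `new_header.sam` and `out.bam` are placeholder names, and it assumes every read group in the file belongs to the same sample.

```shell
# Append an SM tag to every @RG header line that does not already have one.
# Reads a SAM header on stdin, writes the modified header to stdout.
# add_sm_tag is a hypothetical helper; the sample name is passed as $1.
add_sm_tag() {
    sample="$1"
    awk -v sm="$sample" 'BEGIN { OFS = "\t" }
        # @RG line without an SM field: append one, tab-separated
        /^@RG/ && $0 !~ /\tSM:/ { print $0, "SM:" sm; next }
        # everything else is passed through unchanged
        { print }'
}

# Usage sketch (file names and sample name are placeholders):
#   samtools view -H in.bam | add_sm_tag sample1 > new_header.sam
#   samtools reheader new_header.sam in.bam > out.bam
#   samtools view -H out.bam | grep '^@RG'    # check the result
```

Other tags such as LB, PL or PU could be appended in the same way if their values are known, for example from the SRA run metadata.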


Many thanks, Pierre.
As you can see, I have six RGs. Does this mean that the sample was sequenced in six lanes of a flowcell and that the six outputs were then merged into a single BAM file? Am I right? If so, how does Picard treat this merged BAM file when marking duplicates? Does it treat each read group separately, or all RGs together?

written 18 months ago by zamani_2012