How to treat a BAM file with multiple read groups (RG)
5.1 years ago

Hi everyone, I am pretty new to NGS data analysis. I downloaded a WES dataset in BAM format from the SRA database and tried to mark duplicates in the BAM file using Picard's MarkDuplicates command. However, I got this error in the terminal: "error parsing sam header. @rg line missing sm tag". I thought there was a problem with the @RG lines, so I ran "samtools view -H SRR1693634_NC_000005.9.sorted.bam | grep '@RG'" to inspect them. The output is below:

@RG ID:FGC0630.4.ACTGAT
@RG ID:FGC0639.8.ACTGAT
@RG ID:FGC0639.7.ACTGAT
@RG ID:FGC0639.4.ACTGAT
@RG ID:FGC0639.6.ACTGAT
@RG ID:FGC0639.5.ACTGAT

Now I have two questions:

  1. What does it mean to have several read groups for a single sample? Are the MarkDuplicates results for a sample with a single RG similar to those for the same sample with multiple RGs?

  2. The RGs in my sample lack some necessary information, such as RGID, RGLB, RGPL, RGPU and RGSM. How can I obtain this information and then add it to the BAM file for each RG? As you know, AddOrReplaceReadGroups in Picard only handles a sample with a single RG.

next-gen BAM GATK picard read group • 3.4k views
5.1 years ago

What does it mean to have several read groups for a single sample?

https://software.broadinstitute.org/gatk/documentation/article?id=6472

Are the MarkDuplicates results for a sample with a single RG similar to those for the same sample with multiple RGs?

It depends: imagine two reads from the same sample that come from two distinct DNA libraries. They cannot be considered duplicates of each other.
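To check whether the point about distinct libraries applies to a given file, one option is to list the LB values declared on the @RG header lines. This is only a sketch: it runs on a stand-in header whose read group IDs, sample name and LB values are all made up for illustration (the header shown in the question has no LB tags at all).

```shell
# Stand-in header with hypothetical IDs, sample name and libraries.
# With a real file you would instead run:
#   samtools view -H in.bam > header_lb.sam
printf '@RG\tID:rg1\tSM:sample1\tLB:libA\n@RG\tID:rg2\tSM:sample1\tLB:libA\n@RG\tID:rg3\tSM:sample1\tLB:libB\n' > header_lb.sam

# Put each tag of every @RG line on its own line, keep the LB tags,
# and count how many read groups declare each library.
grep '^@RG' header_lb.sam | tr '\t' '\n' | grep '^LB:' | sort | uniq -c > lb_counts.txt

cat lb_counts.txt
```

If more than one library is listed, reads coming from different libraries should not be expected to be flagged as duplicates of one another.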

RGPL, RGPU, RGSM

Adding read group to bam files from multiplexed samples

Most of the time you only need SM (sample name), but it is always better to fill in all the information if you can. See https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

How can I obtain this information and then add it to the BAM file for each RG?

Extract the SAM header with samtools view -H.

Use sed to add the information to the header, something like:

sed '/^@RG/s/FGC0630.4.ACTGAT/FGC0630.4.ACTGAT\tSM:sample1/'

Use samtools reheader to update the BAM file.
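The three steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the sample name sample1 and the file names in.bam, out.bam, header.sam and header.fixed.sam are placeholders, and the header is a small stand-in built with printf so that the sed step can be tried without a BAM file. The samtools calls for a real file are shown as comments. Unlike the one-ID substitution above, this variant appends the same SM tag to every @RG line, which assumes all read groups belong to one sample.

```shell
# Placeholder sample name; replace with the real sample ID.
sample="sample1"

# Step 1. Stand-in for the real header. With an actual BAM you would run:
#   samtools view -H in.bam > header.sam
printf '@HD\tVN:1.4\tSO:coordinate\n@RG\tID:FGC0630.4.ACTGAT\n@RG\tID:FGC0639.8.ACTGAT\n' > header.sam

# Step 2. Append a tab-separated SM tag to the end of every @RG line.
# A literal tab is stored in a variable so the command does not rely on
# sed interpreting \t in the replacement text.
tab=$(printf '\t')
sed "/^@RG/ s/\$/${tab}SM:${sample}/" header.sam > header.fixed.sam

cat header.fixed.sam

# Step 3. Write a new BAM carrying the edited header:
#   samtools reheader header.fixed.sam in.bam > out.bam
```

Other tags such as LB, PL or PU could be appended the same way, provided the true values are known from the sequencing run metadata.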


Many thanks, Pierre.
As you can see, I have six RGs. Does this mean that the sample was sequenced in six lanes of a flowcell and that the six outputs were then merged into a single BAM file? Am I right? If so, how does Picard treat this merged BAM file when marking duplicates? Does it treat each read group separately, or all RGs together?
