Hi everyone, I am pretty new to the NGS data analysis. I have downloaded a WES dataset at BAM file format from SRA database. I try to mark duplicates in bam file using MarkDuplicates command at Picard . However, I encountered an error in terminal "error parsing sam header. @rg line missing sm tag". I tough that there is an error in @RG tag. So I run a command in terminal i.e. "samtools view -H SRR1693634_NC_000005.9.sorted.bam | grep '@RG' to see rg tag". Output is below
@RG ID:FGC0630.4.ACTGAT
@RG ID:FGC0639.8.ACTGAT
@RG ID:FGC0639.7.ACTGAT
@RG ID:FGC0639.4.ACTGAT
@RG ID:FGC0639.6.ACTGAT
@RG ID:FGC0639.5.ACTGAT
Now I have two questions:
What is meaning of the several read groups for a single sample? Are MarkDuplicates results for a sample with single RG are similar to the that sample with multiple RGs?
RGs in my sample lacks some necessary information such as RGID, RGLB, RGPL, RGPU, RGSM. How can I obtain and then add these information to the bam file for each RG? As you know AddOrReplaceReadGroups in Picard only treats sample with a single RG.
Many thanks Pierre.
As you see, I have six RGs. Does it mean that the sample was sequenced in six lanes of flowcell and then these six outputs were merged into unique BAM file? Do I think right? If yes, how Picard treat this merged BAM file to mark duplicates? Does it separately treated each read group or treat all RGs simultaneously?