MarkDuplicates showing error, not recognising SM tag present in bam Header
4.4 years ago

Hi all!

I am following the latest GATK blog posts to do WGS data analysis. However, I am stuck at the MarkDuplicates step because of the following error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG     ID:AD0772_S2_L004       LB:L004 PL:ILLUMINA     PU:HK522DSXX; File /scratch/parashar/align_tc/AD0772_S2_L004_sort.bam; Line number 95
        at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(
        at htsjdk.samtools.SAMTextHeaderCodec.access$200(
        at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(
        at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(
        at htsjdk.samtools.SAMTextHeaderCodec.decode(
        at htsjdk.samtools.BAMFileReader.readHeader(
        at htsjdk.samtools.BAMFileReader.<init>(
        at htsjdk.samtools.BAMFileReader.<init>(
        at htsjdk.samtools.SamReaderFactory$
        at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(
        at picard.sam.markduplicates.MarkDuplicates.doWork(
        at picard.cmdline.CommandLineProgram.instanceMain(
        at picard.cmdline.PicardCommandLine.instanceMain(
        at picard.cmdline.PicardCommandLine.main(

I checked my SAM file to confirm that the @RG header line was added during alignment, and yes, it was there:

$grep "@RG" AD0772_S2_L004_aln.sam 
output: @RG ID:AD0772_S2_L004   LB:L004 PL:ILLUMINA PU:HK522DSXX
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -R @RG\tID:AD0772_S2_L004\tLB:L004\tPL:ILLUMINA\tPU:HK522DSXX /home/parashar/archive/Megha/bwa_0.7.17/hg19.fa /home/parashar/archive/wgs/raw_data/AD0772_S2_L004_R1_001.fastq.gz /home/parashar/archive/wgs/raw_data/AD0772_S2_L004_R2_001.fastq.gz

I also checked the sorted file that I was using as input to MarkDuplicates:

$samtools view AD0772_S2_L004_sort.bam | grep "@RG"

It gives the same result.
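
Note that `samtools view` without `-H` (or `-h`) prints only alignment records, so grepping its output does not inspect the header. A minimal sketch for listing `@RG` lines that lack an `SM:` field is given here; the helper name `rg_missing_sm` is hypothetical, and the usage line assumes samtools is on the PATH:

```shell
# Print every @RG header line that has no SM: field.
# Reads SAM header text on stdin; fields are tab-separated.
rg_missing_sm() {
  awk -F'\t' '
    /^@RG/ {
      has_sm = 0
      for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) has_sm = 1
      if (!has_sm) print
    }'
}

# Usage (not run here): -H prints only the header of the BAM file.
# samtools view -H AD0772_S2_L004_sort.bam | rg_missing_sm
```

Any line this prints is a read group that tools requiring a sample name are likely to reject.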

I am not able to figure out why it is happening!

The command I used for MarkDuplicates is:

cat samlist.txt | parallel --max-procs=5 "picard MarkDuplicates I={}_sort.bam O={}_dedup.bam M=mark_dup_metrics.txt ASSUME_SORTED=true 2> {}.stderr"  

My inputs are fine and the command runs. I am using Picard version V-2.21.1-0.

Note: I produced my BAM files using samtools and sorted them using Sambamba.

cat samlist.txt | parallel --max-procs=5

parallel is cool, but you'd be better off using a workflow manager (snakemake, nextflow, etc.).

4.4 years ago

Yes, there is an '@RG' line, but there is no sample (SM) associated with that RG.

SM = Sample. The name of the sample sequenced in this read group.

I beg your pardon for the trouble, but you saved the day. Thanks. It seems my test samples had the SM tag; I somehow forgot to add it to the control one, AD0772_S2_L004_aln.sam.
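
For anyone hitting the same error, one possible way to add the missing tag without re-aligning is to patch the header and apply it with `samtools reheader`. This is only a sketch under stated assumptions: the helper name `add_sm_to_rg`, the sample name `AD0772`, and the output file names are hypothetical and not taken from the thread.

```shell
# Append an SM:<sample> field to every @RG line that lacks one.
# $1 = sample name (hypothetical here; substitute the real sample name).
# Reads SAM header text on stdin and writes the patched header to stdout.
add_sm_to_rg() {
  awk -F'\t' -v OFS='\t' -v sm="$1" '
    /^@RG/ {
      has_sm = 0
      for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) has_sm = 1
      if (!has_sm) $0 = $0 OFS "SM:" sm
    }
    { print }'
}

# Usage (not run here; assumes samtools is on the PATH):
# samtools view -H AD0772_S2_L004_sort.bam | add_sm_to_rg AD0772 > fixed_header.sam
# samtools reheader fixed_header.sam AD0772_S2_L004_sort.bam > AD0772_S2_L004_sort_rg.bam
```

The alternative is to re-run the alignment with an `SM:` field included in the `bwa mem -R` read group string, which fixes the problem at its source.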


