Hi all!
I am following the latest GATK blog posts to do WGS data analysis. However, I am stuck at the MarkDuplicates step because of the following error:
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG ID:AD0772_S2_L004 LB:L004 PL:ILLUMINA PU:HK522DSXX; File /scratch/parashar/align_tc/AD0772_S2_L004_sort.bam; Line number 95
at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:258)
at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:46)
at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(SAMTextHeaderCodec.java:358)
at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:168)
at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:110)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:220)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:533)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
I checked my SAM file to make sure the @RG header line was added during alignment, and it was there:
$grep "@RG" AD0772_S2_L004_aln.sam
output: @RG ID:AD0772_S2_L004 LB:L004 PL:ILLUMINA PU:HK522DSXX
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -R @RG\tID:AD0772_S2_L004\tLB:L004\tPL:ILLUMINA\tPU:HK522DSXX /home/parashar/archive/Megha/bwa_0.7.17/hg19.fa /home/parashar/archive/wgs/raw_data/AD0772_S2_L004_R1_001.fastq.gz /home/parashar/archive/wgs/raw_data/AD0772_S2_L004_R2_001.fastq.gz
I also checked the sorted file that I was using as input to MarkDuplicates:
$samtools view AD0772_S2_L004_sort.bam | grep "@RG"
It furnishes the same result.
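For what it's worth, the exception complains about a missing SM tag rather than a missing @RG line, so one way to narrow this down is to look at the header alone and list the read groups that have no SM field. Below is a minimal sketch of that check. `missing_sm` is a hypothetical helper name, not part of samtools or Picard, and it assumes the header fields are tab-separated as in the SAM format.

```shell
# List the IDs of @RG header lines that lack an SM tag.
# Reads SAM header text on stdin; fields are assumed to be tab-separated.
# missing_sm is a hypothetical helper name, not a samtools or Picard command.
missing_sm() {
    awk -F'\t' '
        $1 == "@RG" {
            id = "(no ID)"; has_sm = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)   # remember the read group ID
                if ($i ~ /^SM:/) has_sm = 1           # sample tag is present
            }
            if (!has_sm) print id
        }'
}

# Usage with a real file (requires samtools; -H prints only the header):
#   samtools view -H AD0772_S2_L004_sort.bam | missing_sm
```

If this prints a read group ID, that read group has no sample name, which would match the message in the stack trace.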
I am not able to figure out why this is happening!
The command I used for MarkDuplicates is:
cat samlist.txt | parallel --max-procs=5 "picard MarkDuplicates I={}_sort.bam O={}_dedup.bam M=mark_dup_metrics.txt ASSUME_SORTED=true 2> {}.stderr"
My inputs are fine and the command runs. I am using Picard version V-2.21.1-0.
Note: I produced my sorted files using Sambamba and my BAM files using samtools.
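Since the @RG line shown above carries ID, LB, PL and PU but no SM, one possible fix is to include a sample name in the read group string passed to `bwa mem -R`. The sketch below only builds that string. `make_rg` is a hypothetical helper, and the sample name `AD0772` is an assumption taken from the file prefix, not something confirmed in this post.

```shell
# Build a read-group string for bwa mem -R that includes an SM tag.
# make_rg is a hypothetical helper; the sample name passed in is an assumption.
make_rg() {
    local id=$1 sm=$2 lb=$3 pl=$4 pu=$5
    # bwa mem accepts literal "\t" separators in the -R argument, so
    # printf '%s' is used to emit the backslashes without expanding them.
    printf '%s' "@RG\\tID:${id}\\tSM:${sm}\\tLB:${lb}\\tPL:${pl}\\tPU:${pu}"
}

# Usage (reference and FASTQ paths are placeholders):
#   bwa mem -R "$(make_rg AD0772_S2_L004 AD0772 L004 ILLUMINA HK522DSXX)" \
#       hg19.fa R1.fastq.gz R2.fastq.gz > AD0772_S2_L004_aln.sam
```

For BAM files that already exist, Picard's AddOrReplaceReadGroups or `samtools addreplacerg` can rewrite the read group without realigning.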
parallel is cool, but you'd be better off using a workflow manager (snakemake, nextflow, etc.)