Question

GATK, SAM file doesn't have any read groups defined in the header

6

Entering edit mode

10.7 years ago

Chirag Nepal ★ 2.4k

Hi all,

I have been trying to use Mutect to compare results from Varscan and other tools. To run MuTect, pre-processing from GATK and Picard tools is necessary.

1. Mapped reads using BWA.

2. Convert to sorted BAM using PICARD

java -Xmx4g \
  -Djava.io.tmpdir=/tmp \
  -jar SortSam.jar \
  SO=coordinate \
  INPUT=Trimmed_ERR361938_trimmed_bwa.sam \
  OUTPUT=Test.bam \
  VALIDATION_STRINGENCY=LENIENT \
  CREATE_INDEX=true

3. Mark Duplicates using PICARD

java -Xmx4g \
  -Djava.io.tmpdir=/tmp \
  -jarpicard-tools-1.119/SortSam.jar \
  SO=coordinate \
  INPUT=Trimmed_ERR361938_trimmed_bwa.sam \
  OUTPUT=Test.bam \
  VALIDATION_STRINGENCY=LENIENT \
  CREATE_INDEX=true

4. Realign along INDEL using GATK

java -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /steno-internal/chirag/data/indexGenome/hg19/bwa/hg19.fa \
  -o input.bam.list \
  -I input.marked.bam

NOW I GET ERROR

##### ERROR
##### ERROR MESSAGE: SAM/BAM file input.marked.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
##### ERROR

There is this script which should fix this, but I am not sure of some of the parameter used here,

java -jar ~/unixTools/picard-tools-1.119/AddOrReplaceReadGroups.jar

These parameters need to be used

RGLB=String
LB=String Read Group Library Required.
RGPU=String
PU=String Read Group platform unit (eg. run barcode) Required.
RGSM=String
SM=String Read Group sample name Required.

How do I get information on these parameters, as I am analyzing many published reads.

Are there some other ways to fix this step.

Thanks in advance!

SAM GATK • 29k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Chirag Nepal ★ 2.4k

2

Entering edit mode

From my experience, you'll face three obstacles in succession as you embark on this journey (between steps 3 and 4 from your question). Possible tools you might need as you go:

Picard's AddOrReplaceReadGroups to add read group info to the BAM file
Picard's NormalizeFasta to normalize the fasta ref file
Picard's CreateSequenceDictionary to create a .dict for the fasta ref

ADD REPLY • link 10.7 years ago by Ram 45k

1

Entering edit mode

Thanks !

I am following these steps. Number-3 is fine, have not yet encountered step-2. Step-1 is where i am having problem.

ADD REPLY • link 10.7 years ago by Chirag Nepal ★ 2.4k

1

Entering edit mode

FYI (In case you have not seen it before),

http://gatkforums.broadinstitute.org/discussion/3059/lane-library-sample-and-cohort-what-do-they-mean-and-why-are-they-important

PS: Huh , just realized that its an old post :-)

ADD REPLY • link 10.0 years ago by Ashutosh Pandey 12k

2

Entering edit mode

10.0 years ago

ido.sloma ▴ 20

replace STR with @RG\tID:foo\tSM:bar

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by ido.sloma ▴ 20

0

Entering edit mode

This is a 3 year old solved question :)

ADD REPLY • link 10.0 years ago by Ram 45k

1

Entering edit mode

But, still, thanks to @ido.sloma for sharing the answer.

ADD REPLY • link 10.0 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

Agreed. Just letting them know. Looking at the context now, it probably should be a reply to your comment.

ADD REPLY • link 10.0 years ago by Ram 45k

Ram · Accepted Answer · 2014-10-16

8

Entering edit mode

10.7 years ago

Pierre Lindenbaum 166k

you have to specify the read group from the beginning using the option -R of bwa

-R STR Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]

you can add a group to your current bams using picard AddOrReplaceReadGroups http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

For first option: So, one option is re-mapping again by adding the command: bwa sampe -r STR

For second option: using AddOrReplaceReadGroups, there are 4 parameters that need to be filled.

LB=String Read Group Library Required.
PU=String Read Group platform unit (eg. run barcode) Required.
SM=String Read Group sample name Required.

Can you suggest what parameter should I use, simply to bypass the error

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.7 years ago by Chirag Nepal ★ 2.4k

3

Entering edit mode

I'd go with

RGLB=LaneX RGPU=NONE RGSM=AnySampleName

ADD REPLY • link 5.4 years ago by Ram 45k

0

Entering edit mode

Thanks RamRS !

Will try this.

ADD REPLY • link 10.7 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

HI @RamRS

Now I started picard/GATK right from sam file.

#! /bin/bash

hg19Seq=/steno-internal/chirag/data/indexGenome/hg19/bwa/hg19.fa

echo "Step-1"
#java -Xmx4g -Djava.io.tmpdir=/tmp -jar /usr/local/lib/picard-tools-1.85/SortSam.jar SO=coordinate INPUT=Trimmed_ERR361938_trimmed_bwa.sam OUTPUT=Test.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

echo "Step-2"
#java -jar /usr/local/lib/picard-tools-1.85/AddOrReplaceReadGroups.jar I=Test.bam O=Test_ADDOrReplace.bam RGPL=illumina RGLB=LaneX RGPU=NONE RGSM=AnySampleName

echo "Step-3"
java -Xmx4g -Djava.io.tmpdir=/tmp -jar /usr/local/lib/picard-tools-1.85/MarkDuplicates.jar INPUT=Test_ADDOrReplace.bam OUTPUT=Test_ADDOrReplace_marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

echo "Step-4"
echo "make list for realignment"
#java -Xmx4g -jar ~/unixTools/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T RealignerTargetCreator -R $hg19Seq -I Test_ADDOrReplace_marked.bam -o input.bam.list

echo "Step-5"
echo "realign along INDEL"
#java -Xmx4g -Djava.io.tmpdir=/tmp -jar ~/unixTools/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -R $hg19Seq -T IndelRealigner -targetIntervals input.bam.list -I Test_ADDOrReplace_marked.bam -o Test_ADDOrReplace_marked_realigned.bam

echo "Step-5"
#java -Djava.io.tmpdir=/tmp/flx-auswerter -jar ~/unixTools/picard-tools-1.119/FixMateInformation.jar INPUT=input.marked.realigned.bam OUTPUT=input_bam.marked.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

Get error in Step-3:-

[Fri Oct 17 13:50:53 CEST 2014] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 5.31 minutes.
Runtime.totalMemory()=2648702976
FAQ:  http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page
Exception in thread "main" net.sf.samtools.util.RuntimeEOFException: Premature EOF; BinaryCodec in readmode; file: /NextGenSeqData/project-data/chirag/rawReadsFastQ/chanOn_et_al_2013_liverFluke/Test_ADDOrReplace.bam
    at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:373)
    at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:357)
    at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:200)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:558)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:532)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650)
    at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:386)
    at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)

Any suggestion on how i could solve this.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.7 years ago by Chirag Nepal ★ 2.4k

1

Entering edit mode

Hi,

Looks like your BAM file might be truncated. You might wanna check the Test_ADDOrReplace.bam file using Picard's ValidateSamFile

Manual: http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile

EDIT: I'd suggest checking the Test_ADDOrReplace.bam and if you find it truncated, start validating each file in the pipeline, starting from the raw BAM file you got from bwa.

ADD REPLY • link 3.4 years ago by Ram 45k

0

Entering edit mode

@Pierre:

I tried this:

bwa sampe -r STR $hg19Index ${name%1.fastq}1.sai ${name%1.fastq}2.sai $name ${name%1.fastq}2.fastq > ${name%1.fastq}trimmed_bwa.sam

and get error

[bwa_sai2sam_pe] malformated @RG line

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.7 years ago by Chirag Nepal ★ 2.4k

score 7 · Accepted Answer · 2014-10-16

Hi,

SAM/BAM files need a bit of preprocessing before Picard and GATK can work on them. I'm unable to recall all the steps off the top of my head, but this should help you solve the first problem by adding read groups:

Adding Read Groups To Bam Files

Remember, dummy read group names will suffice to bypass this error.