Question: Gatk Has Issues With Sorted Bam File
3
gravatar for Oliver
8.2 years ago by
Oliver230
Berlin, Germany
Oliver230 wrote:

Hello folks,

I try to establish the following pipeline stated here in this board:

http://biostar.stackexchange.com/questions/1269/what-is-the-best-pipeline-for-human-whole-exome-sequencing

Since time changes and so did the parameter for the tools, I had to do some changes to the commands (see the foot of this posting). But the pipeline has issues with one of the last two steps. When I run the IndelGenotyper:

java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelGenotyperV2 -R ../reference/hg19_chr15.fa -I output/xxx.sorted.realigned.bam -o output/xxx_indel.txt -verbose output/xxx_indel_statistics.txt

it breaks with an error message:

Missorted Input SAM/BAM file SAMFileReader{output/xxx.sorted.realigned.bam}: file sorted in coordinate order but coordinate is required; Read XXX:2:1:10032:2323#0 out of order on the contig

and that I may use ReorderSam from the picard tools to get an appropriate sorting. Confusingly, the following step in the pipeline runs well with this bam file:

java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/hg19_chr15.fa -I output/xxx.resorted.realigned.bam -o output/xxx.vcf.calls -stand_call_conf 30.0 -stand_emit_conf 10.0

Question here: Why does the GATK takes this file in one step but reject the same file in the other?

So I used the picard tool ReorderSam to reorder the BAM file after the 7. step in the pipeline:

java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelRealigner -R ../reference/hg19_chr15.fa -I output/xxx.sorted.bam -targetIntervals output/xxx.intervals -o output/xxx.sorted.realigned.bam
java -jar /opt/biosw/picard-tools-1.45/ReorderSam.jar I=output/xxx.sorted.realigned.bam O=output/xxx.resorted.realigned.bam REFERENCE=../reference/hg19_chr15.fa

Sadly, this changed nothing, I keep getting this error message:

Missorted Input SAM/BAM file SAMFileReader{output/xxx.resorted.realigned.bam}: file sorted in coordinate order but coordinate is required; Read XXX:2:1:10032:2323#0 out of order on the contig

I also tried to resort it with the samtools:

samtools sort output/xxx.sorted.realigned.bam output/xxx.resorted.realigned

Now more confusingly the file runs well with the first command which failed in the previous step (IndelRealignerV2) but fails in the last step (UnifiedGenotyper) with this error message:

SAM/BAM file SAMFileReader{output/xxx.resorted.realigned.bam} is malformed: Read XXX:2:1:6409:5134#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK

So I could do two sortings, keeping the old one for the UnifiedGenotyper and sort with the samtools again for the IndeRealignerV2, but I don't think this is an acceptable workaround. So does anyone has an idea what the point of the GATK toolkit is?

As you may have noticed, I took only a part of the whole genome to speed up the pipeline (we want to run it with less data, then we could use the whole set). I can't image that this is the issue since its only about to order some entries. But I'm not sure.

Any help appreciated, Oliver

Here my current pipeline:

 1.  bwa aln -f output/xxx.sai -t 4 ../reference/hg19_chr15.fa fastq/xxx.fastq
 2.  bwa samse -f output/xxx.sam ../reference/hg19_chr15.fa output/xxx.sai fastq/xxx.fastq
 3.  samtools import ../reference/hg19_chr15.fa output/xxx.sam output/xxx.bam
 4.  samtools sort output/xxx.bam output/xxx.sorted
 5.  samtools index output/xxx.sorted.bam output/xxx.sorted.bam.bai
 6.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../reference/hg19_chr15.fa -I output/xxx.sorted.bam -o output/xxx.intervals
 7.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelRealigner -R ../reference/hg19_chr15.fa -I output/xxx.sorted.bam -targetIntervals output/xxx.intervals -o output/xxx.sorted.realigned.bam
(8.a samtools sort output/xxx.sorted.realigned.bam output/xxx.resorted.realigned)
(8.b java -jar /opt/biosw/picard-tools-1.45/ReorderSam.jar I=output/xxx.sorted.realigned.bam O=output/xxx.resorted.realigned.bam REFERENCE=../reference/hg19_chr15.fa)
 9.  samtools index output/xxx.resorted.realigned.bam output/xxx.resorted.realigned.bam.bai
10.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelGenotyperV2 -R ../reference/hg19_chr15.fa -I output/xxx.resorted.realigned.bam -o output/xxx_indel.txt -verbose output/xxx_indel_statistics.txt
11.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/hg19_chr15.fa -I output/xxx.resorted.realigned.bam -o output/xxx.vcf.calls -stand_call_conf 30.0 -stand_emit_conf 10.0
ADD COMMENTlink written 8.2 years ago by Oliver230
8
gravatar for Russh
8.2 years ago by
Russh1.2k
U. Liverpool
Russh1.2k wrote:

Had this yesterday don't use 'samtools sort' use picard tools 'SortSam' at that step, with 'SORT_ORDER=coordinate' you have to use 'coordinate' to get the right header. I believe 'samtools sort' is equivalent to 'picard: SortSam ... SORT_ORDER=queryname' and GATK fails with 'queryname' sorted files. Well, samtools better be equivalent to that method, because it puts 'queryname' in the sort method flag of the resulting .sam file

For me, picard tools is a lot neater than samtools in a lot of ways

For example, Depending on your plans in GATK, you might find that you require 'readgroups' in your bam file which aren't currently present. These can be added, and the bam can be sorted, in one step using the picard tools 'AddOrReplaceReadGroups' program

All the best,

Russ, Liverpool

ADD COMMENTlink written 8.2 years ago by Russh1.2k
2

Sorry, I ploughed into answering your question without reading the final half of it. You definitely need AddOrReplaceReadGroups from picard tools. Use that at an early stage and you should fly through unified genotyper when it comes around

ADD REPLYlink written 8.2 years ago by Russh1.2k
1

you could use it at step 4, because you can use AddOrReplace... to both add read groups, sort your bam (using SORT_ORDER=coordinate) and index the bam at the same time (CREATE_INDEX=TRUE).

I don't think picard likes unmapped reads to be present in your bam, so you might have to filter them out prior to running this step.

I recently used this java -jar ~/tools/picard-tools-1.45/AddOrReplaceReadGroups.jar INPUT=realigned.sorted.temp.bam OUTPUT=realigned.sorted.temp2.bam SORT_ORDER=coordinate RGLB=1 RGPL=solid RGPU=1 RGSM=ID439_01 VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

ADD REPLYlink written 8.2 years ago by Russh1.2k

Hello Russ, thank you for your answer. This AddOrReplaceReadGroups thing worked well, at least with a fraction of the reads (the whole set of reads crashs for some reason I have to find out). Can you tell what it exactly does and what you mean with early stage? We put it right in between 7 and 9 in the above pipeline (without any of the 8* steps). According to the description of the picard tools, it replaces all read groups in the INPUT file with a new read group and assigns all reads to this read group in the OUTPUT. I can't really think of what that exactly means.

ADD REPLYlink written 8.2 years ago by Oliver230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1347 users visited in the last hour