This is a problem that bugged me for a few days. Figure the solution might be helpful to others:
I was seeing the following error message when I tried to run GATK:
##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reference. ##### ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs. ##### ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files. ##### ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam ##### ERROR reference contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr20, chr21, chr22, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
I understood that I should be able fix this problem using Picard ReorderSam.jar, but the problem was that the .bam file seemed properly ordered yet GATK didn't recognize this to be the case.
Here are the commands that I used for pre-processing:
bwa mem $reference $read1 $read2 > $sam samtools view -bS $sam > $bam #remove unaligned reads samtools view -F 0x04 -b $bam > $filtered_bam"; #sort sample java -jar /opt/picard-tools-1.105/SortSam.jar I=$filtered_bam O=$sorted_bam SORT_ORDER=coordinate CREATE_INDEX=True #MarkDuplicates java -jar /opt/picard-tools-1.105/MarkDuplicates.jar INPUT=$sorted_bam OUTPUT=$nodup_bam METRICS_FILE=$metrics_file REMOVE_DUPLICATES=true #add read groups java -jar /opt/picard-tools-1.105/AddOrReplaceReadGroups.jar INPUT=$nodup_bam OUTPUT=$rg_bam RGLB=1 RGPL=illumina RGPU=barcode RGSM=test CREATE_INDEX=True #reorder sample java -jar /opt/picard-tools-1.105/ReorderSam.jar" I=$rg_bam O=$karyotype_bam REFERENCE=$karotypic_fasta CREATE_INDEX=True #start running GATK java -jar /opt/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference -I $karyotype_bam -o $target_intervals
I converted the final .bam file to a .sam file, and I viewed the beginning of the file to confirm that the chromosomes are in the correct order in the header and I also checked the first few reads. Everything seemed to be OK.
Solution: I was trying to use the Karyotype reference to reorder an alignment from a normal hg19 reference. Running the pipeline from the beginning with only the Karyotype reference provided by GATK solved the problem
#In other words, $karotypic_fasta should be used instead of $reference, so the BWA command should be as follows bwa index -a bwtsw $karotypic_fasta bwa mem $karotypic_fasta $read1 $read2 > $sam