Reordering Bam files based on chromosome number, when the chromosome format is chr1..etc
1
0
Entering edit mode
3.9 years ago
jparker4 ▴ 20

When trying to call variants I get the error message:

##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1328for more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]

Both my fasta file and my bams have the chromosomes sorted in this order. Therefore I need to re-sort my bams and possibly my fasta to be in numerical order: chr1,chr2 etc.

However, due to the chr in the chromosome name samtools sort does not sort the chromosomes into the desired order. Can anyone help with this sorting?

RNA-Seq SNP • 2.6k views
ADD COMMENT
0
Entering edit mode

It would be useful to mention what program you are using to call variants.

ADD REPLY
0
Entering edit mode

Apologies, that was misleading. It is part of a variant calling pipeline but the actual error occurs when trying to run SplitNCigarReads. The full command I am running is posted below the comment underneath this one.

ADD REPLY
0
Entering edit mode
3.9 years ago

Hello,

in the error message there is a link to the documentation about what is wrong and how you can solve this.

Version-specific alert for GATK 3.5

In version 3.5, we added some beefed-up VCF sequence dictionary validation. Unfortunately, as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.

fin swimmer

(Aargh, third edit)

ADD COMMENT
0
Entering edit mode

Hi, thanks for the reply. Unfortunately the error is occurring when I run SplitNCigarReads, which requires the -U option ALLOW_N_CIGAR_READS. The tool will not let me pass two -U options in, so I'm a bit stuck.

Below is the error causing command:

java -Xmx10G -Djava.io.tmpdir=/fastdata/mbp15jdp/tmp -jar ~/Downloads/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar                    -T SplitNCigarReads                     -R /shared/sudlab1/General/mirror/genomes/hisat/hg19.fa                    -I deduped.dir/GTEX-12WSH-Cerebellum.hisat.bam                     -o split.dir/GTEX-12WSH-Cerebellum.hisat.split.bam                     -rf ReassignOneMappingQuality                     -RMQF 255                     -RMQT 60                     -U ALLOW_N_CIGAR_READS
ADD REPLY
0
Entering edit mode

Hello,

ok than we have to resort your contig (that was my first solution before I saw the above information) even if I have no idea why GATK cares about the current order.

1. Sort your reference

$ samtools faidx reference.fa
$ cut -f1 reference.fa.fai|sort -k1V|parallel -k 'samtools faidx reference.fa {}' > ref_sorted.fa

2. Create a dictionary for the sorted ref

$ java -jar picard.jar CreateSequenceDictionary R=ref_sorted.fa O=ref_sorted.dict

3. Reorder bam file

$ java -jar picard.jar ReorderSam I=original.bam O=reordered.bam R=ref_sorted.fa CREATE_INDEX=TRUE

fin swimmer

ADD REPLY

Login before adding your answer.

Traffic: 1196 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6