Question

How can I fix picard error: Sequences at index 0 don't match

0

Entering edit mode

5.5 years ago

Niek De Klein ★ 2.6k

This is a cross-post from here: https://gatkforums.broadinstitute.org/gatk/discussion/13395/sequences-at-index-0-dont-match-but-using-the-same-reference-genome#latest

I am trying to run CollectMultipleMetrics from Picard tools on a CRAM file but I get an "Sequences at index 0 don't match, but using the same reference genome". What can I do to resolve this?

Below what I have tried so far:

I am trying to run CollectMultipleMetrics on a CRAM file with the following code:

java -jar picard.jar CollectMultipleMetrics I=SRR1378155_SRR1378155.cram O=SRR1378155_SRR1378155.multiplemetrics R=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa

but I get the error

htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 0 don't match: 0/248956422/chr1/M5=2648ae1bacce4ec4b6cf337dcae37816/UR=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa 0/248956422/chr1/M5=1f9b4bfb2d6193a45e52901b8aa4339e/UR=file:/apps/data/ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.gz

I am not sure where the "UR=file:/apps/data/ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.gz" comes from, as when I look in the CRAM header it shows "UR:/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa" for the contigs.

I tried resorting using the reference from last post here: http://seqanswers.com/forums/archive/index.php/t-14635.html with

 java -jar picard.jar ReorderSam INPUT=SRR1378155_SRR1378155.cram  OUTPUT=SRR1378155_SRR1378155.sorted.cram REFERENCE=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa

And tried it suing the .gzipped reference file as well, but for both I got the same sequence don't match error.

When I use ValidateSAM on the CRAM file, I get "Exception in thread "main" htsjdk.samtools.cram.CRAMException: Contig chr1 not found in the reference file.", from answer here https://gatkforums.broadinstitute.org/gatk/discussion/7944/can-validatesamfile-check-on-cram-files I converted my CRAM to BAM file and then ran validateSAM, which gave me:

HISTOGRAM java.lang.String

Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 124248788

I then added the read group to the BAM file and after that there are no errors found in the BAM file, but when I try to use CollectMultipleMetrics I again get the "Sequences at index 0 don't match" error.

First couple of lines with CRAM header, shows UR:/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa:

@HD VN:1.5 SO:coordinate
@PG ID:STAR PN:STAR VN:STAR_2.5.1b CL:STAR --runThreadN 8 --genomeDir /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/STAR/2.5.1b-foss-2015b/ --genomeLoad NoSharedMemory --readFilesIn /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_1.fastq.gz /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix /local/5210636/SRR1378155_SRR1378155. --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMismatchNmax 6 --quantMode GeneCounts --twopassMode Basic
@PG ID:scramble PN:scramble PP:STAR VN:1.14.6 CL:scramble -I bam -O cram -r /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa -t 2 /local/5211312/SRR1378155_SRR1378155.fixmates.bam /groups//umcg-biogen/tmp04/biogen/input/GTEx/pipelines/results///cramFiles/SRR1378155_SRR1378155.cram
@CO user command line: STAR --outFileNamePrefix /local/5210636/SRR1378155_SRR1378155. --readFilesIn /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_1.fastq.gz /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_2.fastq.gz --readFilesCommand zcat --genomeDir /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/STAR/2.5.1b-foss-2015b/ --genomeLoad NoSharedMemory --runThreadN 8 --outFilterMultimapNmax 1 --outFilterMismatchNmax 6 --twopassMode Basic --quantMode GeneCounts --outSAMunmapped Within
@SQ SN:chr1 LN:248956422 M5:2648ae1bacce4ec4b6cf337dcae37816 UR:/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa

picard GATK • 4.0k views

ADD COMMENT • link 5.5 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Hello,

R=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa

Is there realy a folder called appsat the root level? What's the output of ls -l /apps?
You have two / in the path (Don't know if this is a problem).

fin swimmer

ADD REPLY • link 5.5 years ago by finswimmer 16k

0

Entering edit mode

Yes the path exists and the double / is not a problem. If the reference file can not be found it gives this error: Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: /fake/path/genome.fa

ADD REPLY • link 5.5 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Hello again,

could you post the content of the index file for the reference genome (the one ending wit .fai)? If not already done index it by samtools faidx /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa

fin swimmer

ADD REPLY • link 5.5 years ago by finswimmer 16k

0

Entering edit mode

head -n30 /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.fai

chr1    248956422   8   60  61
chr2    242193529   253105712   60  61
chr3    198295559   499335808   60  61
chr4    190214555   700936301   60  61
chr5    181538259   894321107   60  61
chr6    170805979   1078885012  60  61
chr7    159345973   1252537766  60  61
chr8    145138636   1414539514  60  61
chr9    138394717   1562097136  60  61
chr10   133797422   1702798442  60  61
chr11   135086622   1838825832  60  61
chr12   133275309   1976163908  60  61
chr13   114364328   2111660483  60  61
chr14   107043718   2227930894  60  61
chr15   101991189   2336758684  60  61
chr16   90338345    2440449737  60  61
chr17   83257441    2532293732  60  61
chr18   80373285    2616938808  60  61
chr19   58617616    2698651658  60  61
chr20   64444167    2758246245  60  61
chr21   46709983    2823764492  60  61
chr22   50818468    2871252985  60  61
chrX    156040895   2922918436  60  61
chrY    57227415    3081560021  60  61
chrM    16569   3139741236  60  61
GL000008.2  209709  3139758105  60  61
GL000009.2  201709  3139971333  60  61
GL000194.1  191469  3140176427  60  61
GL000195.1  182896  3140371111  60  61
GL000205.2  185591  3140557079  60  61

ADD REPLY • link 5.5 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

This looks good.

The documentation for CollectMultipleMetrics doesn't mention that it supports cram. But you wrote you also have converted it to bam and added a ReadGroup to the header. Could you please provide the complete commands you have used and show the complete header of the final bam?

fin swimmer

ADD REPLY • link 5.5 years ago by finswimmer 16k

1

Entering edit mode

The problem was in the .dict file, which was apparently made using the gzipped reference genome fasta, while the CRAM was made using the unzipped reference genome fasta. I remade the .dict file with the unzipped genome fasta file and now it works.

Thanks for the help!

ADD REPLY • link 5.5 years ago by Niek De Klein ★ 2.6k