Question

mm10: Sequence length difference between interval file and reference


6.1 years ago

NB ▴ 960

Hello, I have mouse exome data, for which I downloaded the reference from UCSC. I am trying to create an interval file from the Agilent SureSelect BED file.

This is the command I am using

 java -jar  picard.jar BedToIntervalList I=S0276129_Covered.bed O=S0276129_Covered.interval_list SD=mm10_genome.dict

This is the error I get

chr1 was past the end: 195471971 < 196469947
chr5 was past the end 151834684 < 151842168
chr7 was past the end: 145441459 < 145451439
chr8 was past the end: 129401213 < 129458847
chr12 was past the end: 120129022 < 120129244
chr14 was past the end: 124902244 < 125075837
chr16 was past the end: 98207768 < 98218510
chr17 was past the end: 94987271 < 95126542
chr18 was past the end: 90702639 < 90702728

Any idea why the sequence lengths differ between the dictionary file and the interval file, and how I can correct this?

many thanks,

mm10 sequence length ucsc mouse reference • 1.4k views


score 2 · Accepted Answer · 2018-03-13


6.1 years ago

Pierre Lindenbaum 161k

the error is here: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/util/BedToIntervalList.java#L158

Maybe you're using the wrong mm10_genome.dict, or S0276129_Covered.bed contains coordinates that overflow the sequence lengths in the 'dict' file.

Use awk to remove the bad lines?

e.g:

awk '!(($1=="chr1" && int($3) > 195471971) || ($1=="chr5" && int($3) > 151834684))' in.bed > out.bed
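Rather than hard-coding one length per chromosome, the same filter can read the lengths from the dictionary itself. This is only a minimal sketch: `filter_bed_by_dict` is a hypothetical helper name, and it assumes the dictionary has the usual `@SQ` lines with `SN:` and `LN:` fields and that the BED is whitespace-delimited with the end coordinate in column 3. It passes `track`, `browser` and `#` header lines through, and it also drops records on contigs that are missing from the dictionary.

```shell
# Hypothetical helper: keep only BED records that fit inside the contigs
# declared in a sequence dictionary (.dict).
filter_bed_by_dict() {
  awk '
    # First file (the .dict): remember the length (LN) of each contig (SN).
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          if ($i ~ /^LN:/) len = substr($i, 4)
        }
        if (name != "" && len != "") contig_len[name] = len
      }
      next
    }
    # Second file (the BED): pass header lines through unchanged.
    /^(track|browser|#)/ { print; next }
    # Keep a record only if its contig is known and its end fits.
    ($1 in contig_len) && ($3 + 0 <= contig_len[$1] + 0)
  ' "$1" "$2"
}
```

It would be called as `filter_bed_by_dict mm10_genome.dict in.bed > out.bed`. An end coordinate equal to the contig length is kept, because BED ends are exclusive.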



Thanks Pierre. The sequence dictionary was created from the FASTA file with Picard's CreateSequenceDictionary. Is there any possibility of that going wrong?

Also, the bad lines can be removed, but will this affect the "target areas" for variant calling?

ADD REPLY • link 6.1 years ago by NB ▴ 960


> The sequence dictionary was created from the FASTA file with Picard's CreateSequenceDictionary. Is there any possibility of that going wrong?

No, so the problem would come from your BED file (is it mm10?)
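One way to check the BED against the dictionary is to compare, per contig, the largest end coordinate in the BED with the length declared in the dictionary. The sketch below is an illustration only: `report_bed_overflow` is a hypothetical helper name, and it assumes the usual `@SQ`/`SN:`/`LN:` layout of the dictionary and a whitespace-delimited BED. If many contigs overflow, a genome build mismatch is worth ruling out before trimming anything.

```shell
# Hypothetical helper: for each contig in the BED, print
#   contig <TAB> length in dict <TAB> largest BED end <TAB> status
report_bed_overflow() {
  awk '
    # First file (the .dict): remember the length (LN) of each contig (SN).
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          if ($i ~ /^LN:/) len = substr($i, 4)
        }
        if (name != "" && len != "") contig_len[name] = len
      }
      next
    }
    # Second file (the BED): skip header lines and incomplete records.
    /^(track|browser|#)/ { next }
    NF < 3 { next }
    # Track the largest end coordinate seen on each contig.
    { if (!($1 in max_end) || $3 + 0 > max_end[$1] + 0) max_end[$1] = $3 }
    END {
      for (c in max_end) {
        if (!(c in contig_len))
          printf "%s\tNA\t%s\tmissing_from_dict\n", c, max_end[c]
        else
          printf "%s\t%s\t%s\t%s\n", c, contig_len[c], max_end[c], (max_end[c] + 0 > contig_len[c] + 0 ? "overflow" : "ok")
      }
    }
  ' "$1" "$2" | LC_ALL=C sort
}
```

It would be called as `report_bed_overflow mm10_genome.dict S0276129_Covered.bed`.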

> Also, the bad lines can be removed, but will this affect the "target areas" for variant calling?

We have no idea how you're going to use this interval file. Which tool?

You can always trim the BED file.

(...) if($1=="chr1") printf("%s\t%d\t%d\n",$1,$2,($3 <= 195471971 ?$3:195471971));(....)
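The trimming idea can also be driven by the dictionary, so that every contig is handled without listing lengths by hand. Again a sketch under assumptions: `clip_bed_to_dict` is a hypothetical helper name, and the dictionary is assumed to have `@SQ` lines with `SN:` and `LN:` fields. Unlike the fragment above, it drops intervals that start at or beyond the end of the contig (clipping them would leave nothing) and intervals on contigs absent from the dictionary.

```shell
# Hypothetical helper: clip BED end coordinates to the contig lengths
# declared in a sequence dictionary (.dict).
clip_bed_to_dict() {
  awk '
    BEGIN { OFS = "\t" }
    # First file (the .dict): remember the length (LN) of each contig (SN).
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          if ($i ~ /^LN:/) len = substr($i, 4)
        }
        if (name != "" && len != "") contig_len[name] = len
      }
      next
    }
    # Second file (the BED): pass header lines through unchanged.
    /^(track|browser|#)/ { print; next }
    # Contig not in the dictionary: drop the record.
    !($1 in contig_len) { next }
    # Interval starts at or past the contig end: nothing left to keep.
    $2 + 0 >= contig_len[$1] + 0 { next }
    # Otherwise clamp the end to the contig length and print.
    {
      if ($3 + 0 > contig_len[$1] + 0) $3 = contig_len[$1]
      print
    }
  ' "$1" "$2"
}
```

It would be called as `clip_bed_to_dict mm10_genome.dict in.bed > clipped.bed`. Records that are clipped are rewritten tab-separated; all other records are printed as they were.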

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k


Yes, it's mm10. This interval file is being prepared for GATK variant calling.

ADD REPLY • link 6.1 years ago by NB ▴ 960