mm10: Sequence length difference between interval file and reference
3.6 years ago
Nandini ▴ 930

Hello, I have mouse exome data, for which I downloaded the reference from UCSC. I am trying to create an interval file from the Agilent SureSelect BED file.

This is the command I am using

 java -jar picard.jar BedToIntervalList I=S0276129_Covered.bed O=S0276129_Covered.interval_list SD=mm10_genome.dict


This is the error I get

chr1 was past the end: 195471971 < 196469947
chr5 was past the end 151834684 < 151842168
chr7 was past the end: 145441459 < 145451439
chr8 was past the end: 129401213 < 129458847
chr12 was past the end: 120129022 < 120129244
chr14 was past the end: 124902244 < 125075837
chr16 was past the end: 98207768 < 98218510
chr17 was past the end: 94987271 < 95126542
chr18 was past the end: 90702639 < 90702728


Any idea why the sequence lengths differ between the dictionary file and the interval file, and how I can correct this?

many thanks,

3.6 years ago

Maybe you're using the wrong mm10_genome.dict, or S0276129_Covered.bed contains coordinates that run past the sequence lengths in the 'dict' file.
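To see which records are involved before changing anything, a check along these lines might help. It is only a sketch: `bed_past_end` is a made-up helper name, and it assumes the dictionary is a Picard-style file with tab-separated `@SQ` lines carrying `SN:` and `LN:` tags, and that the BED file is tab-separated.

```shell
# Hypothetical helper: report BED records that do not fit the sequence dictionary.
# Usage: bed_past_end genome.dict targets.bed
bed_past_end() {
    awk -F'\t' '
        # First file (the .dict): remember the length of every contig.
        FNR == NR {
            if ($1 == "@SQ") {
                name = ""; len = ""
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4)
                }
                if (name != "") ctg_len[name] = len
            }
            next
        }
        # Second file (the BED): ignore header and comment lines.
        /^(track|browser|#)/ { next }
        # Contig name missing from the dictionary.
        !($1 in ctg_len) { print "unknown contig\t" $0; next }
        # BED ends are exclusive, so an end equal to the contig length is still valid.
        ($3 + 0) > (ctg_len[$1] + 0) { print "past end (" ctg_len[$1] ")\t" $0 }
    ' "$1" "$2"
}
```

With the files from the question this would be called as `bed_past_end mm10_genome.dict S0276129_Covered.bed`. If many records on many chromosomes show up, a genome-build mismatch between the BED file and the reference is worth ruling out first.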

You could use awk to remove the bad lines.

e.g:

awk '(($1=="chr1" && int($3) <= 195471971) || ($1=="chr5" && int($3) <= 151834684))' in.bed > out.bed
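The one-liner above names each chromosome by hand, so records on any chromosome not listed in the condition are dropped as well. A more general variant could read the limits from the dictionary instead. The sketch below assumes a tab-separated BED file and a Picard-style `.dict` with `SN:` and `LN:` tags on its `@SQ` lines; `bed_drop_past_end` is a hypothetical name.

```shell
# Hypothetical helper: keep only BED records that fit inside the dictionary.
# Usage: bed_drop_past_end genome.dict in.bed > out.bed
bed_drop_past_end() {
    awk -F'\t' '
        # First file (the .dict): remember the length of every contig.
        FNR == NR {
            if ($1 == "@SQ") {
                name = ""; len = ""
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4)
                }
                if (name != "") ctg_len[name] = len
            }
            next
        }
        # Second file (the BED): print a record only if its contig is known
        # and its end does not exceed the contig length. Header lines and
        # unknown contigs fail the first test and are dropped too.
        ($1 in ctg_len) && ($3 + 0) <= (ctg_len[$1] + 0)
    ' "$1" "$2"
}
```

Note that this removes whole targets, including the part of each interval that lies inside the chromosome.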

Thanks Pierre. The sequence dictionary was created from the FASTA file with Picard's CreateSequenceDictionary. Is there a possibility of that going wrong?

Also, the bad lines can be removed, but will this affect the "target areas" for variant calling?

> The sequence dictionary has been created using the fasta file and picard's CreateSequenceDictionary function, is there a possibility of that going wrong ?

No, so the problem would come from your BED file (is it mm10?).

> Also, the bad lines can be removed but will this affect the "target areas" for variant calling ?

We have no idea how you're going to use this interval file. Which tool?

You can always trim the BED file.

(...) if($1=="chr1") printf("%s\t%d\t%d\n",$1,$2,($3 <= 195471971 ? $3 : 195471971)); (....)
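The trimming idea can be extended to every chromosome by taking the limits from the dictionary. What follows is a rough sketch, not a tested pipeline step: `bed_clamp_to_dict` is a hypothetical name, the input is assumed to be a tab-separated BED file, and the `.dict` is assumed to have Picard-style `@SQ` lines with `SN:` and `LN:` tags. Unlike the `printf` above, it keeps any extra BED columns.

```shell
# Hypothetical helper: clamp BED end coordinates to the contig lengths in a .dict.
# Usage: bed_clamp_to_dict genome.dict in.bed > trimmed.bed
bed_clamp_to_dict() {
    awk -F'\t' -v OFS='\t' '
        # First file (the .dict): remember the length of every contig.
        FNR == NR {
            if ($1 == "@SQ") {
                name = ""; len = ""
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4)
                }
                if (name != "") ctg_len[name] = len
            }
            next
        }
        # Second file (the BED): drop headers and contigs missing from the dictionary.
        !($1 in ctg_len) { next }
        # A record starting at or beyond the contig end has nothing left to keep.
        ($2 + 0) >= (ctg_len[$1] + 0) { next }
        # Trim an end that runs past the contig.
        ($3 + 0) > (ctg_len[$1] + 0) { $3 = ctg_len[$1] }
        { print }
    ' "$1" "$2"
}
```

Trimming keeps the in-range part of each target, so less of the capture region is lost than when whole lines are removed.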

Yes, it's mm10. This interval file is being prepared for GATK variant calling.