Question: mm10: Sequence length difference between interval file and reference
Nandini (London) wrote, 3 months ago:

Hello, I have mouse exome data, for which I downloaded the reference from UCSC. I am trying to create an interval file from the Agilent SureSelect BED file.

This is the command I am using:

java -jar picard.jar BedToIntervalList I=S0276129_Covered.bed O=S0276129_Covered.interval_list SD=mm10_genome.dict

This is the error I get:

chr1 was past the end: 195471971 < 196469947
chr5 was past the end: 151834684 < 151842168
chr7 was past the end: 145441459 < 145451439
chr8 was past the end: 129401213 < 129458847
chr12 was past the end: 120129022 < 120129244
chr14 was past the end: 124902244 < 125075837
chr16 was past the end: 98207768 < 98218510
chr17 was past the end: 94987271 < 95126542
chr18 was past the end: 90702639 < 90702728

Any idea why the sequence lengths differ between the dictionary file and the interval file, and how I can correct this?

Many thanks,

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 3 months ago:

The error is raised here: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/util/BedToIntervalList.java#L158

Maybe you're using the wrong mm10_genome.dict, or S0276129_Covered.bed contains coordinates that run past the sequence lengths in the dict file.

You could use awk to remove the bad lines.

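e.g. (shown for chr1 and chr5 only; as written, lines on any other chromosome are dropped, so each chromosome needs its own condition):

awk '(($1=="chr1" && int($3) <= 195471971) || ($1=="chr5" && int($3) <= 151834684))' in.bed > out.bed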
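A more general filter could read the contig lengths from the dictionary instead of hard-coding them. The sketch below is only an illustration: it assumes mm10_genome.dict is a standard sequence dictionary with tab-separated @SQ lines carrying SN: and LN: tags, that the BED is tab-separated, and that BED ends are exclusive, so an end equal to the contig length is still valid. The function name filter_bed is made up here, and passing track/browser/comment lines through unchanged is a choice, not something the thread specifies.

```shell
# filter_bed DICT BED
# Keep only BED records whose end coordinate fits inside the contig,
# using the LN: lengths found on the @SQ lines of DICT.
filter_bed() {
    awk -F'\t' '
        # First file (the dictionary): collect contig lengths.
        NR == FNR {
            if ($1 == "@SQ") {
                name = ""; len = 0
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4) + 0
                }
                if (name != "") length_of[name] = len
            }
            next
        }
        # Second file (the BED): header lines pass through unchanged.
        /^(track|browser|#)/ { print; next }
        # Records on unknown contigs or running past the end are dropped.
        ($1 in length_of) && ($3 + 0 <= length_of[$1])
    ' "$1" "$2"
}
```

It could then be called as filter_bed mm10_genome.dict S0276129_Covered.bed > filtered.bed, where filtered.bed is a hypothetical output name.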

Thanks Pierre. The sequence dictionary was created from the fasta file with Picard's CreateSequenceDictionary; is there any possibility of that going wrong?

Also, the bad lines can be removed, but will this affect the "target areas" for variant calling?

(reply written 3 months ago by Nandini)

"The sequence dictionary has been created using the fasta file and picard's CreateSequenceDictionary function, is there a possibility of that going wrong ?"

No, so the problem would come from your BED (is it mm10?).
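One way to see how far the BED departs from the dictionary is to count, per contig, the records that run past the end. This is a rough sketch under the same kind of assumptions as any dict-based check: the dictionary has tab-separated @SQ lines with SN: and LN: tags, and the BED is tab-separated. The function name count_overflow is hypothetical.

```shell
# count_overflow DICT BED
# For each contig with at least one BED record ending past the contig length,
# print: contig, dictionary length, largest BED end, number of such records.
count_overflow() {
    awk -F'\t' '
        # First file (the dictionary): collect contig lengths.
        NR == FNR {
            if ($1 == "@SQ") {
                name = ""; len = 0
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4) + 0
                }
                if (name != "") length_of[name] = len
            }
            next
        }
        # Second file (the BED): tally records that overflow their contig.
        ($1 in length_of) && ($3 + 0 > length_of[$1]) {
            bad[$1]++
            if ($3 + 0 > worst[$1]) worst[$1] = $3 + 0
        }
        END {
            for (c in bad)
                printf "%s\t%d\t%d\t%d\n", c, length_of[c], worst[c], bad[c]
        }
    ' "$1" "$2" | sort   # awk array order is unspecified, so sort the report
}
```

A handful of offending records per contig points to a few stray lines, whereas many overflowing records on many contigs would suggest the BED and the dictionary do not describe the same assembly.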

"Also, the bad lines can be removed but will this affect the "target areas" for variant calling ?"

We have no idea how you're going to use this interval file. Which tool?

You can always trim the BED instead, clipping each end coordinate to the chromosome length:

(...) if ($1=="chr1") printf("%s\t%d\t%d\n", $1, $2, ($3 <= 195471971 ? $3 : 195471971)); (...)
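The same trimming idea can be written once for all chromosomes by taking the lengths from the dictionary. This is a sketch, not a tested pipeline step: it assumes tab-separated @SQ lines with SN: and LN: tags in the dictionary and a tab-separated BED, and the function name clip_bed is invented for illustration. Dropping records that start at or beyond the contig end, and leaving records on contigs missing from the dictionary untouched, are choices made here. Unlike the printf fragment above, extra BED columns are preserved.

```shell
# clip_bed DICT BED
# Clip BED end coordinates to the contig lengths given on the @SQ lines of DICT.
clip_bed() {
    awk -F'\t' -v OFS='\t' '
        # First file (the dictionary): collect contig lengths.
        NR == FNR {
            if ($1 == "@SQ") {
                name = ""; len = 0
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4) + 0
                }
                if (name != "") length_of[name] = len
            }
            next
        }
        # Second file (the BED): header lines pass through unchanged.
        /^(track|browser|#)/ { print; next }
        {
            if ($1 in length_of) {
                max = length_of[$1]
                if ($2 + 0 >= max) next      # starts past the contig end: nothing left to keep
                if ($3 + 0 > max) $3 = max   # clip the end coordinate
            }
            print
        }
    ' "$1" "$2"
}
```

For example, clip_bed mm10_genome.dict S0276129_Covered.bed > clipped.bed, with clipped.bed as a hypothetical output name.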
(reply written 3 months ago by Pierre Lindenbaum)

Yes, it's mm10. This interval file is being prepared for GATK variant calling.

(reply written 3 months ago by Nandini)