We have a WES data set that was generated with the Agilent mouse exome capture library kit. When I downloaded the target file I got, similar to this post, a folder with several BED files (_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed) and a file named Targets.txt. I am not really sure what each of them is, but my problem goes beyond that.
When I try to run the command
gatk BedToIntervalList \
  -I input/S0276129_Covered.bed \
  -O input/S0276129_Covered.intervals \
  --SEQUENCE_DICTIONARY ../reference/mm10/mm10.dict
I get the following error:
picard.PicardException: Start on sequence 'chr1' was past the end: 195471971 < 196469947
        at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:143)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)
Based on the message, the BED file contains chr1 coordinates that lie beyond the chr1 length given in the dict file.
This is true: when I look at chromosome 1 in the BED file I see:
grep "chr1\s" input/S0276129_Covered.bed |  tail
chr1    196986946       196987186       entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
chr1    196989335       196989485       entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
but the dict file shows
less ../reference/mm10/Sequence/WholeGenomeFasta/genome.dict 
@HD     VN:1.0  SO:unsorted
...
@SQ     SN:chr1 LN:195471971    UR:file:/illumina/scratch/iGenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa   M5:c4ec915e7348d42648eefc1534b71c99
...
When I look up the gene Cr2, however, its coordinates are Chromosome 1: 195,136,811-195,176,716, well below the positions the BED file reports for it.
Is there something wrong with the bed file from Agilent?
Any ideas what is happening?
Thanks
I have this same issue - did you ever end up finding a fix?
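Wrong reference genome. The BED file places Cr2 near 196.99 Mb on chr1 while mm10 places it near 195.14 Mb, so the BED was most likely designed on a different assembly than mm10 (possibly an older build such as mm9; check the design's documentation). Either use the reference build the BED was designed on, or lift the BED over to mm10 first.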
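A quick way to confirm a build mismatch is to compare the largest end coordinate per contig in the BED file with the LN value of the matching @SQ line in the dict file. The following is only a minimal sketch: check_bed_against_dict is a hypothetical helper name, and it assumes a standard BED layout (contig in column 1, end in column 3) and a SAM-style sequence dictionary. It prints contigs that run past the dictionary length (PAST_END) and contigs that do not appear in the dictionary at all (MISSING_FROM_DICT), which is what a naming mismatch such as chr1 vs 1 would look like.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK or Picard): report BED contigs that
# are absent from a sequence dictionary or extend past its recorded length.
check_bed_against_dict() {
  local bed="$1" dict="$2"
  awk '
    # First file (dict): remember the length of every @SQ contig.
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          if ($i ~ /^LN:/) len = substr($i, 4)
        }
        if (name != "") dict_len[name] = len + 0
      }
      next
    }
    # Second file (BED): skip header lines, track the largest end per contig.
    /^(track|browser|#)/ { next }
    NF >= 3 && ($3 + 0) > max_end[$1] { max_end[$1] = $3 + 0 }
    END {
      for (c in max_end) {
        if (!(c in dict_len))
          printf "%s\tMISSING_FROM_DICT\t%d\n", c, max_end[c]
        else if (max_end[c] > dict_len[c])
          printf "%s\tPAST_END\t%d\t%d\n", c, max_end[c], dict_len[c]
      }
    }
  ' "$dict" "$bed" | LC_ALL=C sort
}

# Example call, using the paths from the question:
# check_bed_against_dict input/S0276129_Covered.bed ../reference/mm10/mm10.dict
```

If many contigs show up as PAST_END, or intervals for known genes sit at shifted positions, a different assembly is the likely cause. If everything shows up as MISSING_FROM_DICT, the contig naming convention is the more likely culprit.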
As in an HG38 vs HG39 kind of difference, or as in using the Ensembl version vs the UCSC-provided genome file, since I know they annotate differently? I am not quite sure of the source of the BED files from Agilent.
But after doing some checking, I do believe it's because my dict file is based on the Ensembl genome file while these BED files may be coming from somewhere else.
There is no "HG39". There's hg19, hg38 and T2T-CHM13.
Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate. A moderator has moved your post to the right location this time, please be more careful in the future.