We have a WES data set that was generated with the Agilent Mouse exome capture library kit. I wanted to download the target file and, similar to this post, got a folder with several BED files (_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed) and a file named Targets.txt. I am not really sure what each of them is, but my problem goes beyond that.
When I try to run the command:
gatk BedToIntervalList \
-I input/S0276129_Covered.bed \
-O input/S0276129_Covered.intervals \
--SEQUENCE_DICTIONARY ../reference/mm10/mm10.dict
I get the following error:
picard.PicardException: Start on sequence 'chr1' was past the end: 195471971 < 196469947
at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:143)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Based on the message, this tells me that the BED file contains coordinates beyond the length given for chr1 in the dict file.
This is true: when I look at chromosome 1 in the BED file, I see:
grep "chr1\s" input/S0276129_Covered.bed | tail
chr1 196986946 196987186 entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
chr1 196989335 196989485 entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
but the dict file shows:
less ../reference/mm10/Sequence/WholeGenomeFasta/genome.dict
@HD VN:1.0 SO:unsorted
...
@SQ SN:chr1 LN:195471971 UR:file:/illumina/scratch/iGenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa M5:c4ec915e7348d42648eefc1534b71c99
...
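When I search for the gene Cr2, its coordinates are Chromosome 1: 195,136,811-195,176,716, which is within the chr1 length in the dict file but about 1.8 Mb below the positions the BED file lists for the same gene.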
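To see how many intervals fall outside the reference, a check along these lines could be used. This is only a rough sketch: the function names (read_dict_lengths, out_of_range, report) are made up for illustration, and it assumes a tab-separated dict file with SN and LN tags on each @SQ line and a BED file whose first three columns are chromosome, start and end.

```python
def read_dict_lengths(handle):
    """Return {sequence name: length} from the @SQ lines of a .dict file."""
    lengths = {}
    for line in handle:
        if not line.startswith("@SQ"):
            continue
        # Each field looks like TAG:value; split only on the first colon
        # so values such as UR:file:/... stay intact.
        tags = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def out_of_range(bed_handle, lengths):
    """Yield (chrom, start, end) for BED records that use a contig missing
    from the dict or that end past the contig length."""
    for line in bed_handle:
        # Skip blank lines and BED header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if chrom not in lengths or end > lengths[chrom]:
            yield chrom, start, end


def report(bed_path, dict_path):
    """Print every offending interval and return how many were found."""
    with open(dict_path) as dict_handle:
        lengths = read_dict_lengths(dict_handle)
    count = 0
    with open(bed_path) as bed_handle:
        for chrom, start, end in out_of_range(bed_handle, lengths):
            print(chrom, start, end, lengths.get(chrom, "not in dict"))
            count += 1
    return count
```

Calling report("input/S0276129_Covered.bed", "../reference/mm10/mm10.dict") would list each interval that the dict file cannot accommodate, which should show whether only the end of chr1 is affected or whether the mismatch runs through the whole file.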
Is there something wrong with the bed file from Agilent? Any ideas what is happening?
Thanks