We have a WES data set that was generated with the Agilent mouse exome capture library kit. I wanted to download the target file and, similar to this post, got a folder with several BED files (_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed) and a file named Targets.txt. I am not really sure what each of them is, but that is not my main problem.
When I try to run the command
gatk BedToIntervalList \
    -I input/S0276129_Covered.bed \
    -O input/S0276129_Covered.intervals \
    --SEQUENCE_DICTIONARY ../reference/mm10/mm10.dict
I get the following error:
picard.PicardException: Start on sequence 'chr1' was past the end: 195471971 < 196469947
    at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:143)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)
Based on the message, the BED file contains an interval on chr1 that starts at 196469947, which is beyond the chr1 length of 195471971 given in the dict file.
This is true. When I look at chromosome 1 in the BED file I see:
grep "chr1\s" input/S0276129_Covered.bed | tail
chr1 196986946 196987186 entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
chr1 196989335 196989485 entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
but the dict file shows:
less ../reference/mm10/Sequence/WholeGenomeFasta/genome.dict
@HD VN:1.0 SO:unsorted
...
@SQ SN:chr1 LN:195471971 UR:file:/illumina/scratch/iGenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa M5:c4ec915e7348d42648eefc1534b71c99
...
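To see how widespread the mismatch is, the same comparison can be made for every interval rather than just the tail of chr1. The following is only a minimal sketch, not part of GATK or Picard: the function names `read_dict_lengths` and `out_of_range` are made up for illustration, and it assumes the BED file may start with `browser`/`track` header lines and that the dict file has standard `@SQ` lines with `SN` and `LN` tags.

```python
from typing import Dict, Iterator, Tuple


def read_dict_lengths(dict_path: str) -> Dict[str, int]:
    """Return contig name -> length, taken from the @SQ lines of a .dict file."""
    lengths: Dict[str, int] = {}
    with open(dict_path) as handle:
        for line in handle:
            if not line.startswith("@SQ"):
                continue
            # Tags look like "SN:chr1" or "UR:file:/path"; split on the first colon only.
            tags = dict(
                field.split(":", 1)
                for field in line.rstrip("\n").split("\t")[1:]
                if ":" in field
            )
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def out_of_range(
    bed_path: str, lengths: Dict[str, int]
) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (chrom, start, end, reason) for BED intervals that do not fit the dict."""
    with open(bed_path) as handle:
        for line in handle:
            # Skip blank lines and header lines (assumed possible in vendor BED files).
            if not line.strip() or line.startswith(("browser", "track", "#")):
                continue
            parts = line.split()
            if len(parts) < 3:
                continue
            chrom, start, end = parts[0], int(parts[1]), int(parts[2])
            if chrom not in lengths:
                yield chrom, start, end, "unknown contig"
            elif end > lengths[chrom]:
                # BED ends are exclusive, so end == length is still valid.
                yield chrom, start, end, "past end"
```

Calling `out_of_range("input/S0276129_Covered.bed", read_dict_lengths("../reference/mm10/mm10.dict"))` and counting the results per chromosome would show whether only a few intervals are off or whether the coordinates are shifted throughout.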
When I search for the gene Cr2, its coordinates are Chromosome 1: 195,136,811-195,176,716, which does not match the Cr2 entries in the BED file (around 196,986,946).
Is there something wrong with the bed file from Agilent? Any ideas what is happening?