I have two bed files : the first (1) one determinate features on my genome and the second one (2) is the output of the sequencing after cleavage on the same genome.
I want to know if the sequences of (2) are contained in the features of (1).
If they are not, I want to print the sequence in an output file (3).
If they are I want to print the half of the sequences in the same output file (3).
I did a perl script but it takes 25hours to run because I have 16millions sequences in (2).
So I was wondering if someone could give me some clue to build a bash script, shorter and faster than a perl script.
This sounds like a candidate for BEDTools. First, use intersectBed to find which features in BED file2 are contained in a feature from BED file 1. Then run it again using the -v option to get the inverse set of features in BED file 2. You can extract the sequences of the resulting features from a reference using fastaFromBed. I'm not sure what you mean by "print the half of the sequences in the same output file", but the BEDTools programs will deal with the hard part efficiently (minutes rather than 25 hours!) and you can use some Perl to tidy up.