I need to extract exactly 50bp upstream and downstream of a class of features dispersed across the genome, concatenate every pair to make a chimeric DNA sequence, up to 100bp in length.
Another biostars forum page suggested using a "slop | subtract | getfasta" pipeline of bedtools.
slopBed -s -i orig.bed -g genome.fai -b 50 > SLOP50.bed
slop works OK but
subtractBed -s -a <slop.bed> -b <original.bed>
yields output that appears to overlap by 1bp across all lines.
And at subtract tools page - https://bedtools.readthedocs.io/en/latest/content/tools/subtract.html the examples also show this overlap by 1bp.
$ cat A.bed
chr1 10 20
chr1 100 200
$ cat B.bed
chr1 0 30
chr1 180 300
$ bedtools subtract -a A.bed -b B.bed
chr1 100 180
But shouldn't the output, in this example, ideally be
bedtools subtract -a A.bed -b B.bed
chr1 100 179
Or am I misunderstanding something?
If not, then my chimera will have 2 additional nt - 1 each from each flanking end, which is not desirable.
So, is there a simple workaround or add-on to achieve what I have set out to do - make my 2 * 50bp chimeric DNA sequence? Thanks!
This is cross posted at https://github.com/arq5x/bedtools2/issues/680. So feel free to answer here or there. Thanks!
Actual bed files line snippets are shown below
ORIG
Chr_01 342230 360778 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 1494010 1496609 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
SLOP50
Chr_01 342180 360828 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 1493960 1496659 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
SUBTRACT = SLOP50 - ORIGINAL
Chr_01 342180 342230 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 360778 360828 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 1493960 1494010 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
Chr_01 1496609 1496659 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7