Bedops split bed coordinates with zero based half opened format
1
1
Entering edit mode
8 months ago
greyman ▴ 160

I wonder if I made a mistakes in the code below, I tried to split my long 3k bases promoter sequence extracted using bedops into 1000nt shorter sequence with --chop but it ends up with an extra sequence which only has 1 base in it. I am doing this because STREME seemed to performed better when sequence length not greater 1000 bases according to the tutorial.

Example range of interest:

ND_123.1:74801-76802

bedops --everything promoters.forward.bed promoters.reverse.bed | bedops --chop 1000 -

I have stucked in this piece of code for a few days. Hope to get some help.

Undesired output

ND_123.1:74801-75801

ND_123.1:75801-76801

ND_123.1:76801-76802 <== The one base only sequence

To solve it I think I can probably do the following code on the forward strands:

awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print$1, ($2 - 1),$2, $4,$5, $6; }' gene.bed | bedops --range -$WINDOW:-1 --everything - > promoters.forward.bed

And also change the strand on reverse strands from 0:$WINDOW to 1:$WINDOW ... is this making sense?

DNA bedops STREME • 466 views
0
Entering edit mode

Is filtering out small intervals (like 1bp sequence) help?

And I think in your code, for the forward strand you need to subtract 1 from end of the peak

awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print$1, $2, ($3-1), $4,$5, $6; }' gene.bed | bedops --range -$WINDOW:-1 --everything - > promoters.forward.bed

1
Entering edit mode

Hi , I would like to use bedops to make a standardize workflow.

In this post I saw the author of bedops did this before merging both files.

gff2bed < genes.gff > genes.bed \
awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print$1, ($2 - 1),$2, $4,$5, $6; }' genes.bed \ | bedops --range -400:50 --everything - \ > promoters.for.bed awk -vFS="\t" -vOFS="\t" '($6 == "-"){ print $1,$3, ($3 + 1),$4, $5,$6; }' genes.bed \
| bedops --range -50:400 --everything - \
> promoters.rev.bed

3
Entering edit mode
8 months ago

I do see a one-base snippet:

% echo -e "ND_123.1\t74801\t76802" | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801
ND_123.1    76801   76802


If we treat ND_123.1:74801-76802 as a zero-based-indexed interval, then it is 2001 bases long:

% calc '76802-74801'
2001


It won't divide into 1kb windows neatly, so something is up with how this interval is generated.

One cheap way to fix this is to use -x with --chop <N>. This extra option filters out intervals that are not N bases long:

% echo -e "ND_123.1\t74801\t76802" | bedops --chop 1000 -x -
ND_123.1    74801   75801
ND_123.1    75801   76801


But that is hiding the problem. I wouldn't use this.

Another is to perhaps pre-process these intervals so that they are sized in 1k units:

% echo -e "ND_123.1\t74801\t76802" | awk -v FS="\t" -v OFS="\t" '{ print $1,$2, ($3 - 1) }' | bedops --chop 1000 - ND_123.1 74801 75801 ND_123.1 75801 76801  But that just hides the problem in a different way. Another way would be to look at what the awk part of the one-liner is doing: % gff2bed < genes.gff | awk -v FS="\t" -v OFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2,$4, $5,$6; }'


The gff2bed script changes GFF's one-based index to zero-based in the BED-formatted output. So the above should be a correct way of getting a set of single-base, forward-stranded TSS intervals.

In any case, let's say our TSS of interest is at 0-indexed position ND_123.1:76801:76802. This TSS starts at position 76801 on contig ND_123.1, is one base in length, and is oriented on the forward strand.

We want the 2kb-sized window upstream of this position, oriented on the forward strand.

There are two different sets of windows we can construct.

One window does not include the TSS interval:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -2000:-1 -
ND_123.1    74801   76801


This is the solution you post in your question:

... | bedops --range -\$WINDOW:-1 --everything - | ...


The other window includes the TSS interval:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -1999:0 -
ND_123.1    74802   76802


Either way, both of these windows are 2kb long. Chopping either of these by 1kb increments gives exactly two 1kb windows:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -1999:0 - | bedops --chop 1000 -
ND_123.1    74802   75802
ND_123.1    75802   76802

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -2000:-1 - | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801


So which of these two solutions you pick depends upon whether you want the TSS interval in the output, or not. That will depend on what you're doing with these 1kb windows, probably.

Note that the --range operation will need to be flipped in the case of reverse-strand oriented TSSs. If we treat the example as reverse strand-oriented, then we might do either of the following:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range 1:2000 -
ND_123.1    76802   78802

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range 0:1999 -
ND_123.1    76801   78801


Which parameters are used depends on whether you want to exclude or include the TSS, respectively.

Tl;dr: If what I posted in a previous answer about making specifically-sized windows is different from the above, then my previous answer is incorrect and should not be used. However, the answer that greyman refers to in his comment will include the TSS and genomic space both upstream and downstream of the TSS. It might be correct in that use case, such as for calculating aggregate signal over the TSS, where it can make sense to define 400 bases upstream, one base for the TSS, and 50 bases downstream. In that case, one might line up signal vectors and take the mean (or some other statistic), and it doesn't really matter in that case that there is an "extra" base. (And, in fact, keeping the TSS in helps with plotting.) What matters here — for your use case — is that you need a precise size of windows for chopping up, so your solution will work.

1
Entering edit mode

It is a very detailed explanation, thank you!