Question

Bedops split bed coordinates with zero based half opened format

1

Entering edit mode

2.9 years ago

greyman ▴ 190

I wonder if I made a mistakes in the code below, I tried to split my long 3k bases promoter sequence extracted using bedops into 1000nt shorter sequence with --chop but it ends up with an extra sequence which only has 1 base in it. I am doing this because STREME seemed to performed better when sequence length not greater 1000 bases according to the tutorial.

Example range of interest:

ND_123.1:74801-76802

bedops --everything promoters.forward.bed promoters.reverse.bed | bedops --chop 1000 -

I have stucked in this piece of code for a few days. Hope to get some help.

Undesired output

ND_123.1:74801-75801

ND_123.1:75801-76801

ND_123.1:76801-76802 <== The one base only sequence

To solve it I think I can probably do the following code on the forward strands:

awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }' gene.bed | bedops --range -$WINDOW:-1 --everything - > promoters.forward.bed

And also change the strand on reverse strands from 0:$WINDOW to 1:$WINDOW ... is this making sense?

DNA bedops STREME • 1.0k views

ADD COMMENT • link updated 2.9 years ago by Ram 43k • written 2.9 years ago by greyman ▴ 190

0

Entering edit mode

Is filtering out small intervals (like 1bp sequence) help?

And I think in your code, for the forward strand you need to subtract 1 from end of the peak

awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print $1, $2, ($3-1), $4, $5, $6; }' gene.bed | bedops --range -$WINDOW:-1 --everything - > promoters.forward.bed

ADD REPLY • link updated 2.9 years ago by Ram 43k • written 2.9 years ago by Nitin Narwade ★ 1.6k

1

Entering edit mode

Hi , I would like to use bedops to make a standardize workflow.

In this post I saw the author of bedops did this before merging both files.

gff2bed < genes.gff > genes.bed \
awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }' genes.bed \
    | bedops --range -400:50 --everything - \
    > promoters.for.bed
awk -vFS="\t" -vOFS="\t" '($6 == "-"){ print $1, $3, ($3 + 1), $4, $5, $6; }' genes.bed \
    | bedops --range -50:400 --everything - \
    > promoters.rev.bed`

ADD REPLY • link updated 2.9 years ago by Ram 43k • written 2.9 years ago by greyman ▴ 190

score 3 · Accepted Answer · 2021-05-30

I do see a one-base snippet:

% echo -e "ND_123.1\t74801\t76802" | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801
ND_123.1    76801   76802

If we treat ND_123.1:74801-76802 as a zero-based-indexed interval, then it is 2001 bases long:

% calc '76802-74801'
    2001

It won't divide into 1kb windows neatly, so something is up with how this interval is generated.

One cheap way to fix this is to use -x with --chop <N>. This extra option filters out intervals that are not N bases long:

% echo -e "ND_123.1\t74801\t76802" | bedops --chop 1000 -x -
ND_123.1    74801   75801
ND_123.1    75801   76801

But that is hiding the problem. I wouldn't use this.

Another is to perhaps pre-process these intervals so that they are sized in 1k units:

% echo -e "ND_123.1\t74801\t76802" | awk -v FS="\t" -v OFS="\t" '{ print $1, $2, ($3 - 1) }' | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801

But that just hides the problem in a different way.

Another way would be to look at what the awk part of the one-liner is doing:

% gff2bed < genes.gff | awk -v FS="\t" -v OFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }'

The gff2bed script changes GFF's one-based index to zero-based in the BED-formatted output. So the above should be a correct way of getting a set of single-base, forward-stranded TSS intervals.

In any case, let's say our TSS of interest is at 0-indexed position ND_123.1:76801:76802. This TSS starts at position 76801 on contig `ND_123.1, is one base in length, and is oriented on the forward strand.

We want the 2kb-sized window upstream of this position, oriented on the forward strand.

There are two different sets of windows we can construct.

One window does not include the TSS interval:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -2000:-1 -                       
ND_123.1    74801   76801

This is the solution you post in your question:

... | bedops --range -$WINDOW:-1 --everything - | ...

The other window includes the TSS interval:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -1999:0 -                       
ND_123.1    74802   76802

Either way, both of these windows are 2kb long. Chopping either of these by 1kb increments gives exactly two 1kb windows:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -1999:0 - | bedops --chop 1000 -
ND_123.1    74802   75802
ND_123.1    75802   76802

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -2000:-1 - | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801

So which of these two solutions you pick depends upon whether you want the TSS interval in the output, or not. That will depend on what you're doing with these 1kb windows, probably.

Note that the --range operation will need to be flipped in the case of reverse-strand oriented TSSs. If we treat the example as reverse strand-oriented, then we might do either of the following:

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range 1:2000 -    
ND_123.1    76802   78802

% echo -e "ND_123.1\t76801\t76802" | bedops -u --range 0:1999 - 
ND_123.1    76801   78801

Which parameters are used depends on whether you want to exclude or include the TSS, respectively.

Tl;dr: If what I posted in a previous answer about making specifically-sized windows is different from the above, then my previous answer is incorrect and should not be used. However, the answer that greyman refers to in his comment will include the TSS and genomic space both upstream and downstream of the TSS. It might be correct in that use case, such as for calculating aggregate signal over the TSS, where it can make sense to define 400 bases upstream, one base for the TSS, and 50 bases downstream. In that case, one might line up signal vectors and take the mean (or some other statistic), and it doesn't really matter in that case that there is an "extra" base. (And, in fact, keeping the TSS in helps with plotting.) What matters here — for your use case — is that you need a precise size of windows for chopping up, so your solution will work.