Extract rows if 2 column values have 21 in between
8.2 years ago
waqasnayab ▴ 250

Hi,

I have a space separated file like this:

start   end
23  36
15  34
7   15
6   15
6   25
21  29
34  41
23  39
22  28
21  29

and I want only those lines where the two column values have 21 in between them. The desired output would be:

start end
15    34
6      15
6      25
21    29

I searched, and there are examples of filtering on a column value being greater or less than some number, but not a scenario like this.

Any help appreciated.

Waqas.

bed sequence • 1.8k views

What do you mean by "21 in between"? Since you list the interval "6 15" in your desired output I don't understand it.


Changed the formatting. I guess the OP wants to return the intervals that contain a given position. However, [6,15] doesn't contain 21, so the example is wrong. Otherwise this looks like a simple case of interval arithmetic. I would like to ask about the biological application, since it determines which method is best.


In case you need to search for different locations more than once in a large set of intervals: What Is The Quickest Algorithm For Range Overlap?

8.2 years ago

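If your (whitespace-delimited) text file does not have a header row, this keeps the rows whose interval contains 21, counting a start of exactly 21 as a match (as in the row "21 29" of the desired output):

$ awk '($1 <= 21) && ($2 > 21)' data.txt > answer.txt

If your text file has a header row ("start" and "end"):

$ tail -n +2 data.txt | awk '($1 <= 21) && ($2 > 21)' > answer.txt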
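
Whether a start of exactly 21 counts as "in between" changes the result for the table in the question: the row "21 29" is kept only if the start comparison is inclusive. Here is a minimal Python sketch of the same linear filter that makes the choice explicit. The function name `rows_containing` and the `inclusive_start` flag are made up for illustration, and the data is the table from the question.

```python
def rows_containing(lines, position, inclusive_start=True):
    """Return (start, end) pairs whose interval contains `position`.

    The end is always treated as exclusive (end > position); the start is
    inclusive or strict depending on `inclusive_start`.
    """
    hits = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and the "start end" header row.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        start, end = int(fields[0]), int(fields[1])
        starts_ok = start <= position if inclusive_start else start < position
        if starts_ok and end > position:
            hits.append((start, end))
    return hits


# The table from the question, header included.
DATA = """start end
23 36
15 34
7 15
6 15
6 25
21 29
34 41
23 39
22 28
21 29
"""

inclusive = rows_containing(DATA.splitlines(), 21)
strict = rows_containing(DATA.splitlines(), 21, inclusive_start=False)
print(inclusive)
print(strict)
```

On this table the inclusive version returns 15 34, 6 25 and both copies of 21 29, while the strict version returns only 15 34 and 6 25. Neither returns 6 15, which matches the comment above that this row does not contain 21.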

Since you tagged your question with the BED tag, if you're working with a BED file, there is a faster way to do this.

The BEDOPS bedextract tool can do a binary or O(log n) search over a sorted BED file, for instance, whereas a simplistic use of awk (such as the ones I wrote above) will read through the entire file, which is a linear or O(n) search. For large BED files, if sorted, a linear scan is a waste of time.

For example, a search for position 21 along a hypothetical chromosome chrN is much faster this way:

$ echo -e "chrN\t21\t22" | bedextract query.bed - > answer.bed

The file answer.bed contains elements from query.bed that overlap ("contain") position 21 — the half-open genomic region [21, 22) — along chromosome chrN.
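
To illustrate why a sorted file allows a logarithmic search, here is a rough Python sketch using the standard `bisect` module. This is not how bedextract is implemented; it only shows the idea. It assumes a single chromosome, half-open intervals sorted by start, and no interval nested inside another (so the end coordinates are sorted too). The names `build_index` and `query` and the toy intervals are hypothetical, and the toy data is not the table from the question, which does contain nested intervals.

```python
from bisect import bisect_right


def build_index(intervals):
    """Split sorted (start, end) pairs into two sorted coordinate lists.

    Done once up front; each later query is then O(log n).
    """
    starts = [start for start, _ in intervals]
    ends = [end for _, end in intervals]
    return starts, ends


def query(intervals, starts, ends, position):
    """Return the half-open intervals [start, end) that contain `position`.

    Assumes `intervals` is sorted by start and contains no nested intervals,
    so both `starts` and `ends` are in non-decreasing order.
    """
    lo = bisect_right(ends, position)    # index of first interval with end > position
    hi = bisect_right(starts, position)  # one past last interval with start <= position
    return intervals[lo:hi]


# Toy, non-nested intervals sorted by start (made up for this example).
INTERVALS = [(6, 15), (7, 20), (15, 25), (21, 29), (23, 36), (34, 41)]
STARTS, ENDS = build_index(INTERVALS)

print(query(INTERVALS, STARTS, ENDS, 21))
```

With nested intervals the end coordinates are no longer sorted, so this shortcut does not apply as written and the candidates with start <= position would still need to be checked one by one.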

8.2 years ago

You're looking for the AND function, and a basic primer on programming concepts.

In R, the code would look like

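   x <- read.table("data.txt", header = TRUE)  # assumes the question's table is saved as data.txt
   y <- x[x[, 1] <= 21 & x[, 2] > 21, ]        # start <= 21 keeps the "21 29" row from the desired output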