How to build a command to filter an interval using grep on Linux
21 months ago
Alex S ▴ 20

I have a set of data that looks like this:

NK.Chr1:75500000-95000000:28960-29007   NG-unitig0655   97.872  47  1   0   1   47  121009  120963  2.90e-14    80.6
NK.Chr1:75500000-95000000:28960-29007   NG-1DRT-unitig0549  97.872  47  1   0   1   47  623680  623726  2.90e-14    80.6
NK.Chr1:75500000-95000000:28960-29007   NG-1DRT-unitig0278  97.872  47  1   0   1   47  1224581 1224627 2.90e-14    80.6
NK.Chr1:75500000-95000000:28960-29007   NG-1DRT-Chr4    97.872  47  1   0   1   47  8416368 8416414 2.90e-14    80.6
NK.Chr1:75500000-95000000:28960-29007   NG-1DRT-Chr4    97.872  47  1   0   1   47  20041035    20041081    2.90e-14    80.6
NK.Chr1:75500000-95000000:28960-29007   NG-1DRT-Chr4    97.872  47  1   0   1   47  35175472    35175426    2.90e-14    80.6
NK.Chr1:75500000-95000000:28960-29007   NG-1DRT-Chr4    97.872  47  1   0   1   47  56460095    56460049    2.90e-14    80.6

I need to keep only the lines whose interval falls within 0-3900000, considering only the numbers just before NG (the 28960-29007 part of the first column).

grep 'NK.Chr1:75500000-95000000:[0-3900000]' NG.1DRT-blast.out > chr1-blast-NG.txt

I tried this command, but it returned every line containing NK.Chr1:75500000-95000000:, ignoring the range.

Does anyone know how to write a proper command for this?


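Here [0-3900000] is treated as a bracket expression, not a numeric range: it matches a single character that is either in the range 0-3 (0, 1, 2, 3) or is 9 (the trailing zeros just repeat 0). Any line with one of those digits right after the second colon therefore matches.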

$ seq 10 | grep '[0-3900000]'
1
2
3
9
10  # with '1' and '0'
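To make the difference concrete, here is a minimal sketch contrasting the bracket expression with a numeric comparison in awk. The helper name in_range and the test values are made up for illustration; the bounds are the ones from the question.

```shell
# A bracket expression matches exactly ONE character from its set.
# '[0-3900000]' is the set {0,1,2,3,9}, so it cannot express "0 to 3900000".
# Of these four values, only 4500000 and 39 contain a character from the set.
printf '%s\n' 5 45 4500000 39 | grep -c '[0-3900000]'   # prints 2

# A numeric range needs a tool that compares numbers, such as awk.
# Adding 0 forces awk to treat the field as a number.
in_range() {
    awk '($1+0 >= 0 && $1+0 <= 3900000)'
}

printf '%s\n' 5 45 4500000 39 | in_range   # prints 5, 45 and 39
```

grep accepts 4500000 (it contains a 0) and rejects 5 and 45, which is the opposite of what a 0-3900000 filter should do.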
21 months ago
Jeremy ▴ 890

I think the following should work:

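# With -F '[:-]', $5 is "29007<TAB>NG", so +0 forces a numeric comparison.
awk -F '[:-]' '($1 == "NK.Chr1" && $2 == 75500000 && $3 == 95000000 && $4+0 <= 3900000 && $5+0 <= 3900000)' NG.1DRT-blast.out > chr1-blast-NG.txt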
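A variant of the same idea, sketched under the assumption that the BLAST output is tab-separated: it splits only column 1 on ':' and '-', so hyphens in later columns (such as NG-1DRT-Chr4) cannot shift the fields. Note that in awk `==` compares, while a single `=` assigns to the field. The names filter_region and sample are hypothetical, and two of the three test lines are invented for the demonstration.

```shell
# Keep hits whose sub-interval (the numbers after the second ':' in column 1)
# lies within 0-3900000 on NK.Chr1:75500000-95000000.
filter_region() {
    awk -F '\t' '
        {
            # Split column 1 into: name, region start, region end, sub-start, sub-end.
            n = split($1, p, /[:-]/)
        }
        n == 5 && p[1] == "NK.Chr1" && p[2] == "75500000" && p[3] == "95000000" &&
        p[4] + 0 >= 0 && p[5] + 0 <= 3900000
    ' "$@"
}

# Three tab-separated test lines: the first comes from the question, the other
# two are made up (one outside 0-3900000, one from a different Chr group).
sample() {
    printf 'NK.Chr1:75500000-95000000:28960-29007\tNG-1DRT-Chr4\t97.872\n'
    printf 'NK.Chr1:75500000-95000000:5000000-5000047\tNG-1DRT-Chr4\t97.872\n'
    printf 'NK.Chr2:75500000-95000000:28960-29007\tNG-1DRT-Chr4\t97.872\n'
}

sample | filter_region   # prints only the first line
```

On the real file this would be called as `filter_region NG.1DRT-blast.out > chr1-blast-NG.txt`.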

It works perfectly!! Thanks a lot :)

21 months ago
awk -F '[ \t:-]'  '(!($4> 3900000 || $5<0))' < in.blast

Hi Alex, Shenwei, and Pierre,

I couldn't figure this out until I saw Pierre's code, but I think Alex is asking for the following:

awk -F '[:-]'  '($4+0<3900000 && $5+0<3900000)'

I really need NK.Chr1:75500000-95000000: to be part of the command. Both commands also return lines from other Chr groups.
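One possible way to keep the prefix explicit, and to reuse the command for other Chr groups, is to pass the prefix and the upper bound in as parameters. This is only a sketch that assumes tab-separated input; filter_by_prefix is a hypothetical helper name, not something from the thread.

```shell
# filter_by_prefix PREFIX MAX [FILE...]
# PREFIX is matched literally at the start of column 1 (no regex involved),
# and the "start-end" that follows it must lie within 0..MAX.
filter_by_prefix() {
    prefix=$1
    max=$2
    shift 2
    awk -F '\t' -v prefix="$prefix" -v max="$max" '
        index($1, prefix) == 1 {
            # What remains after the prefix is "start-end".
            rest = substr($1, length(prefix) + 1)
            split(rest, se, "-")
            if (se[1] + 0 >= 0 && se[2] + 0 <= max + 0) print
        }
    ' "$@"
}
```

For the data in the question this would be called as `filter_by_prefix 'NK.Chr1:75500000-95000000:' 3900000 NG.1DRT-blast.out > chr1-blast-NG.txt`. Because the prefix is compared with index() rather than a regular expression, the dots in NK.Chr1 are taken literally.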

21 months ago

Try csvtk:

  1. Extract the coordinates (saved as the 13th and 14th columns):

     $ cat data.tsv \
         | csvtk mutate -Ht -f 1 -p '(\d+)-\d+$' \
         | csvtk mutate -Ht -f 1 -p '\d+-(\d+)$' \
         > tmp.tsv
    
     $ csvtk dim -H  tmp.tsv 
     file      num_cols   num_rows
     tmp.tsv         14          7
    
  2. Filter on the range and remove the temporary columns:

     cat tmp.tsv \
         | csvtk filter2 -Ht -f '$13 >= 0 && $14 <= 3900000' \
         | csvtk cut -Ht -f 1-12 \
         > result.tsv
    

All-in-one:

$ cat data.tsv \
    | csvtk mutate -Ht -f 1 -p '(\d+)-\d+$' \
    | csvtk mutate -Ht -f 1 -p '\d+-(\d+)$' \
    | csvtk filter2 -Ht -f '$13 >= 0 && $14 <= 3900000' \
    | csvtk cut -Ht -f 1-12 \
    > result.tsv