Question

awk : if length to a pattern at column 2 is less than 5

8.0 years ago

mbk0asis ▴ 680

Hi.

I have a multi-column data file.

AGTTAGTTTTAATGTTAATAGT  NCCCAATTTCTAAACAATA 100 263 362 23  19  48.196  48.975
TATTGTTTAGAAATTGGGN CCATACAATACAATAATATACAA 97  344 440 19  23  46.462  47.793
GATGATGATAATGATGATGAT   NTCAATATTCCTCCTCTAT 100 33  132 21  19  48.565  47.848

What I want to do is extract the lines in which the distance from the start to "N" in the 1st and 2nd columns is less than 5.

As a result, the 2nd line should be removed because the "N" in column 1 is too far from the start.

I think code like the one below could work, but I haven't figured out how to get the distance to "N" in each column.

awk -F "\t" '{if(distance_at_column_1 < 5 && distance_at_column_2 < 5) print }' TEST.txt

Thank you!

awk regex • 3.5k views

updated 8.0 years ago by Jorge Amigo 14k • written 8.0 years ago by mbk0asis ▴ 680

Something like this may work (I did not test it).

# index() gives the 1-based position of the first N, or 0 if there is none
awk -F "\t" '{if (index($1,"N")<=5 && index($2,"N")<=5) print $0}' file.txt

8.0 years ago by venu 7.1k

score 3 · Accepted Answer · 2016-05-02

if I understood correctly, you are looking for the base "N" within the first 5 bases of the sequences in the 1st and 2nd columns. if that's the case, here are a few ideas written as one-liners that will do the job.

this one prints all lines where the first 5 bases of the 1st or 2nd column are not all non-N (so it actually looks for an N among them):

perl -lane 'print if $F[0] !~ /^[^N]{5}/ or $F[1] !~ /^[^N]{5}/' test.txt

this one looks for the N position itself, and prints the line if, in both columns, the N is either found within the first 5 bases or not found at all (index returns -1 in that case):

perl -lane 'print if index($F[0],"N") < 5 and index($F[1],"N") < 5' test.txt
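to see the positions this kind of test relies on, here is a minimal awk sketch that just prints where the first N sits in columns 1 and 2. keep in mind that awk's index() is 1-based and returns 0 when there is no N, whereas perl's index is 0-based and returns -1. the helper name n_positions is made up for illustration, and the sample file is rebuilt with only the first two columns of the question's data.

```shell
# Recreate the first two columns of the question's sample rows.
# They are space-separated here; awk's default splitting accepts tabs or spaces.
cat > test.txt <<'EOF'
AGTTAGTTTTAATGTTAATAGT NCCCAATTTCTAAACAATA
TATTGTTTAGAAATTGGGN CCATACAATACAATAATATACAA
GATGATGATAATGATGATGAT NTCAATATTCCTCCTCTAT
EOF

# n_positions (hypothetical helper): print the 1-based position of the
# first N in column 1 and column 2; 0 means the column contains no N.
n_positions() {
    awk '{ print index($1, "N"), index($2, "N") }' "$@"
}

n_positions test.txt
```

on the sample this prints "0 1", "19 0" and "0 1", which shows why only the 2nd line fails: its N in column 1 sits at position 19.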

this one looks for an N in a string made up of the first 5 bases of the 1st and the 2nd columns:

perl -lane '$s = substr($F[0],0,5).substr($F[1],0,5); print if $s =~ /N/' test.txt
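the same substring idea can be sketched in awk, assuming whitespace- or tab-separated columns. note that awk's substr() counts from 1, not from 0 as in perl. the helper name has_early_n is made up for illustration, and this variant prints a line as soon as either column has an N among its first 5 bases.

```shell
# Recreate the first two columns of the question's sample rows.
cat > test.txt <<'EOF'
AGTTAGTTTTAATGTTAATAGT NCCCAATTTCTAAACAATA
TATTGTTTAGAAATTGGGN CCATACAATACAATAATATACAA
GATGATGATAATGATGATGAT NTCAATATTCCTCCTCTAT
EOF

# has_early_n (hypothetical helper): concatenate the first 5 characters
# of column 1 and column 2, then print the line if that string holds an N.
has_early_n() {
    awk '(substr($1, 1, 5) substr($2, 1, 5)) ~ /N/' "$@"
}

has_early_n test.txt
```

on the sample data this keeps the 1st and 3rd lines and drops the 2nd.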

this one (which should be the fastest) looks for an N preceded by a sequence of at most 4 bases, where the \b represents a word boundary and \S any non-blank character (which could be forced to [ACGT] to strictly look for known bases). note that it tests every column on the line, not only the first two:

perl -ne 'print if /\b\S{0,4}N/' test.txt

finally, this is probably the simplified awk alternative to the last perl idea you were looking for, where the \y represents the word boundary (\y and \S are GNU awk extensions, so this needs gawk):

awk '/\y\S{0,4}N/' test.txt
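if gawk is not available, a possible workaround is sketched below. it avoids \y, \S and the {0,4} interval (which some older awk implementations may not support) and anchors the pattern to columns 1 and 2 explicitly. the helper name early_n_portable is made up for illustration, and I am assuming whitespace- or tab-separated columns.

```shell
# Recreate the first two columns of the question's sample rows.
cat > test.txt <<'EOF'
AGTTAGTTTTAATGTTAATAGT NCCCAATTTCTAAACAATA
TATTGTTTAGAAATTGGGN CCATACAATACAATAATATACAA
GATGATGATAATGATGATGAT NTCAATATTCCTCCTCTAT
EOF

# early_n_portable (hypothetical helper): print the line if column 1 or
# column 2 starts with up to 4 non-N characters followed by an N.
# Four optional [^N] atoms stand in for the {0,4} interval.
early_n_portable() {
    awk '$1 ~ /^[^N]?[^N]?[^N]?[^N]?N/ || $2 ~ /^[^N]?[^N]?[^N]?[^N]?N/' "$@"
}

early_n_portable test.txt
```

on the sample data it prints the 1st and 3rd lines, the same result as the gawk one-liner, but it only ever inspects the two sequence columns.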

I just wanted to point out that there are always multiple ways to reach your goals, and that you don't necessarily need to stop thinking about how to do a particular thing even if you have already found an answer. you always have to consider how easy it is to find other solutions (to invest time and not to waste it), how well they will perform, how robust they are, ...