awk : if length to a pattern at column 2 is less than 5
5.6 years ago
mbk0asis ▴ 630

Hi.

I have a multi-column data file.

AGTTAGTTTTAATGTTAATAGT  NCCCAATTTCTAAACAATA 100 263 362 23  19  48.196  48.975
TATTGTTTAGAAATTGGGN CCATACAATACAATAATATACAA 97  344 440 19  23  46.462  47.793
GATGATGATAATGATGATGAT   NTCAATATTCCTCCTCTAT 100 33  132 21  19  48.565  47.848


What I want to do is extract the lines in which the distance from the start of the sequence to "N" in the 1st and 2nd columns is less than 5.

As a result, the 2nd line should be removed, because the "N" in column 1 is too far from the start.

I think code like the one below could work, but I haven't figured out how to get the distance to "N" in each column.

awk -F "\t" '{if(distance_at_column_1 < 5 && distance_at_column_2 < 5) print }' TEST.txt


Thank you!


Something like this may work (I did not test it).

awk -FN '{if(length($1)<5 && (length($2)<5)) print $0}' file.txt

5.6 years ago

If I understood correctly, you only want to accept the base "N" when it falls within the first 5 bases of the sequences in the 1st and 2nd columns. If that's the case, here are a few ideas written as one-liners that will do the job.

This one prints all lines where the first 5 bases are not all non-N (so it actually looks for an N) in the 1st or 2nd column:

perl -lane 'print if $F[0] !~ /^[^N]{5}/ or $F[1] !~ /^[^N]{5}/' test.txt

This one looks for the N position itself, and prints the line if the N is found in the first 5 bases or if it is not found at all:

perl -lane 'print if index($F[0],"N") < 5 and index($F[1],"N") < 5' test.txt

This one looks for an N in a string made up of the first 5 bases of the 1st and the 2nd columns:

perl -lane '$s = substr($F[0],0,5).substr($F[1],0,5); print if $s =~ /N/' test.txt

This one (which should be the fastest) looks for an N preceded by a sequence of at most 4 bases, where \b represents a word boundary and \S any non-blank character (which could be narrowed to [ACGT] to strictly look for known bases):

perl -ne 'print if /\b\S{0,4}N/' test.txt

Finally, this is probably the simplified awk alternative you were looking for, based on the last perl idea, where \y represents the word boundary (in GNU awk):

awk '/\y\S{0,4}N/' test.txt

I just wanted to point out that there are always multiple ways to reach your goal, and that you don't have to stop thinking about how to do a particular thing just because you already found an answer. You should always consider how easy it is to find other solutions (so that you invest time rather than waste it), how well they will perform, how robust they are, and so on.
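
As one more variant in that spirit, here is a minimal sketch that avoids the \y and \S escapes, which are GNU awk extensions, by using the standard index() function instead. It assumes the columns are separated by whitespace (tabs or spaces) and looks only at the first two columns; the function name n_near_start is made up for illustration.

```shell
# Hypothetical helper: print lines with an N within the first 5 bases
# of column 1 or column 2. Reads stdin, or the files given as arguments.
n_near_start() {
    awk '{
        p1 = index($1, "N")   # 1-based position of the first N in column 1, 0 if absent
        p2 = index($2, "N")   # same for column 2
        if ((p1 >= 1 && p1 <= 5) || (p2 >= 1 && p2 <= 5)) print
    }' "$@"
}
```

Called as n_near_start test.txt, this keeps the 1st and 3rd of the three example lines in the question, since only their 2nd column starts with an N.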


Wow, Jorge! How do you know all this stuff? It took me a day just to get a vague idea of how to solve it. You are a genius! Thank you.


Hi, Jorge. I tested all of your one-liners, and they worked well. However, most of the sequences in my actual data don't contain "N", and none of those were extracted. I tried to adapt your code but couldn't figure it out. Could you give me a hint?


You need to account for that new premise. Print the line if:

1. there is no N, OR
2. the N is located in the first 5 bases

Adapting the awk solution, it would be something like:

awk '!/N/ || /\y\S{0,4}N/' test.txt
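
Note that the one-liner above tests the line as a whole, so a line with an N near the start of one column and a late N in the other column would still be printed. If the rule should hold for each of the two columns separately, a minimal sketch with the standard index() function could look like the following. It assumes whitespace-separated columns, and the function name n_absent_or_near_start is made up for illustration.

```shell
# Hypothetical helper: print lines where, in BOTH column 1 and column 2,
# the first N is either absent or within the first 5 bases.
n_absent_or_near_start() {
    awk '{
        p1 = index($1, "N")   # 0 if column 1 has no N, else 1-based position of the first N
        p2 = index($2, "N")   # same for column 2
        if (p1 <= 5 && p2 <= 5) print   # 0 (no N) also satisfies <= 5
    }' "$@"
}
```

It can be called as n_absent_or_near_start test.txt. Because index() returns 0 when there is no N, the "no N" case and the "N in the first 5 bases" case collapse into a single comparison per column.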