Question: awk specific line from another file
1
gravatar for Hughie
23 days ago by
Hughie10
Hughie10 wrote:

Hello! everyone:

I'm new to linux, here I got a problem:

I have a file file1 like:

3  
6  
7  
9  
12

and file2 which is tab-delimited:

chr1    3052600 3052800 1       E3  
chr1    3052800 3053000 2       E3  
chr1    3059400 3059600 3       E3  
chr1    3059600 3059800 4       E3  
chr1    3059800 3060000 5       E3  
chr1    3062600 3062800 6       E3  
chr1    3101000 3101200 7       E3  
chr1    3105000 3105200 8       E3  
chr1    3105200 3105400 9       E3  
chr1    3116800 3117000 10      E2  
chr1    3117000 3117200 11      E2  
chr1    3164800 3165000 12      E2

and I want to extract the lines in file2 which its 4-th column equal the number in file1 like below:

chr1 3059400 3059600 3 E3   
chr1 3062600 3062800 6 E3     
chr1 3101000 3101200 7 E3   
chr1 3105200 3105400 9 E3   
chr1 3164800 3165000 12 E2

I have spent several hours including wrote a very slow python script, and I searched for the oneline solution, but I got nothing!

awk -v FS="\t" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)'  fiel1 file2

Thanks a lot for some suggestions!

awk formatting • 249 views
ADD COMMENTlink modified 16 days ago • written 23 days ago by Hughie10
1

Hello Hughie,

Please use appropriate tags. Your question is about formatting and awk. That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster.

ADD REPLYlink written 23 days ago by WouterDeCoster30k

Thank you Wouter!
I have revised the tag

ADD REPLYlink written 17 days ago by Hughie10
2

The appropriate tags that Wouter mentioned are formatting and awk, not the one out-of-place formatting and awk tag.

ADD REPLYlink written 17 days ago by Ram16k

Thank you !
Revised again!

ADD REPLYlink written 16 days ago by Hughie10
5
gravatar for 5heikki
23 days ago by
5heikki7.4k
Finland
5heikki7.4k wrote:

This is actually a job for join

join -t $'\t' -1 1 -2 4 -o 2.1,2.2,2.3,2.4,2.5 \
    <(sort -g file1) \
    <(sort -t $'\t' -k4,4g file2) \
    > out
ADD COMMENTlink modified 23 days ago • written 23 days ago by 5heikki7.4k

Thank you 5heikki! This is a new solution

ADD REPLYlink written 17 days ago by Hughie10
5
gravatar for shenwei356
23 days ago by
shenwei3564.0k
China
shenwei3564.0k wrote:

simple with https://github.com/shenwei356/csvtk

csvtk grep -H -t -f 4 -P file1 file2 > result
ADD COMMENTlink written 23 days ago by shenwei3564.0k

Thank you! I will try this

ADD REPLYlink written 17 days ago by Hughie10
5
gravatar for EagleEye
23 days ago by
EagleEye5.5k
Sweden
EagleEye5.5k wrote:

I assume your file2 is TAB-delimited.

sed -i 's/^/\t/' file1.txt
sed -i 's/$/\t/' file1.txt
grep -fF file1.txt file2.txt > combined.txt
ADD COMMENTlink modified 17 days ago • written 23 days ago by EagleEye5.5k
5
gravatar for finswimmer
23 days ago by
finswimmer3.6k
Germany
finswimmer3.6k wrote:

Hello Hughie,

but there are some problem

you should describe what these problems are.

Nevertheless this should work:

$ awk 'NR==FNR {a[$1]; next} $4 in a {print $0}' file1 file2

fin swimmer

ADD COMMENTlink written 23 days ago by finswimmer3.6k
4

Further shortening awk solution:

$ awk 'NR==FNR {a[$1]++} a[$4]' file1 file2
chr1    3059400 3059600 3   E3
chr1    3062600 3062800 6   E3
chr1    3101000 3101200 7   E3
chr1    3105200 3105400 9   E3
chr1    3164800 3165000 12  E2
ADD REPLYlink written 23 days ago by cpad01127.6k

Nice!

Could you please explain why this works?

fin swimmer

ADD REPLYlink written 23 days ago by finswimmer3.6k
2

@finswimmer: {a[$1]++} = makes an array of first column from first file (from NR==FNR). Since it is single column 1 or 0 should not matter and no other commands as well. When it moves on second file, it makes index from column 4 ($4) from second file. Matches with those lines from array from first step (a[$4]), then prints matching lines from file 2.

Actually it is collapsed from your awk solution above:

awk 'NR==FNR {a[$1]; next} $4 in a {print $0}' file1 file2
awk 'NR==FNR {a[$1]++} $4 in a {print $0}' file1 file2
awk 'NR==FNR {a[$1]++} $4 in a' file1 file2
awk 'NR==FNR {a[$1]++} a[$4]' file1 file2
ADD REPLYlink modified 23 days ago • written 23 days ago by cpad01127.6k
1

Hello cpad0112,

thanks a lot for this explanation. I didn't know that awk skippes the part after {...} without using next.

fin swimmer

ADD REPLYlink written 23 days ago by finswimmer3.6k
1

Np. You will learn all these tricks from Pierre, Kevin and other stars awk one liners, the way I did.

ADD REPLYlink modified 23 days ago • written 23 days ago by cpad01127.6k
1

I investigate a little bit more on how this works and I think I got it.

NR==FNR {a[$1]++} makes in array with the first column from the first file as index and assign an positiv number to it (increment with ++).

The part after {...} isn't realy skipped. It is the next condition! If it evaluate to true than the line, which is currently processed by awk, is printed. As the first file haven't a 4th column, this always will be false and nothing will be printed. If we reach the second file, a[$4] evaluates to true if the value from the 4th column is an index if a and returns not 0. This is why we need to increment before (a[$1]=1 would be fine as well).

fin swimmer

ADD REPLYlink written 23 days ago by finswimmer3.6k

Hi, this gives weird output with the following example. In f1.txt, I do not have '12' but still the output provides a row with '12'.

cat f1.txt
3
6
7
9

cat f2.txt

chr1    3052600 3052800 1       E3
chr1    3052800 3053000 2       E3
chr1    3059400 3059600 3       E3
chr1    3059600 3059800 4       E3
chr1    3059800 3060000 5       E3
chr1    3062600 3062800 6       E3
chr1    3101000 3101200 7       E3
chr1    3105000 3105200 8       E3
chr1    3105200 3105400 9       E3
chr1    3116800 3117000 10      E2
chr1    3117000 3117200 11      E2
chr1    3164800 3165000 12      E2

awk 'NR==FNR {a[$1]++} a[$4]' f1.txt f2.txt

chr1    3059400 3059600 3       E3
chr1    3062600 3062800 6       E3
chr1    3101000 3101200 7       E3
chr1    3105200 3105400 9       E3
chr1    3164800 3165000 12      E2
ADD REPLYlink written 23 days ago by EagleEye5.5k

In OP, first file has 12

ADD REPLYlink written 23 days ago by cpad01127.6k

Yes but with my example above it did not work. I did not have 12 in my 'f1.txt'.

ADD REPLYlink modified 23 days ago • written 23 days ago by EagleEye5.5k

Oops sorry it works fine. I did a stupid mistake.

ADD REPLYlink written 23 days ago by EagleEye5.5k

Thank you cpad0112 for your nice answer! question sloved!

ADD REPLYlink written 17 days ago by Hughie10

Thank you fin swimmer for your reply!

ADD REPLYlink written 17 days ago by Hughie10
1

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLYlink written 17 days ago by WouterDeCoster30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1097 users visited in the last hour