Question: awk extract certain line
0
gravatar for mxlsherry1992
4 weeks ago by
mxlsherry199230 wrote:

Dear all,

Assuming that I have 2 large files, file 1 looks like this:

TRINITY_DN100000_c1_g1
TRINITY_DN100000_c1_g3
TRINITY_DN100000_c1_g6
TRINITY_DN100000_c2_g1
TRINITY_DN100001_c6_g1
TRINITY_DN100001_c6_g2
TRINITY_DN100006_c1_g1
TRINITY_DN100006_c6_g3
TRINITY_DN100006_c6_g4
TRINITY_DN100006_c6_g5
TRINITY_DN100008_c3_g2
TRINITY_DN100010_c4_g4
TRINITY_DN100011_c5_g3
TRINITY_DN100011_c6_g1
TRINITY_DN100012_c2_g1
TRINITY_DN100013_c0_g1
TRINITY_DN100014_c3_g1
TRINITY_DN100014_c7_g1
TRINITY_DN100014_c8_g1

And file 2 looks like this, the first column of file 2 is similar with file 1, but not totally same:

TRINITY_DN100000_c1_g1_i3   ref|XP_017307867.1| PREDICTED: A disintegrin and metalloproteinase with thrombospondin motifs 3 [Ictalurus punctatus]   108255965   XP_017307867    87.591  137 14  5628    5218    27  160 6.59E-67    250
TRINITY_DN100000_c1_g3_i1   ref|XP_017307867.1| PREDICTED: A disintegrin and metalloproteinase with thrombospondin motifs 3 [Ictalurus punctatus]   108255965   XP_017307867    90.385  52  5   1678    1523    158 209 3.76E-21    101
TRINITY_DN100000_c1_g6_i2   ref|XP_017343211.1| PREDICTED: G0/G1 switch protein 2 [Ictalurus punctatus] 108276227   XP_017343211    88.288  111 13  714 382 1   111 2.34E-65    204
TRINITY_DN100000_c2_g1_i1   ref|XP_017307867.1| PREDICTED: A disintegrin and metalloproteinase with thrombospondin motifs 3 [Ictalurus punctatus]   108255965   XP_017307867    100 86  0   2   259 250 335 6.23E-56    186
TRINITY_DN100001_c6_g1_i4   ref|XP_017314334.1| PREDICTED: complement C3-like isoform X1 [Ictalurus punctatus]  108259386   XP_017314334    71.053  76  22  1037    810 44  119 1.3E-25 112
TRINITY_DN100001_c6_g2_i3   ref|XP_017314335.1| PREDICTED: complement C3-like isoform X2 [Ictalurus punctatus]  108259386   XP_017314335    90.909  55  5   218 54  173 227 7.23E-25    105
TRINITY_DN100003_c7_g3_i1   ref|XP_017346976.1| PREDICTED: uncharacterized protein LOC108278261 isoform X2 [Ictalurus punctatus]    108278261   XP_017346976    40.678  59  32  224 48  496 551 8.74E-08    48.1
TRINITY_DN100004_c2_g1_i4   ref|XP_017348024.1| PREDICTED: transcription factor 7 isoform X11 [Ictalurus punctatus] 108278866   XP_017348024    95.238  63  3   177 365 1   63  8.65E-31    125
TRINITY_DN100004_c2_g4_i2   ref|XP_017323173.1| PREDICTED: transcription factor 7-like 1-B [Ictalurus punctatus]    108265410   XP_017323173    92.265  181 14  412 954 402 582 2.37E-46    168
TRINITY_DN100004_c2_g6_i1   ref|XP_017323173.1| PREDICTED: transcription factor 7-like 1-B [Ictalurus punctatus]    108265410   XP_017323173    100 38  0   7   120 364 401 2.04E-19    84
TRINITY_DN100004_c5_g1_i1   ref|XP_017326725.1| PREDICTED: uncharacterized protein LOC108267243 [Ictalurus punctatus]   108267243   XP_017326725    53.086  81  30  37  255 447 527 1.79E-18    81.6
TRINITY_DN100006_c1_g1_i1   ref|XP_017331360.1| PREDICTED: zinc finger MYM-type protein 4-like isoform X3 [Ictalurus punctatus] 108269816   XP_017331360    51.111  45  22  354 220 28  72  0.00000501  45.1
TRINITY_DN100006_c6_g3_i4   ref|XP_017332404.1| PREDICTED: bifunctional epoxide hydrolase 2 [Ictalurus punctatus]   108270338   XP_017332404    90.714  140 13  3270    2851    327 466 1.33E-77    269
TRINITY_DN100006_c6_g4_i3   ref|XP_017332404.1| PREDICTED: bifunctional epoxide hydrolase 2 [Ictalurus punctatus]   108270338   XP_017332404    79.07   43  9   2044    1916    284 326 1.08E-13    75.5
TRINITY_DN100006_c6_g5_i1   ref|XP_017346197.1| PREDICTED: sortilin-related receptor isoform X2 [Ictalurus punctatus]   108277790   XP_017346197    100 25  0   78  4   499 523 3.53E-09    55.1
TRINITY_DN100008_c3_g2_i8   ref|XP_017331889.1| PREDICTED: prolactin regulatory element-binding protein isoform X1 [Ictalurus punctatus]    108270080   XP_017331889    87.5    264 29  539 1318    19  282 1.03E-134   429
TRINITY_DN100010_c4_g4_i2   ref|XP_017339405.1| PREDICTED: voltage-dependent T-type calcium channel subunit alpha-1G isoform X7 [Ictalurus punctatus]   108274069   XP_017339405    91.892  37  3   1360    1250    1975    2011    4.63E-12    70.5
TRINITY_DN100011_c5_g3_i1   ref|XP_017321691.1| PREDICTED: lactase-like protein [Ictalurus punctatus]   108264568   XP_017321691    96.552  29  1   87  1   527 555 1.63E-11    62
TRINITY_DN100011_c6_g1_i1   ref|XP_017310205.1| PREDICTED: engulfment and cell motility protein 1 isoform X1 [Ictalurus punctatus]  108257187   XP_017310205    100 22  0   67  2   362 383 0.00000161  50.8
TRINITY_DN100012_c2_g1_i2   ref|XP_017351172.1| PREDICTED: disks large homolog 1 isoform X16 [Ictalurus punctatus]  108280564   XP_017351172    96.078  153 6   642 184 139 291 4.57E-87    298
TRINITY_DN100013_c0_g1_i3   ref|XP_017329849.1| PREDICTED: sorbin and SH3 domain-containing protein 2-like isoform X17 [Ictalurus punctatus]    108268954   XP_017329849    76.687  163 34  2448    1972    104 266 1.61E-57    218
TRINITY_DN100014_c3_g1_i1   sp|O00370|LORF2_HUMAN   LINE-1 retrotransposable element ORF2 protein OS=Homo sapiens OX=9606 PE=1 SV=1 O00370  0   43.59   78  42  455 228 768 845 6.15E-13    68.6
TRINITY_DN100014_c7_g1_i4   ref|XP_017349364.1| PREDICTED: uncharacterized protein LOC108279543 [Ictalurus punctatus]   108279543   XP_017349364    40.984  61  36  1204    1386    103 163 4.35E-14    60.5
TRINITY_DN100014_c8_g1_i1   ref|XP_017326732.1| PREDICTED: uncharacterized protein LOC108267251 [Ictalurus punctatus]   108267251   XP_017326732    73.438  64  17  478 669 281 344 1.24E-18    85.1
TRINITY_DN100014_c8_g2_i1   ref|XP_017317911.1| PREDICTED: uncharacterized protein LOC108261586 [Ictalurus punctatus]   108261586   XP_017317911    62.281  114 41  1324    1659    475 588 1.1E-57 103

Now, I want to extracted the line in file 2 according the first column that matched with file 1, I have two scripts:

grep -Fwf file1.txt file2.txt > results

awk 'NR==FNR{x[$0];next}{for(i in x)if($0~i)print}' file1.txt file2.txt

But the results only contain one line:

TRINITY_DN100014_c8_g1_i1   ref|XP_017326732.1| PREDICTED: uncharacterized protein LOC108267251 [Ictalurus punctatus]   108267251   XP_017326732    73.438  64  17  478 669281  344 1.24E-18    85.1

It should not be correct, because thiere is 19 line in file 2 that can match with file1.

So if you know how should I correct the script..?

Thank you!

rna-seq • 221 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by mxlsherry199230
1

In addition to the excel line ending issue, your grep command won't work because of the -w (grep the exact match). If you remove the -w then

TRINITY_DN100000_c1_g1

of file1.txt will pick out

TRINITY_DN100000_c1_g1_1
TRINITY_DN100000_c1_g1_2
TRINITY_DN100000_c1_g1_3

of file2.txt

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by rbagnall1.4k

Your awk command works perfectly on my system (default Gawk). Your grep command returns nothing for me. I'm running Ubuntu 16.04

Be sure that your files are tab-delimited. Also, if you copied them from a Windows system, then they may have extraneous characters that can be removed with dos2unix

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe48k

Hi, thanks for your kindly reply! I am using MACbook, and the txt file is copied from xlsx file, if I will do anything to make it tab-delimited?

ADD REPLYlink written 4 weeks ago by mxlsherry199230

Perhaps the tab-delimited part is not so key due to this section of your code if($0~i), which ultimately searches for the pattern from file1 in the entire line ofrom file2

How did you copy the data from Excel to TXT?

ADD REPLYlink written 4 weeks ago by Kevin Blighe48k
1

Hi, I just used "command+c" and "command+v" to copy the data...It worked!! at first I copied directly from xlsx, now I paste it first to byword and then paste to txt file, then it worked!!

Thank you!!

ADD REPLYlink written 4 weeks ago by mxlsherry199230

Okay, it is probably related to what Ram said, and something to do with the endline characters.

ADD REPLYlink written 4 weeks ago by Kevin Blighe48k

Excel might use \r instead of \n for line endings. Try head <filename> | cat -te or head <filename> | cat -A (whichever works) to view the invisible characters in your file. If the head prints all of the file, try sed 's/\r/\n/1' filename | head to see if that helps.

ADD REPLYlink written 4 weeks ago by RamRS24k

In Linux dos2unix excel.txt will fix the Windows line endings to be Unix compatible.

ADD REPLYlink written 4 weeks ago by Torst950

Based on the manual, dos2unix excel.txt alone may not work. The default is to convert CRLF to LF. Excel on Mac exports lines ending in CR, not CRLF, so one would need dos2unix -c mac excel.txt

ADD REPLYlink written 4 weeks ago by RamRS24k

Hi all, it all worked, Thank you and have a great day!!

ADD REPLYlink written 4 weeks ago by mxlsherry199230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 755 users visited in the last hour