Question: how to match column entries among files?
0
gravatar for v27
12 weeks ago by
v2710
v2710 wrote:

I have 3 files : File1

WP_082277722.10 gi|37527581|ref|NP_930925.1|    59.420  138 46  3   141 275 2895    3025    1.34e-41    157 49
WP_082277722.10 gi|91779924|ref|YP_555132.1|    49.254  134 56  3   151 276 2074    2203    6.02e-27    115 46
WP_082277722.10 gi|212640077|ref|YP_002316597.1|    33.981  103 57  5   180 275 403 501 0.008   42.7    35
WP_082277722.10 gi|384147149|ref|YP_005529965.1|    37.681  69  35  3   210 276 2019    2081    0.032   41.2    24
WP_082277722.10 gi|212640328|ref|YP_002316848.1|    32.323  99  56  5   184 275 469 563 0.052   40.0    33
WP_082277722.10 gi|338738658|ref|YP_004675620.1|    26.761  71  43  1   205 275 4   65  0.99    35.0    26

File 2

158333741,Acaryochloris_marina_MBIC11017_uid58167,158333741,432,1,432,COG0001,0,
158339504,Acaryochloris_marina_MBIC11017_uid58167,158339504,491,1,491,COG0001,0,
379012832,Acetobacterium_woodii_DSM_1030_uid88073,379012832,430,1,430,COG0001,0,
302391336,Acetohalobium_arabaticum_DSM_5501_uid51423,302391336,441,1,441,COG0001,0,
311103820,Achromobacter_xylosoxidans_A8_uid59899,311103820,425,1,425,COG0001,0,
332795879,Acidianus_hospitalis_W1_uid66875,332795879,369,1,369,COG0001,0,
332796307,Acidianus_hospitalis_W1_uid66875,332796307,416,1,416,COG0001,0,
37527581,Acetohalobium_arabaticum_DSM_5501_uid51423,302391336,441,1,441,COG0001,0

,

File3

    COG0001 H   Glutamate-1-semialdehyde aminotransferase
COG0002 E   N-acetyl-gamma-glutamylphosphate reductase
COG0003 P   Anion-transporting ATPase, ArsA/GET3 family
COG0004 P   Ammonia channel protein AmtB
COG0005 F   Purine nucleoside phosphorylase
COG0006 E   Xaa-Pro aminopeptidase
COG0007 H   Uroporphyrinogen-III methylase (siroheme synthase)
COG0008 J   Glutamyl- or glutaminyl-tRNA synthetase
COG0009 J   tRNA A37 threonylcarbamoyladenosine synthetase subunit TsaC/SUA5/YrdC
COG0010 E   Arginase family enzyme
COG0011 S   Uncharacterized conserved protein YqgV, UPF0045/DUF77 family
COG0012 J   Ribosome-binding ATPase YchF, GTP1/OBG family

Expected outcome :

WP_082277722.10 gi|37527581|ref|NP_930925.1|  37527581,Acetohalobium_arabaticum_DSM_5501_uid51423,302391336,441,1,441,COG0001,0  COG0001    H   Glutamate-1-semialdehyde aminotransferase
command line programming • 234 views
ADD COMMENTlink modified 12 weeks ago by RamRS18k • written 12 weeks ago by v2710

what have you tried ?

ADD REPLYlink written 12 weeks ago by Pierre Lindenbaum113k

i can do this in excel ..... but i have 2500 files for which i need to repeat the procedure .

ADD REPLYlink written 12 weeks ago by v2710
3

enter image description here

ADD REPLYlink written 12 weeks ago by Pierre Lindenbaum113k
1

Stealing that.... :P

ADD REPLYlink written 12 weeks ago by jrj.healey7.7k
1

Write a macro? You have to get into a programming mindset, end of the day it doesn't matter what tool you use, only matters how you use that tool and how well you know it. Excel is deceptively complex. You will encounter problems that you cannot explain, get to the root of or even reproduce at times. Excel does not preserve state or document logic. When it comes to problems such as yours, it's just a glorified calculator.

Use R or python, something that helps you learn the programming mindset and document your actions, as you will need to do this again in 6 months' time and programming means you'll have to spend only a fraction of this time then.

ADD REPLYlink written 12 weeks ago by RamRS18k

Excel is a hammer. The problem before you is a screw. Yes, it's a big hammer, and if you swing hard enough, you can make Excel do what you want, but it would be smarter to find a screwdriver. R (or Rstudio might be a bit easier to work with) can do this.

ADD REPLYlink written 12 weeks ago by swbarnes24.2k

This question is presented without a proper explanation in body of the post. Please edit the original question and add this information there.

Instead of just posting parts of three files you should spell out which parts of the three are supposed to contain a key that can be used to relate the records. I tried to match a couple three numbers from file 1 but could not do it visually.

WP_082277722.10 gi|384147149|ref|YP_005529965.1| Is the gi number 384147149 supposed to be in file 2? Then one needs to match the COG id from that line in file 2 with the list in file 3?

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by genomax57k

The examples posted here seem incomplete, which OP should have mentioned. While OP does give an example of desired output, they should also point out the logic.

I suspect this is part of their learning the mindset though, so once they do learn they will no longer need any help with this sort of problem :-)

ADD REPLYlink written 12 weeks ago by RamRS18k

Don't forget to follow up on your previous threads, multiple posts are without accepted answer:

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by RamRS18k
2
gravatar for RamRS
12 weeks ago by
RamRS18k
Houston, TX
RamRS18k wrote:

Use R, not the command line. You will need to read in the data files, preprocess them a little to clean and get your identifiers in separate columns, and then use functions such as merge or dplyr joins to achieve your goal.

This is a good exercise, so I will not be giving you the code.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by RamRS18k
3

Use linux join , not R , You will need to read in the data files, extract a column with awk -F '|' '{printf("%s\t%s\n",$2,$0);}' then a little to clean and get your identifiers in separate columns.

This is a good exercise, so I will not be giving you the code.

;o)

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Pierre Lindenbaum113k
2

OP's going to need either a bunch of temporary files or a bunch of process substitutions. I'd go with R, this is sophisticated enough for a beginner to use a easier tool than plain linux, I think. I mean, you would obviously go for linux join and paste and whatnot, and I'd dabble too, but R is just easier here.

ADD REPLYlink written 12 weeks ago by RamRS18k

of course, you're right, I'm just kidding :o)

ADD REPLYlink written 12 weeks ago by Pierre Lindenbaum113k

could you please give me the code .. i have to repeat this for 2500 files. code would make it a lot easier for me

ADD REPLYlink written 12 weeks ago by v2710
4

You should at least show some effort as Pierre pointed out above. He and Ram will no doubt be happy to help you, but not do your work for you ;)

You will never improve your skills if you don't try and get your hands dirty :)

ADD REPLYlink written 12 weeks ago by jrj.healey7.7k

just running on a deadline right now...

ADD REPLYlink written 12 weeks ago by v2710

OK, I will give you the code for file1 only. You should be able to do the rest yourself.

Let's say the file is called file1.txt. In R you'll need:

f1 <- read.table('file1.txt', header=FALSE, sep="\t");
f1 <- cbind(f1, t(data.frame(strsplit(f1[,2], split="|", fixed=TRUE))))

This will split the delimiters in file1, so you can merge objects.

step1 <- merge(x=f1, y=f2, by.x = <file1_colname>, by.y=<file2_column1name>, all=T)
step2 <- merge(x=step1, y=f3, by.x = <col_with_COGids>, by.y=<file3_column1name>, all=T)

step2 will have the data you need, which you can then use apply and paste(collapse=",") to get the remaining columns. Talk to your supervisor and see if you can get your deadline extended - this task might take a while.

Hint: Once you read file2.txt into f2, Just use apply and paste(collapse=",") on everything but the ID column before you merge. That way, your processing at the end can be avoided and you can just rearrange and omit columns to get to your desired output.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by RamRS18k

thank you for the code :)

ADD REPLYlink written 12 weeks ago by v2710
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1981 users visited in the last hour