Question: Match And Merge Information By Id In Two Files
1
gravatar for liupfskygre
6.9 years ago by
liupfskygre190
United States
liupfskygre190 wrote:

Hi all,

I have two files (a.txt; b.txt), columns were separate by tab,

a.txt:

id value1 value2 value3

g1 v1 v2 good

g2 v3 v4 better

b.txt:

id value1 value2

g1 v1 v2

g3 v3 v4

I want to use id as the keywords on use perl to search the same id in the two files,

if id in a==id in b, like g1

then append value3 of g1 in a.txt to the end of line g1 in b.txt

I want to use this method to deal with my GO-gene association file and gff files, and I am just beginning to use Perl. I know it is not a good manner to ask for scripts, so just me some hints on where to begin. Any suggestions are appreciated!

perl id • 8.7k views
ADD COMMENTlink modified 3.5 years ago by Rashedul Islam380 • written 6.9 years ago by liupfskygre190
1
gravatar for Pavel Senin
6.9 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

if your files (sets) are not huge, you can handle this easily using R, without resorting to Perl:

# read data in
a = read.table("~/tmp/a.txt",header=T)
b = read.table("~/tmp/b.txt",header=T)

# merge datasets
merge(a,b,by="id")
  id value1.x value2.x value3 value1.y value2.y
1 g1       v1       v2   good       v1       v2


# extract results
c = merge(a,b,by="id")
res=c[,c(1,6,5,4)]
res
  id value2.y value1.y value3
1 g1       v2       v1   good

# produce output
write.table(res, file="~/tmp/c.txt",quote=F)

R is more interactive - easier to re-run and take care about other things (comparison constraints, etc), plus it allows to run script in batch mode. But Perl is more streamlined and command-line oriented if you after that.

ADD COMMENTlink modified 11 months ago by _r_am31k • written 6.9 years ago by Pavel Senin1.9k

Thank you for your reply! I have a look at your profile, and found, wow, I have introduced several of your paper in my lab journal club about Verrucomicrobia!

ADD REPLYlink written 6.9 years ago by liupfskygre190

thank you, those are not exactly mine, but our team effort, I just performed some data processing

ADD REPLYlink written 6.9 years ago by Pavel Senin1.9k

another questions, how could I keep the rows not include in the merge file, I mean data (a-c) and (b-c)? Thank you!

ADD REPLYlink written 6.9 years ago by liupfskygre190

I read Phil Spector's book Data Manipulation with R, I found 'merge' has several parameters which I could use of! Thanks!

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by liupfskygre190

RStudio is a nice way to keep things tidy. If in R you type ?merge it will show you detailed help about the command.

ADD REPLYlink written 6.9 years ago by Pavel Senin1.9k

Thanks you for this tip!

ADD REPLYlink written 6.9 years ago by liupfskygre190

Merging the two files and then parsing the results for this task is not a good solution--in any language. Also, your comments about Perl are rather puzzling.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Kenosis1.2k

Let me step back a bit. I think that Intersection) is one of the standard (basic) operations from sets theory, and can be handled in a variety of ways in any language. Not handling this, is the "suicide" for any language. R is interactive, Perl is not. edit: biostars doesnt handle that wikipedia link, sorry.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Pavel Senin1.9k

Add only what you need to add--and nothing more. Adding the informational 'noise' by merging everything subsequently required removing that 'noise.' I still have no idea what you mean by "R is interactive, Perl is not."

ADD REPLYlink written 6.9 years ago by Kenosis1.2k

By default, R starts interactively, while Perl is not really designed that way. I might be wrong, but merging (joining) two sets in a deterministic way requires some sort of indexing or sorting and is computationally expensive probably not better than O(nlogn), unless you are ok with some errors. In my solution I read everything in the memory, but merge operation took care about sorting and selection.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Pavel Senin1.9k
1
gravatar for matted
6.9 years ago by
matted7.3k
Boston, United States
matted7.3k wrote:

There are many ways to do this join task. To avoid small bugs from writing your own custom code, and the probable inefficiencies of a naive implementation, you could use the join utility on a Linux system. It should do what you want.

$ join -j 1 -t " " a.txt b.txt
id value1 value2 value3 value1 value2
g1 v1 v2 good v1 v2
ADD COMMENTlink modified 11 months ago by _r_am31k • written 6.9 years ago by matted7.3k

good solution. you may want to add an awk call to shuffle columns in a way OP wants.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Pavel Senin1.9k

The OP's wanted output line pattern consists of four fields:

g1    v1    v2    good

Your routine's output adds two extra fields:

g1 v1 v2 good v1 v2
ADD REPLYlink modified 11 months ago by _r_am31k • written 6.9 years ago by Kenosis1.2k

This is trivial to fix...

ADD REPLYlink written 6.9 years ago by matted7.3k

Why knowingly leave an erroneous answer for the OP to fix?

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Kenosis1.2k
1
gravatar for Rashedul Islam
3.5 years ago by
Canada
Rashedul Islam380 wrote:

You can use dplyr in R as well.

a = read.table("~/tmp/a.txt",header=T)  
b = read.table("~/tmp/b.txt",header=T)

library(dplyr)  
inner_join(a, b)
ADD COMMENTlink modified 11 months ago by _r_am31k • written 3.5 years ago by Rashedul Islam380
0
gravatar for Kenosis
6.9 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

This can be done in two steps: 1) split each line of a.txt and build a hash, where the id is a key and value3 is the key's associated value, and 2) split each line of b.txt and print the line plus value3 if the current id exists as a key in the hash:

use strict;
use warnings;

my ( @cols, %hash );

while (<>) {
    $hash{ $cols[0] } = $cols[-1] if @cols = split;
    last if eof;
}

while (<>) {
    print +( join "\t", @cols, $hash{ $cols[0] } ), "\n"
      if @cols = split and exists $hash{ $cols[0] };
}

Usage: perl script.pl a.txt b.txt [>outFile.txt]

The last, optional parameter directs output to a file.

The last if eof; is placed within the first while since both files would be completely read without it. The + after the print in the second while helps perl (the interpreter) disambiguate what to do.

This one-liner will produce the same results:

perl -ane '$h{$F[0]}=$F[-1];last if eof;END{while(<>){@F=split; s/\K$/\t$h{$F[0]}/ and print if$h{$F[0]}}}' a.txt b.txt [>outFile.txt]

Hope this helps!

ADD COMMENTlink modified 11 months ago by _r_am31k • written 6.9 years ago by Kenosis1.2k

I am astonished by all the discussions and suggestiones you brought. For a beginner, I really learned a lot from this. I have tried the merge way in R, use M=merge(a, b, all=TRUE), a, b is my gene list and DE list, they have a common id column. there about 3000 lines in a and b. after this merge, I think they will merge all things of a and b by the id column, and thing all indentical in a or b would be set NA, as the ?merge said. but the results showed that before 1095 lines all things were what I wanted, but after that all were messed´╝ü I do not know whether merge has limitation for lines or file size. By the way, I think this link illustrate merge very nice http://dss.princeton.edu/training/Merge101R.pdf

ADD REPLYlink modified 11 months ago by _r_am31k • written 6.9 years ago by liupfskygre190

after use Rstudio to import things without using command, I found it worked:

import dataset-->import from txt-->heading, yes; tab, period, none

command showed as following:

> HZ254_phzH_Deseq.out <- read.delim("~/HZ254_phzH_Deseq out.txt", quote="")
> View(HZ254_phzH_Deseq.out)
> HZ254_genedata_ensembl <- read.delim("~/HZ254_genedata_ensembl.txt", quote="")
> View(HZ254_genedata_ensembl)
> M=merge(HZ254_phzH_Deseq.out,HZ254_genedata_ensembl,all=TRUE)
> write.table(M,file='HZ254_DEfull1.xls',row.names=F,quote=F,sep='\t')

command causing mess I mentioned above as following:

a= read.table("HZ254_phzH_Deseq out.txt",header=T,sep='\t')
b= read.table("HZ254_genedata_ensembl.txt",header=T,sep='\t') 
M=merge(a,b,all=TRUE)

and I also use the command derived from Rstudio, it also worked.

So, I think the problem is the way of data import, what is difference between the them?

Thanks!

ADD REPLYlink modified 11 months ago by _r_am31k • written 6.9 years ago by liupfskygre190

Hi liupfskygre.

My apologies for responding so late, but just saw your replies.

Am glad you've found the discussions here beneficial. I've seen so much expertise and have been quite impressed. Am not familiar with Rstudio, so I can't comment about the data import issue.

The best to you.

ADD REPLYlink written 6.8 years ago by Kenosis1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1061 users visited in the last hour