Question

find matching gene symbol

0

Entering edit mode

7.2 years ago

zizigolu ★ 4.3k

Hello,

I have two files;

File1

   ****ID_REF   VALUE1   VALUE2   VALUE3****

**ILMN_1762337** 5.650968 5.296173 5.310465

**ILMN_2055271** 5.457874 5.513952 5.453695

**ILMN_1736007** 5.159924 5.342681 5.364833

**ILMN_2383229** 5.564290 5.589798 5.389733

and File 2

**ID_REF   gene-symbol**

ILMN_1673207    

ILMN_1673214    

ILMN_1673215    PCBP1

ILMN_1673222    ZBTB10

ILMN_1673224    

ILMN_1673232    CERS1

ILMN_1673234    PKDREJ

ILMN_1673238    CLEC9A

ILMN_1673241

how I can have a file containing columns of file 1 (except for column 1) and matched column 2 from file 2 (the columns 1 from both files are the same, I need matched gene-symbol.

R gene • 2.0k views

ADD COMMENT • link updated 7.2 years ago by Petr Ponomarenko ★ 2.8k • written 7.2 years ago by zizigolu ★ 4.3k

score 3 · Accepted Answer · 2017-02-10

3

Entering edit mode

7.2 years ago

EagleEye 7.5k

Using R,

file1 <- read.table("file1.txt", header=T, sep="\t")
file2 <- read.table("file2.txt", header=T, sep="\t", fill=TRUE)
mergedFile <- merge(file1, file2, by="ID_REF")
write.table(mergedFile, "./file1_WithSym.txt", sep="\t", row.names=FALSE, quote=FALSE)

You will get a file with these columns

ID_REF <TAB> VALUE1 <TAB> VALUE2 <TAB> VALUE3 <TAB> gene-symbol

ADD COMMENT • link 7.2 years ago by EagleEye 7.5k

0

Entering edit mode

thank you

now I have gene symbol in the last column but some rows don't have gene symbol. how I can remove them?

thanks again

ADD REPLY • link 7.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Is your both files separated by <tab> ?? And check if the first column names in both files are same !!!

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

0

Entering edit mode

thank you,you all right I edited the name.

ADD REPLY • link 7.2 years ago by zizigolu ★ 4.3k

score 3 · Accepted Answer · 2017-02-10

You got the right answer with R already, but maybe a solution with awk will be useful too. It is good to now many ways to do the same thing and awk is part of standard Unix shell (so it is native to Mac's Terminal as well).

awk 'NR==FNR{gene_symbol[$1]=$2}(NR!=FNR&&gene_symbol[$1]){print $0, gene_symbol[$1]}' File2 File1

awk reads files line by line. Here in NR==FNR block, it reads File2 and stores all gene symbols to an array addressed by the first column $1. Then it reads the File1 in block NR!=FNR where it also checks if gene symbol for $1 is not an empty. If it is not empty then it prints the whole line of File1 to stdout and adds another column with the gene symbol.

I do not know what is the meaning of star symbols () in your files and if this is critical to preserving them. You say first columns are identical, but in your example, they are not the same. If column 1 in File1 is the same as column 1 in File2 plus two-star signs (*) before and after it then you can change the awk one-liner to this:

awk 'NR==FNR{gene_symbol["**"$1"**"]=$2}(NR!=FNR&&gene_symbol[$1]){print $0, gene_symbol[$1]}' File2 File1

and if you want *ID_REF in the header of File1 and *ID_REF in the header of File2 you may change awk command to:

awk 'BEGIN{print "   ****ID_REF   VALUE1   VALUE2   VALUE3   gene-symbol****"}NR==FNR{gene_symbol["**"$1"**"]=$2}(NR!=FNR&&gene_symbol[$1]){print $0, gene_symbol[$1]}' File2 File1