Question: Merge 2 files based on rs number but keep all info only from the first file
0
gravatar for joshf
23 months ago by
joshf20
Netherlands
joshf20 wrote:

I would like to merge two files by based on one column (rs number). Not all rs numbers in file 1 are in file 2, and not all in file 2 are in file 1. However, I wanna keep ALL lines from file 1 in the output.

File 1 (example):

rs    chr    bp    p
rs1    chr1    729679    0.9456
rs2    chr1    731718    0.7125
rs3    chr1    734349    0.7582
rs4    chr1    736289    0.7807
rs5    chr1    748878    0.3864

File 2 (example):

RSID CODING UTR PROMOTER DHS INTRON
rs2 0 0 0 0 1
rs3 0 0 0 0 1
rs1 0 0 1 1 0
rs4 0 0 0 1 0
rs6 0 1 0 1 0

In the case that an rs number is not found in file 2, I want to give this rs number the annotation: 0 0 0 0 0. so, hypothetically…based on my examples, this should be my output

Output (example):

rs    chr    bp    p    CODING    UTR    PROMOTER    DHS    INTRON
rs1    chr1    729679    0.9456    0    0    1    1    0
rs2    chr1    731718    0.7125    0    0    0    0    1
rs3    chr1    734349    0.7582    0    0    0    0    1
rs4    chr1    736289    0.7807    0    0    0    1    0
rs5    chr1    748878    0.3864    0    0    0    0    0

Merging files is fairly simple with a function like "join" but I have no idea how to keep the lines in file 1 without a column match with file 2, and in such a case, add the 0 0 0 0 0 annotation.

I really hope my question is clear, and any help is appreciated.

awk snp • 1.0k views
ADD COMMENTlink modified 23 months ago by Jorge Amigo9.8k • written 23 months ago by joshf20
4
gravatar for Kamil
23 months ago by
Kamil1.6k
Boston
Kamil1.6k wrote:

If you use Python, check out the pandas documentation for merging dataframes.

You might also consider writing a simple script:

ADD COMMENTlink modified 23 months ago • written 23 months ago by Kamil1.6k
3
gravatar for Pierre Lindenbaum
23 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:

use join with option -a

join -t ' ' -1 1 -2 1 -a 1 <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt) | awk -F ' ' '{printf("%s",$0); if(NF==4) printf(" 0 0 0 0 0");printf("\n");}'

rs1 chr1 729679 0.9456 0 0 1 1 0
rs2 chr1 731718 0.7125 0 0 0 0 1
rs3 chr1 734349 0.7582 0 0 0 0 1
rs4 chr1 736289 0.7807 0 0 0 1 0
rs5 chr1 748878 0.3864 0 0 0 0 0
RSID chr bp p CODING UTR PROMOTER DHS INTRON

 

ADD COMMENTlink written 23 months ago by Pierre Lindenbaum98k
3

I stumbled over a join command without the need of an extra awk call:

join -a 1 -t ' ' -e '0' -o '0,1.2,1.3,1.4,2.2,2.3,2.4,2.5' <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt)

It was said that the -e (Empty) option needs -o (output sequence). The 0 in -o is the joining field, followed by the file number and its column.  Thus if file number 2 has a missing entry, the -e argument is put on each field defined in -o 2.X.

ADD REPLYlink written 23 months ago by michael.ante1.6k

cool , I didn't know this option '-o'

ADD REPLYlink written 23 months ago by Pierre Lindenbaum98k
1
gravatar for Zhilong Jia
23 months ago by
Zhilong Jia1.0k
London
Zhilong Jia1.0k wrote:

In R with the help of dplyr package, 

 

# 1.t and 2.t are your two files

x1 <- read.table("./1.t", header=TRUE)
x2 <- read.table("./2.t", header=TRUE)

x3 <- dplyr::left_join(x1, x2, by=c("rs"="RSID"))

# Change NA to 0
x3[which(is.na(x3), arr.ind=TRUE)]=0

 

Show x3:

   rs  chr     bp      p CODING UTR PROMOTER DHS INTRON
1 rs1 chr1 729679 0.9456      0   0        1   1      0
2 rs2 chr1 731718 0.7125      0   0        0   0      1
3 rs3 chr1 734349 0.7582      0   0        0   0      1
4 rs4 chr1 736289 0.7807      0   0        0   1      0
5 rs5 chr1 748878 0.3864      0   0        0   0      0
ADD COMMENTlink modified 23 months ago • written 23 months ago by Zhilong Jia1.0k
1
gravatar for matteof
23 months ago by
matteof10
Italy
matteof10 wrote:

Really fast with Python: 

D = {}
f = open(FIRST_FILE, "r")
for l in f: D[ l.split("\t")[0].strip() ] = l.strip()
f.close()

f = open(SECOND_FILE, "r")
for l in f:
    rsid = l.split("\t")[0].strip()
try:
    print l.strip() + "\t" + D[rsid]
f.close()

 

 

ADD COMMENTlink written 23 months ago by matteof10
1
gravatar for Jorge Amigo
23 months ago by
Jorge Amigo9.8k
Santiago de Compostela, Spain
Jorge Amigo9.8k wrote:

and the awk alternative

awk '{
if (/^rs[0-9]/) { if (FNR==NR) f1[$1]=$0; else f2[$1]=$0 }
} END {
for (rs in f1) {
 if (f2[rs]) { print f1[rs],f2[rs] }
 else { print f1[rs],rs,0,0,0,0,0 }
}
}' file1.txt file2.txt \
| perl -pe 's/ +/\t/g' | cut -f-4,6- | sort -V
ADD COMMENTlink modified 23 months ago • written 23 months ago by Jorge Amigo9.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1394 users visited in the last hour