Question: Merge 2 files based on rs number but keep all info only from the first file
0
gravatar for joshf
2.1 years ago by
joshf20
Netherlands
joshf20 wrote:

I would like to merge two files by based on one column (rs number). Not all rs numbers in file 1 are in file 2, and not all in file 2 are in file 1. However, I wanna keep ALL lines from file 1 in the output.

File 1 (example):

rs    chr    bp    p
rs1    chr1    729679    0.9456
rs2    chr1    731718    0.7125
rs3    chr1    734349    0.7582
rs4    chr1    736289    0.7807
rs5    chr1    748878    0.3864

File 2 (example):

RSID CODING UTR PROMOTER DHS INTRON
rs2 0 0 0 0 1
rs3 0 0 0 0 1
rs1 0 0 1 1 0
rs4 0 0 0 1 0
rs6 0 1 0 1 0

In the case that an rs number is not found in file 2, I want to give this rs number the annotation: 0 0 0 0 0. so, hypothetically…based on my examples, this should be my output

Output (example):

rs    chr    bp    p    CODING    UTR    PROMOTER    DHS    INTRON
rs1    chr1    729679    0.9456    0    0    1    1    0
rs2    chr1    731718    0.7125    0    0    0    0    1
rs3    chr1    734349    0.7582    0    0    0    0    1
rs4    chr1    736289    0.7807    0    0    0    1    0
rs5    chr1    748878    0.3864    0    0    0    0    0

Merging files is fairly simple with a function like "join" but I have no idea how to keep the lines in file 1 without a column match with file 2, and in such a case, add the 0 0 0 0 0 annotation.

I really hope my question is clear, and any help is appreciated.

awk snp • 1.1k views
ADD COMMENTlink modified 2.1 years ago by Jorge Amigo10.0k • written 2.1 years ago by joshf20
4
gravatar for Kamil
2.1 years ago by
Kamil1.7k
Boston
Kamil1.7k wrote:

If you use Python, check out the pandas documentation for merging dataframes.

You might also consider writing a simple script:

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Kamil1.7k
3
gravatar for Pierre Lindenbaum
2.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

use join with option -a

join -t ' ' -1 1 -2 1 -a 1 <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt) | awk -F ' ' '{printf("%s",$0); if(NF==4) printf(" 0 0 0 0 0");printf("\n");}'

rs1 chr1 729679 0.9456 0 0 1 1 0
rs2 chr1 731718 0.7125 0 0 0 0 1
rs3 chr1 734349 0.7582 0 0 0 0 1
rs4 chr1 736289 0.7807 0 0 0 1 0
rs5 chr1 748878 0.3864 0 0 0 0 0
RSID chr bp p CODING UTR PROMOTER DHS INTRON

 

ADD COMMENTlink written 2.1 years ago by Pierre Lindenbaum101k
3

I stumbled over a join command without the need of an extra awk call:

join -a 1 -t ' ' -e '0' -o '0,1.2,1.3,1.4,2.2,2.3,2.4,2.5' <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt)

It was said that the -e (Empty) option needs -o (output sequence). The 0 in -o is the joining field, followed by the file number and its column.  Thus if file number 2 has a missing entry, the -e argument is put on each field defined in -o 2.X.

ADD REPLYlink written 2.1 years ago by michael.ante1.8k

cool , I didn't know this option '-o'

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum101k
1
gravatar for Zhilong Jia
2.1 years ago by
Zhilong Jia1.1k
London
Zhilong Jia1.1k wrote:

In R with the help of dplyr package, 

 

# 1.t and 2.t are your two files

x1 <- read.table("./1.t", header=TRUE)
x2 <- read.table("./2.t", header=TRUE)

x3 <- dplyr::left_join(x1, x2, by=c("rs"="RSID"))

# Change NA to 0
x3[which(is.na(x3), arr.ind=TRUE)]=0

 

Show x3:

   rs  chr     bp      p CODING UTR PROMOTER DHS INTRON
1 rs1 chr1 729679 0.9456      0   0        1   1      0
2 rs2 chr1 731718 0.7125      0   0        0   0      1
3 rs3 chr1 734349 0.7582      0   0        0   0      1
4 rs4 chr1 736289 0.7807      0   0        0   1      0
5 rs5 chr1 748878 0.3864      0   0        0   0      0
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Zhilong Jia1.1k
1
gravatar for matteof
2.1 years ago by
matteof10
Italy
matteof10 wrote:

Really fast with Python: 

D = {}
f = open(FIRST_FILE, "r")
for l in f: D[ l.split("\t")[0].strip() ] = l.strip()
f.close()

f = open(SECOND_FILE, "r")
for l in f:
    rsid = l.split("\t")[0].strip()
try:
    print l.strip() + "\t" + D[rsid]
f.close()

 

 

ADD COMMENTlink written 2.1 years ago by matteof10
1
gravatar for Jorge Amigo
2.1 years ago by
Jorge Amigo10.0k
Santiago de Compostela, Spain
Jorge Amigo10.0k wrote:

and the awk alternative

awk '{
if (/^rs[0-9]/) { if (FNR==NR) f1[$1]=$0; else f2[$1]=$0 }
} END {
for (rs in f1) {
 if (f2[rs]) { print f1[rs],f2[rs] }
 else { print f1[rs],rs,0,0,0,0,0 }
}
}' file1.txt file2.txt \
| perl -pe 's/ +/\t/g' | cut -f-4,6- | sort -V
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Jorge Amigo10.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1360 users visited in the last hour