Question

how to intersect two files based on three columns

0

Entering edit mode

3.0 years ago

Apex92 ▴ 280

I have two files as below. I want to search for every item in every column in the f1 and see it if exists inside column 3 of the f2 file (for each column of f1 separately). If any exists then write the related chr of that item from the chr column of f2 (the first column of f2).

f1:(two columns)

gene_symbol,lncRNA_symbol
kctd,Malat1
Gas5,Snhg6

f2:(three columns)

chr,gene_type,gene_symbol
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd

So based on the two files mentioned the desired output should be:

f3:

gene_symbol,lncRNA_symbol,gene_chr,lncRNA_chr
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2

I do not have a strong programming background and found this very complex - Even though I have checked multiple sources but have not been able to develop a solution - any help is appreciated. Thanks

python awk bash • 1.2k views

ADD COMMENT • link updated 3.0 years ago by 5heikki 11k • written 3.0 years ago by Apex92 ▴ 280

1

Entering edit mode

3.0 years ago

Alex Reynolds 35k

This tries to deal with cases where information might be missing:

#!/usr/bin/env python

""" so9463765.py -- https://www.biostars.org/p/9463765/ """

import sys
import os
import pandas as pd

f1 = "f1.txt"
f2 = "f2.txt"

if not os.path.exists(f1) or not os.path.exists(f2):
    raise OSError("File(s) not found")

h = [ "gene_symbol", "lncRNA_symbol", "gene_chr", "lncRNA_chr" ]
s = {}
with open(f1, "r") as f1h, open(f2, "r") as f2h:
    f1i = 0
    for line in f1h:
        if f1i > 0:
            (gene_symbol,lncRNA_symbol) = line.rstrip().split(',')
            s[gene_symbol] = { "v" : [gene_symbol,lncRNA_symbol], "t" : "gene_symbol" }
            s[lncRNA_symbol] = { "v" : [gene_symbol,lncRNA_symbol], "t" : "lncRNA_symbol" }
        f1i += 1
    ol = []
    f2i = 0
    for line in f2h:
        if f2i > 0:
            (chrom, gene_type, gene_symbol) = line.rstrip().split(',')
            o = {x: "" for x in h}
            if gene_symbol in s:
                if s[gene_symbol]["t"] == "gene_symbol":
                    o["gene_symbol"] = gene_symbol
                    o["gene_chr"] = chrom
                    o["lncRNA_symbol"] = s[gene_symbol]["v"][1]
                elif s[gene_symbol]["t"] == "lncRNA_symbol":
                    o["gene_symbol"] = s[gene_symbol]["v"][0]
                    o["lncRNA_symbol"] = gene_symbol
                    o["lncRNA_chr"] = chrom
            ol.append(o)
        f2i += 1
    df = pd.DataFrame(ol)
    # collapse by column values
    of = df.groupby(["gene_symbol", "lncRNA_symbol"], as_index=False).agg("".join)
    sys.stdout.write(of.to_csv(index=False))

Output:

% ./so9463765.py
gene_symbol,lncRNA_symbol,gene_chr,lncRNA_chr
Gas5,Snhg6,chr7,chr2
kctd,Malat1,chr1,chr1

A bit convoluted, but hopefully this gets you somewhere.

ADD COMMENT • link 3.0 years ago by Alex Reynolds 35k

0

Entering edit mode

3.0 years ago

5heikki 11k

I'm not sure this is parsing the exact fields you wanted (you set those with the -o in join):

join -1 1 -2 3 -t "," -o 1.1,1.2,2.1 <(sort -t "," -k1,1 f1) <(sort -t "," -k3,3 f2)

ADD COMMENT • link 3.0 years ago by 5heikki 11k

score 2 · Accepted Answer · 2021-04-06

awk 'BEGIN{FS=OFS=","}NR==FNR{gene[$3]=$1; next}{ print ($1, $2, gene[$1], gene[$2] ) }' 2.t 1.t

gene_symbol,lncRNA_symbol,chr,
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2

cat 1.t

gene_symbol,lncRNA_symbol
kctd,Malat1
Gas5,Snhg6

cat 2.t

chr,gene_type,gene_symbol
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd