Question: merge multiple files by commom column names
0
gravatar for elhamidihay
27 days ago by
elhamidihay30
elhamidihay30 wrote:

I have multiple whitespace delimited files with a different number of columns. I would like to join them and keep the common column names.

test1_file

id SNP1 SNP2 SNP3
162 AA BB AA
172 AB BB AB

test2_file

id SNP1 SNP2 SNP3 SNP4
189 BB BB AA AB
1590 AA AB BB AA

test3_file

id SNP8 SNP2 SNP3 SNP9
182 AA BB AA AA
132 BB BB AA BB

desired output file:

id SNP2 SNP3
162 BB AA
172 BB AB
189 BB AA
1590 AB BB
182 BB AA
132 BB AA

my code: Its not working, and i would like to add more files than just two

import pandas as pd

test1_file = pd.read_csv('2014_07_03', delim_whitespace = True)
test2_file = pd.read_csv('2019_08_19', delim_whitespace = True)

f = pd.merge(left=test1_file, right=test2_file , how='left', on='MergeCol')
f.to_csv('outfile.csv')
pandas python • 236 views
ADD COMMENTlink modified 27 days ago by finswimmer13k • written 27 days ago by elhamidihay30
5
gravatar for finswimmer
27 days ago by
finswimmer13k
Germany
finswimmer13k wrote:

Hello,

you don't want to "merge" in the terms how pandas use it. You want to append the data to each other and drop columns that have no values. You can use this script, like this: python merge.py test1_file test2_file test3_file

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import sys

import pandas


def get_args():
    parser = argparse.ArgumentParser(prog="merge.py", description="")
    parser.add_argument("--version", action="version", version="%(prog)s 0.1")

    parser.add_argument("input", nargs="+", help="input files")
    parser.add_argument(
        "-o",
        "--output",
        type=argparse.FileType("w"),
        default=sys.stdout,
        help="output file",
    )

    if len(sys.argv) == 1:
        parser.print_help(sys.stderr)
        sys.exit(1)

    args = parser.parse_args()

    return args


def main():
    args = get_args()
    result = pandas.DataFrame()

    for input_file in args.input:
        in_df = pandas.read_csv(input_file, sep=" ", header=0, dtype=object)
        result = result.append(in_df)

    result = result.set_index("id").dropna(axis="columns")

    result.to_csv(args.output, sep=" ")


if __name__ == "__main__":
    main()
ADD COMMENTlink written 27 days ago by finswimmer13k

Thank you very much for your help. i tested it with the sample data above but it doesn't seem to be working.

ADD REPLYlink written 26 days ago by elhamidihay30

Hello,

please add more detail about what "doesn't seem to be working".

Thanks.

fin swimmer

ADD REPLYlink written 26 days ago by finswimmer13k

this is what i get:

  FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  sort=sort,
id


BB

BB
BB
ADD REPLYlink modified 26 days ago by finswimmer13k • written 26 days ago by elhamidihay30

The warning is just a warning. You can ignore it for now or do what it suggest.

But there must be something strange with your input. I've tested it with the data you gave above and that's the output I get:

id SNP2 SNP3
162 BB AA
172 BB AB
189 BB AA
1590 AB BB
182 BB AA
132 BB AA
ADD REPLYlink written 26 days ago by finswimmer13k

Should the input files be a certain format? I get a syntax error.

ADD REPLYlink written 26 days ago by elhamidihay30

The columns must be separated by whitespaces and not tabs. That's the only requirement.

I get a syntax error.

When ever you receive any kind of error or something "doesn't work": Please give us the details!

fin swimmer

ADD REPLYlink written 26 days ago by finswimmer13k

Thanks.The whitespace delimited fixed it.

ADD REPLYlink modified 25 days ago • written 25 days ago by elhamidihay30

I VERY much appreciate your help

ADD REPLYlink written 25 days ago by elhamidihay30
3
gravatar for shenwei356
27 days ago by
shenwei3565.0k
China
shenwei3565.0k wrote:

A solution using parallel and csvtk:

# converting space-delimited to tab-delimited
l$ s *.txt | parallel 'csvtk space2tab {} > {}.tsv'

# retrieving colnames
$ ls *.txt.tsv | parallel 'csvtk headers -t {} | csvtk cut -t -f 2 > {}.colnames'

# common colnames
$ csvtk inter *.colnames
id
SNP3
SNP2

# cut by common colnames
$ ls *.txt.tsv | parallel 'csvtk cut -t -f $(csvtk inter *.colnames | paste -s -d ,) {} > {}.selected'

# concatenating
$ csvtk concat -t *.selected
id      SNP3    SNP2
162     AA      BB
172     AB      BB
189     AA      BB
1590    BB      AB
182     AA      BB
132     AA      BB
ADD COMMENTlink written 27 days ago by shenwei3565.0k

thank you for your help. Is there way to not have it print to the screen. i'm working with very large files.

ADD REPLYlink written 26 days ago by elhamidihay30
1

Learn some basic knowledge of SHELL: I/O Redirection

ADD REPLYlink written 26 days ago by shenwei3565.0k

thanks. with real datasets. i get an error message saying:

csvtk: Argument list too long

ADD REPLYlink written 25 days ago by elhamidihay30
2
gravatar for h.mon
27 days ago by
h.mon28k
Brazil
h.mon28k wrote:

Here is a solution with common unix tools.

Get number of files to be merged:

nfiles=$(ls -1 *_file | wc -l)

Get columns in common:

columns=$(head -q -n1 *_file \
  | tr " " "\n" \
  | sort \
  | uniq -c \
  | awk -v nf=$nfiles '$1 == nf {print $2}' \
  | tr "\n" " ")

Create awk script to print selected columns

cat << 'EOF' > t.awk
#https://unix.stackexchange.com/a/25144
BEGIN {
    split(cols,out," ")
}
NR==1 {
    for (i=1; i<=NF; i++)
        ix[$i] = i
}
NR>1 {
    for (i in out)
        printf "%s%s", $ix[out[i]], OFS
    print ""
}
EOF

Print header and the loop over files, printing selected columns

echo "$columns" >output.snps
for i in *_files
do
    awk -f t.awk -v cols="$columns" $i >>output.snps
done
ADD COMMENTlink written 27 days ago by h.mon28k

Hi,

Thanks for your help. it works with a small test data but with real datasets, it gives me an error in the last awk loop saying that

:"line 29: /bin/awk: Argument list too long"

ADD REPLYlink written 26 days ago by elhamidihay30

What are the sizes (rows vs columns) of the real datasets?

ADD REPLYlink written 26 days ago by h.mon28k

the largest file has 2500 rows and 56000 columns

ADD REPLYlink written 26 days ago by elhamidihay30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1000 users visited in the last hour