Combining all columns in multiple files into one based on shared common field
Entering edit mode
2.9 years ago
kmyers2 ▴ 60

I have a set of 38 txt files that all have a similar format: the first column is gene ID and the remaining columns are expression data. I want to join all of these files into one and retain all columns if and only if the first column of gene IDs are the same.

I have tried merge in Pandas, but get a memory error when I try (it does work with other data files though):

df_list = []

all_files = glob.glob("*meanCenter_results.txt")

for file in all_files:
    df_list.append(pd.read_csv(file, header = 0, sep = "\t", index_col = 0))

big_df = reduce(lambda left, right: pd.merge(left, right, on = "ORF_Gene", how = "outer"), df_list)
big_df.to_csv("All_GEO_Expression_Data_MeanCentered_Combined.txt", header = True, index = True, sep = "\t")

I found this code online and it seems to do what I want, but I'm really new to this particular kind of programming. For this I remanamed one file a.txt and the rest b1.txt to b37.txt:

temp=$(cat a.txt);for i in b*; do temp=$(echo $temp | join -j1 - $i); done; echo $temp

but this just writes it to the terminal window and it's too much to follow.

Can you suggest a way to get a single file, containing all the columns of data with the first column being the shared gene ID?

python pandas join merge temp • 2.0k views
Entering edit mode
2.9 years ago
Ram 36k

this just writes it to the terminal window and it's too much to follow.

Add >myoutfile.txt to the end of the command to redirect output to a file.

Also, you could literally just do

join a.txt b*.txt >out.txt
Entering edit mode


When I try the join command you suggest though, it provides the following without running:

usage: join [-a fileno | -v fileno ] [-e string] [-1 field] [-2 field]
        [-o list] [-t char] file1 file2

I'm not sure what to do. I've never used join before. I am able to join files if I provide the file names though.

Entering edit mode

Looks like you have a different version of join - probably BSD join that comes by default on a mac. I'd recommend installing gnu-coreutils through homebrew so you're working with GNU binaries (which are mostly better than BSD binaries).

If you'd rather not do that, you'll need to join files one by one. Essentially,

join a.txt b1.txt >out.txt
join out.txt b2.txt >out2.txt
join out2.txt b3.txt >out.txt
join out.txt b4.txt >out2.txt
join out2.txt b37.txt >out.txt

or use a loop to get past manually doing this. I strongly recommend switching to GNU coreutils. Here is a great guide to get you started on that path.


Login before adding your answer.

Traffic: 2327 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6