I have a set of 38 txt files that all have a similar format: the first column is gene ID and the remaining columns are expression data. I want to join all of these files into one and retain all columns if and only if the first column of gene IDs are the same.
I have tried merge in pandas, but I get a MemoryError when I run it on these files (the same code does work on other data files):
import glob
from functools import reduce

import pandas as pd

df_list = []
all_files = glob.glob("*meanCenter_results.txt")
for file in all_files:
    df_list.append(pd.read_csv(file, header=0, sep="\t", index_col=0))
big_df = reduce(lambda left, right: pd.merge(left, right, on="ORF_Gene", how="outer"), df_list)
big_df.to_csv("All_GEO_Expression_Data_MeanCentered_Combined.txt", header=True, index=True, sep="\t")
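To make the problem easier to reproduce, here is a minimal self-contained version of what I'm attempting, with two tiny stand-in tables in place of the real 38 files (this small case runs fine; on the real data the reduce/merge chain is where the MemoryError appears):

```python
from functools import reduce

import pandas as pd

# Tiny stand-ins for the real files; each real file is loaded with
# pd.read_csv(f, sep="\t", index_col=0), so the gene-ID column
# becomes a named index.
a = pd.DataFrame({"s1": [1.0, 2.0]},
                 index=pd.Index(["g1", "g2"], name="ORF_Gene"))
b = pd.DataFrame({"s2": [3.0, 4.0]},
                 index=pd.Index(["g1", "g2"], name="ORF_Gene"))

# The same chain of pairwise outer merges, joining on the
# ORF_Gene index level.
big_df = reduce(
    lambda left, right: pd.merge(left, right, on="ORF_Gene", how="outer"),
    [a, b],
)
```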
I found this code online and it seems to do what I want, but I'm really new to this kind of shell programming. For this I renamed one file to a.txt and the rest to b1.txt through b37.txt:
temp=$(cat a.txt); for i in b*; do temp=$(echo "$temp" | join -j1 - "$i"); done; echo "$temp"
but this just writes it to the terminal window and it's too much to follow.
Can you suggest a way to get a single file, containing all the columns of data with the first column being the shared gene ID?
Thanks!
When I try the join command you suggest, though, it prints the following error without running:
I'm not sure what to do. I've never used join before. I am able to join files if I provide the file names though.
Looks like you have a different version of join - probably the BSD join that ships by default on macOS. I'd recommend installing GNU coreutils through Homebrew (the package is named coreutils) so you're working with GNU binaries, which are mostly better than their BSD counterparts.
If you'd rather not do that, you'll need to join the files one by one, writing each intermediate result to a file and joining the next file against it.
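A sketch of that one-by-one chain, using tiny stand-in files here so it is runnable as-is (the real inputs must be sorted on the gene-ID column, and the -t flag keeps the output tab-separated):

```shell
# Tiny stand-in inputs: tab-separated, first column is the gene ID,
# already sorted on it (join requires sorted input).
printf 'g1\t1.0\ng2\t2.0\n' > a.txt
printf 'g1\t4.0\ng2\t5.0\n' > b1.txt
printf 'g1\t7.0\ng2\t8.0\n' > b2.txt

t=$(printf '\t')                        # a literal tab for join's -t flag

# Join two files at a time, writing each intermediate result to disk
# instead of stuffing it into a shell variable.
join -t "$t" -j1 a.txt    b1.txt > tmp1.txt
join -t "$t" -j1 tmp1.txt b2.txt > combined.txt
cat combined.txt
```

With the real data you would continue the chain through b37.txt; combined.txt ends up with the shared gene IDs in the first column and all the expression columns after it.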
Using a loop gets you past doing this manually. I strongly recommend switching to GNU coreutils, though. Here is a great guide to get you started on that path.
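The loop version of the same idea, again with tiny stand-in files so the sketch is runnable as-is (same assumptions: tab-separated files, sorted on the gene-ID column):

```shell
# Tiny stand-in inputs (sorted on the first, gene-ID column).
printf 'g1\t1.0\ng2\t2.0\n' > a.txt
printf 'g1\t4.0\ng2\t5.0\n' > b1.txt
printf 'g1\t7.0\ng2\t8.0\n' > b2.txt

t=$(printf '\t')            # a literal tab for join's -t flag
cp a.txt combined.txt

# Fold each b*.txt into the running result, one join per file.
for f in b*.txt; do
    join -t "$t" -j1 combined.txt "$f" > combined.tmp
    mv combined.tmp combined.txt
done
```

combined.txt then holds one row per shared gene ID with every file's expression columns appended. Note that with the real names b10.txt globs before b2.txt, which is harmless here since it only changes the column order.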