match three columns in three files and print to one
1
0
Entering edit mode
7.7 years ago
rob234king ▴ 600

I have three files with below format of contig, start position, stop position, coverage, other, example shown below. Not all the contigs are present in each file so I want to print to another file if the first, second and third column are found in each of the three files. I can then sort them afterwards. I've written a perl script for this but I would be embaressed to show it, I'm hopeful that there is a command line one liner that can do this or quick program? as mine will take days.

FileA

IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3
IWGSC_CSS_5AS_scaff_1501710     10000   16194   206     2
IWGSC_CSS_4BL_scaff_7036768     0       10000   270     4


FileB

IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3
IWGSC_CSS_4BL_scaff_7036768     0       10000   170     4


FileC

IWGSC_CSS_4BL_scaff_7036768     0       10000   370     4


Final file

IWGSC_CSS_4BL_scaff_7036768     0       10000       270      170      370

command-line perl • 2.5k views
0
Entering edit mode

If you really wanted a command line option, then you could merge the first 3 columns

awk '{printf("%s:%s:%s\t%s\t%s\n", $1,$2,$3,$4,$5)}' foo.txt > foo.merged.txt  and then use the join command. Practically speaking, that'd end up being a few lines (you could do it on one, but it'd be really long and overly complicated), so you'd have to decide if that's acceptable or not. The file could then be reformatted simply with sed. ADD REPLY 0 Entering edit mode I dont clearly understand the file format and the task IWGSC_CSS_5AS_scaff_1501710(id) 0(start) 10000(end) 229(coverage) 3(other)  is this correct? do the files contain many lines like this? does the id is unique? ADD REPLY 0 Entering edit mode The first column is a contig of some sort, so it's not unique (cf. the first two lines of the example). ADD REPLY 0 Entering edit mode thanks all for the response, yes the contig is not unique as there are different start and stop positions as the coverage has been calculated over a 10k window and many contigs larger than 10k. I'll have a look at merging the columns but do you have an example please how to use join to merge the files of which have different contigs in so not joining the same line (need some kind of match I think then print columns 1,2,3 then 4 from each of the three files. I'll expand upon the question to make it more clear of the final output. ADD REPLY 0 Entering edit mode In the new question why IWGSC_CSS_5AS_scaff_1501710 0 10000 229 3  is not in the output since it is found on both file a and b? ADD REPLY 1 Entering edit mode It's not in fileC, reread the question. ADD REPLY 0 Entering edit mode I only want the fourth column from each file, the fifth is redundant data, in a single entry that corresponds to the contig and start and stop position to which the coverage was calculated. ADD REPLY 4 Entering edit mode 7.7 years ago Here's an example of a multi-liner, which could be one line but would be overly complicated: #Merge the first 3 columns of each file awk '{printf("%s:%s:%s\t%s\t%s\n",$1,$2,$3,$4,$5)}' A.txt > A.merged.txt
awk '{printf("%s:%s:%s\t%s\t%s\n", $1,$2,$3,$4,$5)}' B.txt > B.merged.txt awk '{printf("%s:%s:%s\t%s\t%s\n",$1,$2,$3,$4,$5)}' C.txt > C.merged.txt
#Perform the join. The "<(sort foo.merged)" is just process substitution
#and "-" mean "standard input"
#You could write everything to files if it makes things clearer
join -j 1 <(sort A.merged) <(sort B.merged) | join -j 1 - <(sort C.merged) | sed 's/:/\t/g' | sed 's/ /\t/g' | cut -f 1,2,3,4,6,8


sed 's/:/\t/g' reverts the awk lines above. Apparently join mucks with the formatting, so I replaced the spaces it adds (I'm sure there's an option to do this automatically, but I'm too lazy to look it up) with tabs via sed 's/ /\t/g'`.

0
Entering edit mode

You have saved me hours of messing about trying to do this, thanks very much!