I have three files, each with the columns contig, start position, stop position, coverage and other; an example is shown below. Not all contigs are present in every file, so I want to print a line to another file only if its first, second and third columns are found in all three files. I can then sort them afterwards. I've written a Perl script for this, but I would be embarrassed to show it, and it will take days to run. I'm hoping there is a command-line one-liner or a quick program that can do this.
FileA
IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3
IWGSC_CSS_5AS_scaff_1501710     10000   16194   206     2
IWGSC_CSS_4BL_scaff_7036768     0       10000   270     4
FileB
IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3
IWGSC_CSS_4BL_scaff_7036768     0       10000   170     4
FileC
IWGSC_CSS_4BL_scaff_7036768     0       10000   370     4
Final file
IWGSC_CSS_4BL_scaff_7036768     0       10000       270      170      370
                    
                
                
If you really wanted a command-line option, you could merge the first 3 columns into a single key and then use the join command. Practically speaking, that'd end up being a few lines (you could do it on one, but it'd be really long and overly complicated), so you'd have to decide whether that's acceptable. The file could then be reformatted simply with sed.
I don't clearly understand the file format and the task.
Is this correct?
Do the files contain many lines like this?
Is the id unique?
The first column is a contig of some sort, so it's not unique (cf. the first two lines of the example).
Thanks all for the responses. Yes, the contig is not unique: there are different start and stop positions because the coverage was calculated over 10k windows and many contigs are larger than 10k.
I'll have a look at merging the columns, but do you have an example of how to use join on files that contain different contigs, so that non-matching lines are not joined? I think I need some kind of match on columns 1, 2 and 3, then print those columns followed by column 4 from each of the three files.
I'll expand the question to make the final output clearer.
In the new question why
is not in the output, since it is found in both FileA and FileB?
It's not in FileC; reread the question.
I only want the fourth column from each file (the fifth is redundant data), combined into a single entry for the contig, start and stop position over which the coverage was calculated.
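For what it's worth, here is a minimal Python sketch of the requirement as clarified above, rather than the join/sed route suggested earlier. It assumes whitespace-separated columns, that each (contig, start, stop) triple occurs at most once per file, and that all three files fit in memory. The function names read_coverage and merge_coverage are made up for illustration, and the sample rows are the ones from the question.

```python
def read_coverage(lines):
    """Map (contig, start, stop) -> coverage (column 4) for one file."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        table[tuple(fields[:3])] = fields[3]
    return table


def merge_coverage(*sources):
    """Yield tab-separated rows for keys that occur in every source."""
    tables = [read_coverage(src) for src in sources]
    # Keep only the (contig, start, stop) keys shared by all files.
    shared = set(tables[0]).intersection(*tables[1:])
    # Sort by contig name, then numerically by start and stop.
    for key in sorted(shared, key=lambda k: (k[0], int(k[1]), int(k[2]))):
        yield "\t".join(list(key) + [t[key] for t in tables])


# Sample rows copied from the question.
file_a = [
    "IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3",
    "IWGSC_CSS_5AS_scaff_1501710     10000   16194   206     2",
    "IWGSC_CSS_4BL_scaff_7036768     0       10000   270     4",
]
file_b = [
    "IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3",
    "IWGSC_CSS_4BL_scaff_7036768     0       10000   170     4",
]
file_c = [
    "IWGSC_CSS_4BL_scaff_7036768     0       10000   370     4",
]

rows = list(merge_coverage(file_a, file_b, file_c))
for row in rows:
    print(row)
```

With real data you could pass open file handles instead of lists, e.g. merge_coverage(open("FileA"), open("FileB"), open("FileC")), since the function only iterates over lines. Each file is read once and matching is done by dictionary lookup, so this should avoid the repeated scanning that usually makes a naive script take days.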