I have three files, each with the columns contig, start position, stop position, coverage and other (example shown below). Not all contigs are present in every file, so I want to write a line to a new file only if its first, second and third columns are found in all three files. I can then sort the result afterwards. I've written a Perl script for this, but I would be embarrassed to show it, and it will take days to run. Is there a command-line one-liner or a quick program that can do this?
FileA
IWGSC_CSS_5AS_scaff_1501710 0 10000 229 3
IWGSC_CSS_5AS_scaff_1501710 10000 16194 206 2
IWGSC_CSS_4BL_scaff_7036768 0 10000 270 4
FileB
IWGSC_CSS_5AS_scaff_1501710 0 10000 229 3
IWGSC_CSS_4BL_scaff_7036768 0 10000 170 4
FileC
IWGSC_CSS_4BL_scaff_7036768 0 10000 370 4
Final file
IWGSC_CSS_4BL_scaff_7036768 0 10000 270 170 370
If you really wanted a command line option, then you could merge the first 3 columns and then use the join command. Practically speaking, that'd end up being a few lines (you could do it on one, but it'd be really long and overly complicated), so you'd have to decide if that's acceptable or not. The file could then be reformatted simply with sed.
I don't clearly understand the file format and the task
is this correct?
do the files contain many lines like this?
is the id unique?
The first column is a contig of some sort, so it's not unique (cf. the first two lines of the example).
Thanks all for the responses. Yes, the contig is not unique: there are different start and stop positions, because the coverage has been calculated over 10k windows and many contigs are larger than 10k.
I'll have a look at merging the columns, but do you have an example of how to use join on files that contain different contigs, so that lines which don't match are not joined? I think I need some kind of match, then to print columns 1, 2 and 3, followed by column 4 from each of the three files.
I'll expand upon the question to make it more clear of the final output.
In the new question why
is not in the output since it is found on both file a and b?
It's not in fileC, reread the question.
I only want the fourth column from each file (the fifth is redundant data), combined into a single entry for the contig and the start and stop positions over which the coverage was calculated.
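For what it's worth, here is a minimal Python sketch of the task as described above, as an alternative to the join route. It is not the poster's Perl script. It assumes whitespace-separated columns, that each (contig, start, stop) triple appears at most once per file (a repeated triple keeps the last coverage value seen), and that all the files fit in memory. The function names read_coverage and merge_coverage are made up for illustration.

```python
import sys


def read_coverage(path):
    """Map (contig, start, stop) -> coverage (4th column) for one file."""
    coverage = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            # Assumption: a triple occurs once per file; later lines overwrite earlier ones.
            coverage[(fields[0], fields[1], fields[2])] = fields[3]
    return coverage


def merge_coverage(paths):
    """Return rows (contig, start, stop, cov1, cov2, ...) for triples present in every file."""
    tables = [read_coverage(path) for path in paths]
    if not tables:
        return []
    # Keys shared by all files; iterating a dict yields its keys.
    shared = set(tables[0]).intersection(*tables[1:])
    # Walk the first file's keys so the output follows its line order.
    return [
        key + tuple(table[key] for table in tables)
        for key in tables[0]
        if key in shared
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    for row in merge_coverage(sys.argv[1:]):
        print(" ".join(row))
```

Saved under a name of your choice and run with the three files as arguments, this prints one line per shared window, with the coverage values in the order the files were given. Each file is read once and lookups are by dictionary key, so the run time should grow roughly linearly with the total number of lines.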