Question: match three columns in three files and print to one
0
gravatar for rob234king
6.8 years ago by
rob234king600
UK/Harpenden/Rothamsted Research
rob234king600 wrote:

I have three files with below format of contig, start position, stop position, coverage, other, example shown below. Not all the contigs are present in each file so I want to print to another file if the first, second and third column are found in each of the three files. I can then sort them afterwards. I've written a perl script for this but I would be embaressed to show it, I'm hopeful that there is a command line one liner that can do this or quick program? as mine will take days.

FileA

IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3
IWGSC_CSS_5AS_scaff_1501710     10000   16194   206     2
IWGSC_CSS_4BL_scaff_7036768     0       10000   270     4

FileB

IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3
IWGSC_CSS_4BL_scaff_7036768     0       10000   170     4

FileC

IWGSC_CSS_4BL_scaff_7036768     0       10000   370     4

 

Final file

IWGSC_CSS_4BL_scaff_7036768     0       10000       270      170      370

command line perl • 2.3k views
ADD COMMENTlink modified 6.8 years ago by Devon Ryan98k • written 6.8 years ago by rob234king600

If you really wanted a command line option, then you could merge the first 3 columns

awk '{printf("%s:%s:%s\t%s\t%s\n", $1,$2,$3,$4,$5)}' foo.txt > foo.merged.txt

and then use the join command. Practically speaking, that'd end up being a few lines (you could do it on one, but it'd be really long and overly complicated), so you'd have to decide if that's acceptable or not. The file could then be reformatted simply with sed.

ADD REPLYlink modified 14 months ago by Ram32k • written 6.8 years ago by Devon Ryan98k

I dont clearly understand the file format and the task

IWGSC_CSS_5AS_scaff_1501710(id)    0(start)       10000(end)   229(coverage)     3(other)

is this correct?

do the files contain many lines like this?

does the id is unique?

ADD REPLYlink modified 14 months ago by Ram32k • written 6.8 years ago by mostafa.shokrof0

The first column is a contig of some sort, so it's not unique (cf. the first two lines of the example).

ADD REPLYlink written 6.8 years ago by Devon Ryan98k

thanks all for the response, yes the contig is not unique as there are different start and stop positions as the coverage has been calculated over a 10k window and many contigs larger than 10k.

I'll have a look at merging the columns but do you have an example please how to use join to merge the files of which have different contigs in so not joining the same line (need some kind of match I think then print columns 1,2,3 then 4 from each of the three files.

I'll expand upon the question to make it more clear of the final output.

ADD REPLYlink modified 14 months ago by Ram32k • written 6.8 years ago by rob234king600

In the new question why

IWGSC_CSS_5AS_scaff_1501710     0       10000   229     3

is not in the output since it is found on both file a and b?

ADD REPLYlink modified 14 months ago by Ram32k • written 6.8 years ago by mostafa.shokrof0
1

It's not in fileC, reread the question.

ADD REPLYlink written 6.8 years ago by Devon Ryan98k

I only want the fourth column from each file, the fifth is redundant data, in a single entry that corresponds to the contig and start and stop position to which the coverage was calculated.

ADD REPLYlink written 6.8 years ago by rob234king600
4
gravatar for Devon Ryan
6.8 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

Here's an example of a multi-liner, which could be one line but would be overly complicated:

#Merge the first 3 columns of each file
awk '{printf("%s:%s:%s\t%s\t%s\n", $1,$2,$3,$4,$5)}' A.txt > A.merged.txt
awk '{printf("%s:%s:%s\t%s\t%s\n", $1,$2,$3,$4,$5)}' B.txt > B.merged.txt
awk '{printf("%s:%s:%s\t%s\t%s\n", $1,$2,$3,$4,$5)}' C.txt > C.merged.txt
#Perform the join`. The "<(sort foo.merged)" is just process substitution
#and "-" mean "standard input"
#You could write everything to files if it makes things clearer
join -j 1 <(sort A.merged) <(sort B.merged) | join -j 1 - <(sort C.merged) | sed 's/:/\t/g' | sed 's/ /\t/g' | cut -f 1,2,3,4,6,8

sed 's/:/\t/g' reverts the awk lines above. Apparently join mucks with the formatting, so I replaced the spaces it adds (I'm sure there's an option to do this automatically, but I'm too lazy to look it up) with tabs via sed 's/ /\t/g'.

ADD COMMENTlink modified 14 months ago by Ram32k • written 6.8 years ago by Devon Ryan98k

You have saved me hours of messing about trying to do this, thanks very much!

ADD REPLYlink written 6.8 years ago by rob234king600
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1811 users visited in the last hour
_