Question

Filter A TSV File Content Based On Content From Other TSV Files?

2

11.8 years ago

Egon Willighagen 5.4k

I have three TSV files, each with two columns. File one has gene name and effect, another probe and effect, and the third gene and probe. The first is by far the largest, and I want to visualize the full network in Cytoscape. Now, the first file has about 10M edges, and Cytoscape has trouble loading that.

However, the other files are much smaller, and in particular, the number of probes is relatively small. So, I am looking for a command line tool that keeps only those lines (interactions) from the geneName-effect file for which the gene is found in the third file and the effect is found in the second.

What command line setup can I use for this, or do I need to hack something up in Perl, Python, or Groovy?

• 3.5k views

updated 11.8 years ago by Obi Griffith 20k • written 11.8 years ago by Egon Willighagen 5.4k

score 3 · Answer 1 · 2012-06-25

11.8 years ago

Sean Davis 26k

You might take a look at reading the data into R and then using the merge() function. As a bonus, you can use the RCytoscape package to talk directly to Cytoscape.

score 2 · Answer 2 · 2012-06-25

The *nix join command will do it. Run two separate commands: one to join file 1 to file 3 (on gene) and one to join it to file 2 (on effect). Command line options let you specify the column delimiter (-t) and the column to match in each file (-1 and -2). By default, non-matching lines are not included. Note that each file must first be sorted on the column you join on. That makes Pierre's DB solution the better one, but I would say these command line options might be faster and "good enough".
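For anyone who would rather script it than sort and join, here is a minimal Python sketch of the same "drop non-matching lines" idea. It is a set-lookup filter, not the join command itself, so it needs no sorting. The function names are made up for illustration, and the column layout (gene/effect, gene/probe, probe/effect, tab-separated, no header handling) is assumed from the question.

```python
def load_column(path, column):
    """Collect the distinct values of one 0-based column of a TSV file."""
    values = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > column:
                values.add(fields[column])
    return values


def filter_gene_effect(gene_effect_path, gene_probe_path, probe_effect_path, out_path):
    """Write the gene-effect lines whose gene and effect both occur in the small files.

    Returns the number of lines kept.
    """
    genes = load_column(gene_probe_path, 0)      # assumed layout: gene <TAB> probe
    effects = load_column(probe_effect_path, 1)  # assumed layout: probe <TAB> effect
    kept = 0
    # The large file is streamed line by line, so it never has to fit in memory.
    with open(gene_effect_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0] in genes and fields[1] in effects:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept
```

Calling filter_gene_effect("GeneEffect.txt", "GeneProbe.txt", "ProbeEffect.txt", "results.txt") would write the reduced edge list to results.txt. Only the two small files are held in memory, which should suit the case where the first file is the large one.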

score 0 · Answer 3 · 2012-06-25

11.8 years ago

Pierre Lindenbaum 161k

Can you import the 3 files into 3 indexed tables in a database (for example, sqlite3) and then select/join the 3 tables?

0

How would I do that? I have no sqlite3 experience... can that easily import TSV files then?

Reply • 11.8 years ago by Egon Willighagen 5.4k
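One possible way to try the sqlite3 suggestion without learning the sqlite3 shell is Python's built-in sqlite3 module. The sketch below is only an illustration under assumptions: the table, column, and function names are invented here, the files are taken to be tab-separated with two columns and no header row, and the selection uses IN subqueries rather than a literal three-table join because that matches the filter described in the question.

```python
import sqlite3


def load_tsv(conn, table, path, columns):
    """Create a two-column table and bulk-insert the rows of a TSV file."""
    col_a, col_b = columns
    conn.execute(f"CREATE TABLE {table} ({col_a} TEXT, {col_b} TEXT)")
    with open(path, encoding="utf-8") as handle:
        # Keep the first two tab-separated fields; skip lines without a tab.
        rows = (line.rstrip("\n").split("\t")[:2] for line in handle if "\t" in line)
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


def filtered_edges(gene_effect_path, gene_probe_path, probe_effect_path):
    """Return (gene, effect) rows whose gene and effect both occur in the small files."""
    conn = sqlite3.connect(":memory:")  # a file name here would keep the data on disk
    load_tsv(conn, "gene_effect", gene_effect_path, ("gene", "effect"))
    load_tsv(conn, "gene_probe", gene_probe_path, ("gene", "probe"))
    load_tsv(conn, "probe_effect", probe_effect_path, ("probe", "effect"))
    # Index the lookup columns of the two small tables.
    conn.execute("CREATE INDEX idx_gene_probe_gene ON gene_probe (gene)")
    conn.execute("CREATE INDEX idx_probe_effect_effect ON probe_effect (effect)")
    query = """
        SELECT ge.gene, ge.effect
        FROM gene_effect AS ge
        WHERE ge.gene IN (SELECT gene FROM gene_probe)
          AND ge.effect IN (SELECT effect FROM probe_effect)
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows
```

filtered_edges("GeneEffect.txt", "GeneProbe.txt", "ProbeEffect.txt") returns a list of (gene, effect) tuples that can then be written back out as TSV. With about 10M edges an in-memory database may get large, so connecting to a database file instead is probably the safer choice.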

score 0 · Answer 4 · 2012-06-27

Another solution in R is to use the '%in%' operator to compare lists and find the overlap. See the example code:

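# Assumes each file has a header row naming its columns (Gene, Effect, Probe)
setwd("~/test")
GeneEffect <- read.table("GeneEffect.txt", header=TRUE, sep="\t", quote="", stringsAsFactors=FALSE)
GeneProbe <- read.table("GeneProbe.txt", header=TRUE, sep="\t", quote="", stringsAsFactors=FALSE)
ProbeEffect <- read.table("ProbeEffect.txt", header=TRUE, sep="\t", quote="", stringsAsFactors=FALSE)
# Keep rows whose gene occurs in GeneProbe and whose effect occurs in ProbeEffect
keep <- GeneEffect$Gene %in% GeneProbe$Gene & GeneEffect$Effect %in% ProbeEffect$Effect
GeneEffectFiltered <- GeneEffect[keep, ]
write.table(GeneEffectFiltered, file="results.txt", sep="\t", quote=FALSE, row.names=FALSE)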