Filter A Tsv File Content Based On Content From Other Tsv Files?
4
2
Entering edit mode
12.4 years ago

I have three TSV files, each with two columns. File one has gene name and effect, another probe and effect, and the third gene and probe. The first is by far the largest, and I want to visualize the full network in Cytoscape. Now, the first file has about 10M edges, and Cytoscape has trouble loading that.

However, the other files are much smaller, and in particular, the number of probes is relatively small. So, I am seeking a command line tool that can filter only lines (interactions) from the geneName-effect file for which the gene is found in the third file, and the effect in the second.

What command line set up can I use for this, or do I need to hack something up for this in a Perl, Python, or Groovy?

• 3.7k views
ADD COMMENT
3
Entering edit mode
12.4 years ago

You might take a look at reading the data into R and then using the merge() function. As a bonus, you can use the RCytoscape package to talk directly to cytoscape....

ADD COMMENT
2
Entering edit mode
12.4 years ago
Andrew Su 4.9k

The *nix join command will do it. Do two separate commands to join file 1 to files 2 and 3. Command line options allow you to specify the column delimiters and the column index to match in each file. By default, non-matching lines are not included. Also note that you need to sort each file first. So Pierre's DB solution is better, but I would say these command line options might be faster and "good enough"...

ADD COMMENT
0
Entering edit mode
12.4 years ago

can you import the 3 files in 3 databases indexed in (for example) sqlite3 and select/join the 3 tables ?

ADD COMMENT
0
Entering edit mode

How would I do that? I have no sqlite3 experience... can that easily import TSV files then?

ADD REPLY
0
Entering edit mode
12.4 years ago

Another solution in R is to use the '%in%' to compare lists and find overlap. See example code:

setwd("~/test")
GeneEffect=read.table(file="GeneEffect.txt", header=TRUE, sep="\t")
GeneProbe=read.table(file="GeneProbe.txt", header=TRUE, sep="\t")
ProbeEffect=read.table(file="ProbeEffect.txt", header=TRUE, sep="\t")
GeneEffectFiltered=GeneEffect[which(GeneEffect[,"Gene"] %in% GeneProbe[,"Gene"] & GeneEffect[,"Effect"] %in% ProbeEffect[,"Effect"]),]
write.table(GeneEffectFiltered, file="results.txt", sep="\t", quote=FALSE, row.names=FALSE)
ADD COMMENT

Login before adding your answer.

Traffic: 786 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6