Question: Script for extracting data from a tab-delimited file
gravatar for vahapel
2.4 years ago by
vahapel150 wrote:

Dear All,

I have a number of accessions deposited in (i) a text file  and  (ii) a tab-delimited file;





calc.388956    ENSDARG00000001015|ENSDART00000001148|1791
calc.388956    ENSDARG00000001015|ENSDART00000001148|1791
calc.254634    ENSDARG00000000966|ENSDART00000001092|6325
calc.114655    ENSDARG00000006612|ENSDART00000002247|2167 (not found in accession list)
calc.333221    ENSDARG00000017121|ENSDART00000002309|3174
calc.333250    ENSDARG00000004697|ENSDART00000002554|1819
calc.329445    ENSDARG00000007812|ENSDART00000002247|2163

I am looking for a script for comparing these two files and find similar the accession numbers (i) in the tabular file (ii) and report the corresponding rows. 

I used a script;

grep -wFf accession.txt tabular_file.txt > result.txt

but, I think it is not working. 

Thank you for all your help !


sequence • 2.1k views
ADD COMMENTlink modified 2.4 years ago by venu4.3k • written 2.4 years ago by vahapel150
gravatar for 5heikki
2.4 years ago by
5heikki6.6k wrote:
join -1 1 -2 1 -o 1.1,1.2 -t "literalTab" <(sort -k1,1 yourFileWithTwoColumns) <(sort -k1,1 yourFileWithOneColumn) > output

Join on the basis of the first column of file1 (-1 1) and the first column of file2 (-2 1) and output fields 1 and 2 from first file (the rows where first field matched to the first field of the second file). Literal tab is ctrl + v + tab. If your files are already sorted by first field you can skip the sort commands and just give the files as input. Join is orders of magnitude faster than grep when you have two large files as with grep you're going through the second file as many times as there are lines in the first file. If you don't mind creating temp files the above command can maybe be a little bit faster still if you parallelize the sort commands:

sort -k1,1 yourFileWithTwoColumns > yourFileWithTwoColumnsSorted.tmp & sort -k1,1 yourFileWithOneColumn > yourFileWithOneColumnSorted.tmp && join -1 1 -2 1 -o 1.1,1.2 -t "literalTab" yourFileWithTwoColumnsSorted.tmp yourFileWithOneColumnSorted.tmp > output && rm yourFileWithTwoColumnSorted.tmp yourFileWithOneColumnSorted.tmp
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by 5heikki6.6k
gravatar for Devon Ryan
2.4 years ago by
Devon Ryan70k
Freiburg, Germany
Devon Ryan70k wrote:

The grep command you tried is likely working, but slowly. You could alternatively load both files into R and use inner_join() from the dplyr package. That's relatively quick. Writing a small program to do this in python would also be quite quick. In fact, it'd be something like:

#!/usr/bin/env python
IDs = set()

for line in open(file1,"r") :

for line in open(file2,"r") :
    elements = line.strip().split("\t")
    if elements[0] in IDs :
        print "%s\t%s" % (elements[0], elements[1])

What I wrote above is a total hack, but something along those lines would work.

ADD COMMENTlink written 2.4 years ago by Devon Ryan70k

I had a similar problem before, and Grep works, and several ways to speed up grep: set LC_ALL=C, use fgrep

ADD REPLYlink written 2.4 years ago by tangming20052.1k
gravatar for h.mon
2.4 years ago by
h.mon8.6k wrote:

the following R script should do what you want:

    args <- commandArgs(TRUE)
    accession <- read.table(args[1], sep="\t", header = F)
    tab.file <- read.table(args[2], sep="\t", header = F)
    row.indices <- which(tab.file[,1] %in% accession[,1])

just call it with:

    ./extract.R accession.txt tabular_file.txt out.txt

of course, you should have R installed, and you should not use this if your files are big ("big" depending on how much RAM do you have).

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by h.mon8.6k
gravatar for venu
2.4 years ago by
venu4.3k wrote:
Change the flag to
grep -Fwf accession.txt tabular_file.txt > result


ADD COMMENTlink written 2.4 years ago by venu4.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1385 users visited in the last hour