Script for extracting data from a tab-delimited file
4
0
Entering edit mode
9.0 years ago
vahapel ▴ 210

Dear All,

I have a number of accessions deposited in (i) a text file and (ii) a tab-delimited file;

(i)

calc.388956
calc.254634
calc.333221
calc.333250
..

(ii)

calc.388956    ENSDARG00000001015|ENSDART00000001148|1791
calc.388956    ENSDARG00000001015|ENSDART00000001148|1791
calc.254634    ENSDARG00000000966|ENSDART00000001092|6325
calc.114655    ENSDARG00000006612|ENSDART00000002247|2167 #not found in accession list
calc.333221    ENSDARG00000017121|ENSDART00000002309|3174
calc.333250    ENSDARG00000004697|ENSDART00000002554|1819
calc.329445    ENSDARG00000007812|ENSDART00000002247|2163
..

I am looking for a script for comparing these two files and find similar the accession numbers (i) in the tabular file (ii) and report the corresponding rows.

I used a script

grep -wFf accession.txt tabular_file.txt > result.txt

but, I think it is not working.

Thank you for all your help!

sequence • 7.6k views
ADD COMMENT
6
Entering edit mode
9.0 years ago
5heikki 11k
join -1 1 -2 1 -o 1.1,1.2 -t "literalTab" <(sort -k1,1 yourFileWithTwoColumns) <(sort -k1,1 yourFileWithOneColumn) > output

Join on the basis of the first column of file1 (-1 1) and the first column of file2 (-2 1) and output fields 1 and 2 from first file (the rows where first field matched to the first field of the second file). Literal tab is ctrl + v + tab. If your files are already sorted by first field you can skip the sort commands and just give the files as input. Join is orders of magnitude faster than grep when you have two large files as with grep you're going through the second file as many times as there are lines in the first file. If you don't mind creating temp files the above command can maybe be a little bit faster still if you parallelize the sort commands:

sort -k1,1 yourFileWithTwoColumns > yourFileWithTwoColumnsSorted.tmp & \
sort -k1,1 yourFileWithOneColumn > yourFileWithOneColumnSorted.tmp && \
join -1 1 -2 1 -o 1.1,1.2 -t "literalTab" yourFileWithTwoColumnsSorted.tmp yourFileWithOneColumnSorted.tmp > output && \
rm yourFileWithTwoColumnSorted.tmp yourFileWithOneColumnSorted.tmp
ADD COMMENT
5
Entering edit mode
9.0 years ago

The grep command you tried is likely working, but slowly. You could alternatively load both files into R and use inner_join() from the dplyr package. That's relatively quick. Writing a small program to do this in python would also be quite quick. In fact, it'd be something like:

#!/usr/bin/env python
file1="something.txt"
file2="something_else.txt"
IDs = set()

for line in open(file1,"r") :
    IDs.add(line.strip())

for line in open(file2,"r") :
    elements = line.strip().split("\t")
    if elements[0] in IDs :
        print "%s\t%s" % (elements[0], elements[1])

What I wrote above is a total hack, but something along those lines would work.

ADD COMMENT
0
Entering edit mode

I had a similar problem before, and Grep works, and several ways to speed up grep: set LC_ALL=C, use fgrep

http://crazyhottommy.blogspot.com/search/label/grep

ADD REPLY
2
Entering edit mode
9.0 years ago
h.mon 35k

The following R script should do what you want:

#!/usr/bin/Rscript
args <- commandArgs(TRUE)
accession <- read.table(args[1], sep="\t", header = F)
tab.file <- read.table(args[2], sep="\t", header = F)
row.indices <- which(tab.file[,1] %in% accession[,1])
write.table(row.indices,file=args[3],quote=F,row.names=F,col.names=F)

just call it with:

./extract.R accession.txt tabular_file.txt out.txt

of course, you should have R installed, and you should not use this if your files are big ("big" depending on how much RAM do you have).

ADD COMMENT
2
Entering edit mode
9.0 years ago
venu 7.1k

Change the flag to

grep -Fwf accession.txt tabular_file.txt > result
ADD COMMENT

Login before adding your answer.

Traffic: 3287 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6