Question

Script for extracting data from a tab-delimited file

0

Entering edit mode

9.0 years ago

vahapel ▴ 210

Dear All,

I have a number of accessions deposited in (i) a text file and (ii) a tab-delimited file;

(i)

calc.388956
calc.254634
calc.333221
calc.333250
..

(ii)

calc.388956    ENSDARG00000001015|ENSDART00000001148|1791
calc.388956    ENSDARG00000001015|ENSDART00000001148|1791
calc.254634    ENSDARG00000000966|ENSDART00000001092|6325
calc.114655    ENSDARG00000006612|ENSDART00000002247|2167 #not found in accession list
calc.333221    ENSDARG00000017121|ENSDART00000002309|3174
calc.333250    ENSDARG00000004697|ENSDART00000002554|1819
calc.329445    ENSDARG00000007812|ENSDART00000002247|2163
..

I am looking for a script for comparing these two files and find similar the accession numbers (i) in the tabular file (ii) and report the corresponding rows.

I used a script

grep -wFf accession.txt tabular_file.txt > result.txt

but, I think it is not working.

Thank you for all your help!

sequence • 7.6k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by vahapel ▴ 210

Ram · Accepted Answer · 2015-05-07

join -1 1 -2 1 -o 1.1,1.2 -t "literalTab" <(sort -k1,1 yourFileWithTwoColumns) <(sort -k1,1 yourFileWithOneColumn) > output

Join on the basis of the first column of file1 (-1 1) and the first column of file2 (-2 1) and output fields 1 and 2 from first file (the rows where first field matched to the first field of the second file). Literal tab is ctrl + v + tab. If your files are already sorted by first field you can skip the sort commands and just give the files as input. Join is orders of magnitude faster than grep when you have two large files as with grep you're going through the second file as many times as there are lines in the first file. If you don't mind creating temp files the above command can maybe be a little bit faster still if you parallelize the sort commands:

sort -k1,1 yourFileWithTwoColumns > yourFileWithTwoColumnsSorted.tmp & \
sort -k1,1 yourFileWithOneColumn > yourFileWithOneColumnSorted.tmp && \
join -1 1 -2 1 -o 1.1,1.2 -t "literalTab" yourFileWithTwoColumnsSorted.tmp yourFileWithOneColumnSorted.tmp > output && \
rm yourFileWithTwoColumnSorted.tmp yourFileWithOneColumnSorted.tmp

Ram · Accepted Answer · 2015-05-07

5

Entering edit mode

9.0 years ago

Devon Ryan 104k

The grep command you tried is likely working, but slowly. You could alternatively load both files into R and use inner_join() from the dplyr package. That's relatively quick. Writing a small program to do this in python would also be quite quick. In fact, it'd be something like:

#!/usr/bin/env python
file1="something.txt"
file2="something_else.txt"
IDs = set()

for line in open(file1,"r") :
    IDs.add(line.strip())

for line in open(file2,"r") :
    elements = line.strip().split("\t")
    if elements[0] in IDs :
        print "%s\t%s" % (elements[0], elements[1])

What I wrote above is a total hack, but something along those lines would work.

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by Devon Ryan 104k

0

Entering edit mode

I had a similar problem before, and Grep works, and several ways to speed up grep: set LC_ALL=C, use fgrep

http://crazyhottommy.blogspot.com/search/label/grep

ADD REPLY • link 9.0 years ago by Ming Tommy Tang ★ 3.9k

Ram · Accepted Answer · 2015-05-08

The following R script should do what you want:

#!/usr/bin/Rscript
args <- commandArgs(TRUE)
accession <- read.table(args[1], sep="\t", header = F)
tab.file <- read.table(args[2], sep="\t", header = F)
row.indices <- which(tab.file[,1] %in% accession[,1])
write.table(row.indices,file=args[3],quote=F,row.names=F,col.names=F)

just call it with:

./extract.R accession.txt tabular_file.txt out.txt

of course, you should have R installed, and you should not use this if your files are big ("big" depending on how much RAM do you have).

Ram · Accepted Answer · 2015-05-08

2

Entering edit mode

9.0 years ago

venu 7.1k

Change the flag to

grep -Fwf accession.txt tabular_file.txt > result

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by venu 7.1k