Question

tabix for ID column

1

Entering edit mode

6.8 years ago

finswimmer 16k

Hello,

I'm looking for something similar to tabix. But instead of looking for informations within a given region, I would like to use the values in the ID column for quickly lookup.

So for example I would like to take the compressed dbSNP file, index it by the ID column and than search quickly for the line with a given rs number.

Do you know any program or way that can do this?

fin swimmer

snp tabix • 4.1k views

ADD COMMENT • link updated 2.6 years ago by Shicheng Guo ★ 9.4k • written 6.8 years ago by finswimmer 16k

0

Entering edit mode

Don't overthink it. Link gzip and gunzip to pigz. Then:

$ gzip -c snps.vcf > snps.gz
$ zgrep rs1234 snps.gz > answer.txt

A database is way overkill.

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

0

Entering edit mode

tabix has a --targets option..

ADD REPLY • link 2.6 years ago by Shicheng Guo ★ 9.4k

score 2 · Answer 1 · 2017-06-23

2

Entering edit mode

6.8 years ago

Pierre Lindenbaum 161k

create a text file filled with RSID CHROM POS and sort on RSID.

awk '/^#/ {next;} ($3==".") {next;} {OFS="\t";print $3,$1,$2;}' input.vcf | sort -k1,1 > index.txt

when querying , use join build an interval and query with tabix

join -t $'\t' -1 1 -2 1 <(echo "rs12345") index.txt  |\
awk '{printf("%s:%s-%s\n",$2,$3,$3);}' |\
while read P;
  do tabix input.vcf "$P" 
done

another solution: put those ID/CHROM/POS in a sqlite3 database and query it the same way.

sqlite3 index.db 'select chrom,position from myindex where rsID="rs17625"' | awk -F '|' '{printf("%s:%s-%s\n",$1,$1,$1);}' |\
    while read P;
      do tabix input.vcf "$P" 
    done

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Is join going to be faster then a sequential scan using e.g. grep rs1234 or grep -f rsid.txt?

ADD REPLY • link 6.8 years ago by dariober 14k

1

Entering edit mode

yes because join optimize the fact that the file is sorted: it will ignore everything before rs1234 and stop after rs1234.

of course , for speed you would use:

LC_ALL=C grep -w  '^rs1234'

but you don't know is there is more than one occurence of rs1234 in the file (it happens)

grep -w -f rsid.txt

would be very slow compared to

join -t $'\t' -1 1 -2 1  <(sort rsid.txt) index.txt

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Sorry for my late answer,

I didn't have the time to try it until now, but I will.

But I suspect the index file will be huge if it is uncrompressed.

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

score 1 · Answer 2 · 2017-06-23

1

Entering edit mode

6.8 years ago

dariober 14k

You could put your table in a sqlite database and index the ID column. Then retrieving records by ID will be very fast. However, this would require working with a sqlite file and associated SQL syntax rather than with the usual unix/R/python tools.

ADD COMMENT • link 6.8 years ago by dariober 14k

0

Entering edit mode

Hello,

yes using a database could be a way. But I'm afraid that the database file will be very huge. So working with a compressed file would be very nice.

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

score 1 · Answer 3 · 2018-05-04

Hello,

long time ago I started this topic. Unfortunately I cannot comment on the solutions mentioned here because I dropped my idea for what I need it and so never tested it.

But today I found a nice trick on github of htslib. The person who posted their uses "rs" as contig name and the number as start and end position. This results in a file like this :

rs 76264143 76264143 chr1 982843 982844 G

You can than index by the first 3 columns and query like before. Nice!

fin swimmer

score 0 · Answer 4 · 2018-05-04

Another option is a fast binary search on sorted data. No indexing needed, because the sort order is the index.

1) Get SNPs and write them into a text file sorted by SNP ID. For hg19, for instance:

$ LC_ALL=C
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz \
   | gunzip -c \
   | vcf2bed --do-not-sort \
   | awk -v OFS="\t" '{ print $4, $2, $3, ($3+1), $6"/"$7; }' \
   | sort -k1,1 \
   > hg19.snp150.sortedByName.txt

2) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

3) Run a query:

$ rs_of_interest=rs12345
$ ./pts-line-bisect/pts_lbsearch -p hg19.snp150.sortedByName.txt ${rs_of_interest} | head -1 | cut -f2-