tabix for ID column
4
1
Entering edit mode
6.8 years ago

Hello,

I'm looking for something similar to tabix. But instead of looking for informations within a given region, I would like to use the values in the ID column for quickly lookup.

So for example I would like to take the compressed dbSNP file, index it by the ID column and than search quickly for the line with a given rs number.

Do you know any program or way that can do this?

fin swimmer

snp tabix • 4.1k views
ADD COMMENT
0
Entering edit mode

Don't overthink it. Link gzip and gunzip to pigz. Then:

$ gzip -c snps.vcf > snps.gz
$ zgrep rs1234 snps.gz > answer.txt

A database is way overkill.

ADD REPLY
0
Entering edit mode

tabix has a --targets option..

ADD REPLY
2
Entering edit mode
6.8 years ago

create a text file filled with RSID CHROM POS and sort on RSID.

awk '/^#/ {next;} ($3==".") {next;} {OFS="\t";print $3,$1,$2;}' input.vcf | sort -k1,1 > index.txt

when querying , use join build an interval and query with tabix

join -t $'\t' -1 1 -2 1 <(echo "rs12345") index.txt  |\
awk '{printf("%s:%s-%s\n",$2,$3,$3);}' |\
while read P;
  do tabix input.vcf "$P" 
done

another solution: put those ID/CHROM/POS in a sqlite3 database and query it the same way.

sqlite3 index.db 'select chrom,position from myindex where rsID="rs17625"' | awk -F '|' '{printf("%s:%s-%s\n",$1,$1,$1);}' |\
    while read P;
      do tabix input.vcf "$P" 
    done
ADD COMMENT
0
Entering edit mode

Is join going to be faster then a sequential scan using e.g. grep rs1234 or grep -f rsid.txt?

ADD REPLY
1
Entering edit mode

yes because join optimize the fact that the file is sorted: it will ignore everything before rs1234 and stop after rs1234.

of course , for speed you would use:

LC_ALL=C grep -w  '^rs1234'

but you don't know is there is more than one occurence of rs1234 in the file (it happens)

grep -w -f rsid.txt

would be very slow compared to

join -t $'\t' -1 1 -2 1  <(sort rsid.txt) index.txt
ADD REPLY
0
Entering edit mode

Sorry for my late answer,

I didn't have the time to try it until now, but I will.

But I suspect the index file will be huge if it is uncrompressed.

fin swimmer

ADD REPLY
1
Entering edit mode
6.8 years ago

You could put your table in a sqlite database and index the ID column. Then retrieving records by ID will be very fast. However, this would require working with a sqlite file and associated SQL syntax rather than with the usual unix/R/python tools.

ADD COMMENT
0
Entering edit mode

Hello,

yes using a database could be a way. But I'm afraid that the database file will be very huge. So working with a compressed file would be very nice.

fin swimmer

ADD REPLY
1
Entering edit mode
6.0 years ago

Hello,

long time ago I started this topic. Unfortunately I cannot comment on the solutions mentioned here because I dropped my idea for what I need it and so never tested it.

But today I found a nice trick on github of htslib. The person who posted their uses "rs" as contig name and the number as start and end position. This results in a file like this :

rs 76264143 76264143 chr1 982843 982844 G

You can than index by the first 3 columns and query like before. Nice!

fin swimmer

ADD COMMENT
0
Entering edit mode
6.0 years ago

Another option is a fast binary search on sorted data. No indexing needed, because the sort order is the index.

1) Get SNPs and write them into a text file sorted by SNP ID. For hg19, for instance:

$ LC_ALL=C
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz \
   | gunzip -c \
   | vcf2bed --do-not-sort \
   | awk -v OFS="\t" '{ print $4, $2, $3, ($3+1), $6"/"$7; }' \
   | sort -k1,1 \
   > hg19.snp150.sortedByName.txt

2) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

3) Run a query:

$ rs_of_interest=rs12345
$ ./pts-line-bisect/pts_lbsearch -p hg19.snp150.sortedByName.txt ${rs_of_interest} | head -1 | cut -f2-
ADD COMMENT

Login before adding your answer.

Traffic: 2442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6